RMI-PACTA / resources

This is a place to explore and share resources. Check out the "Issues".
https://rmi-pacta.github.io/resources/

GitHub action to check for non-binary, non-utf-8, and non-ascii #53

Open maurolepore opened 5 years ago

maurolepore commented 5 years ago

Every time a push is made, it searches the entire repo for files that are non-binary, non-UTF-8, and non-ASCII, and fails with a list of those files if it finds any (otherwise it passes).

-- @cjyetman

I'll keep the thread history below because the development over time might be interesting/instructive, but this is the version I would advise using...

name: check file encodings in PR
on: [pull_request]
jobs:
  file-encoding:
    name: file encoding check
    runs-on: ubuntu-latest

    steps:
      - name: run the checkout action
        uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - name: list all changed files
        run: |
          files=$(git diff --name-only origin/$GITHUB_BASE_REF...${{ github.sha }})
          IFS=$'\n'; files=($files); unset IFS;  # split the string into an array
          file --mime "${files[@]}"
      - name: list all changed files with the wrong encoding
        run: |
          files=$(git diff --name-only origin/$GITHUB_BASE_REF...${{ github.sha }})
          IFS=$'\n'; files=($files); unset IFS;  # split the string into an array
          ! file --mime "${files[@]}" | grep -v "charset=utf-8\|charset=us-ascii\|charset=binary\| (No such file or directory)$"
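
The pass/fail logic of the last step can be tried locally without a runner. The sketch below is only an illustration: check_charsets is a hypothetical helper, the input lines are canned stand-ins for real file --mime output, and the three -e patterns are a portable spelling of the workflow's "\|" alternation (the deleted-file pattern is left out for brevity).

```shell
# Hypothetical helper: reads `file --mime` output on stdin.
# grep -v prints every line that does NOT report an allowed charset and
# exits 0 if it printed anything; the leading ! flips that, so the helper
# succeeds only when no offending line was found.
check_charsets() {
  ! grep -v -e "charset=utf-8" -e "charset=us-ascii" -e "charset=binary"
}

# Canned input; the file names are made up for illustration.
if printf '%s\n' \
  "README.md: text/plain; charset=us-ascii" \
  "logo.png: image/png; charset=binary" | check_charsets
then good_status=pass; else good_status=fail; fi

if printf '%s\n' \
  "README.md: text/plain; charset=us-ascii" \
  "notes.txt: text/plain; charset=iso-8859-1" | check_charsets
then bad_status=pass; else bad_status=fail; fi

echo "good: $good_status, bad: $bad_status"
```

The second call also prints the offending notes.txt line, which is what shows up in the job log when the step fails.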

This was the originally proposed version...

name: CI

on: [push]

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
      - name: run the checkout action
        uses: actions/checkout@v1
      - name: list the directory
        run: echo '! find $GITHUB_WORKSPACE -type f -exec file --mime {} \; | grep -v "charset=binary$" | grep -v "charset=us-ascii$" | grep -v "charset=utf-8$"' | bash

--

Thanks @cjyetman

cjyetman commented 3 years ago

This is a bit improved: it only runs on a PR, and it only looks at files that changed between the PR branch and main/master. It could still use some improvement/optimization...

name: PR check encoding
on: [pull_request]
jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: run the checkout action
        uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - name: list all changed files
        run: |
          start=$(git rev-list HEAD | tail -1)
          end=$(git rev-list HEAD | head -1)
          files=$(git diff --name-only $start..$end)
          file --mime $files
      - name: list all changed files with the wrong encoding
        run: |
          start=$(git rev-list HEAD | tail -1)
          end=$(git rev-list HEAD | head -1)
          files=$(git diff --name-only $start..$end)
          ! file --mime $files | grep -v "charset=utf-8\|charset=us-ascii\|charset=binary"

cjyetman commented 3 years ago

Private repos seem to end up with a different git structure when checked out into the GitHub Actions runner, so this one is modified to check only the files changed in the last commit. That works OK for that commit, but it might mislead you into thinking everything in the PR is OK.

Putting the last shell command on its own line below run: | is critical. Otherwise the value starts with !, which YAML reads as the start of a tag rather than as shell negation.

name: PR check encoding
on: [pull_request]
jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: run the checkout action
        uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - name: list all changed files
        run: |
          file --mime $(git diff --name-only HEAD^1)
      - name: list all changed files with the wrong encoding
        run: |
          ! file --mime $(git diff --name-only HEAD^1) | grep -v "charset=utf-8\|charset=us-ascii\|charset=binary"

cjyetman commented 3 years ago

ok, I think I finally figured out how to do this properly using GitHub environment variables that define the git refs relevant to the PR that triggers this action...

name: check file encodings in PR
on: [pull_request]
jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: run the checkout action
        uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - name: list all changed files
        run: |
          files=$(git diff --name-only origin/$GITHUB_BASE_REF...origin/$GITHUB_HEAD_REF)
          file --mime $files
      - name: list all changed files with the wrong encoding
        run: |
          files=$(git diff --name-only origin/$GITHUB_BASE_REF...origin/$GITHUB_HEAD_REF)
          ! file --mime $files | grep -v "charset=utf-8\|charset=us-ascii\|charset=binary"
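
The three-dot range in the diff above compares the head against its merge base with the base branch, so commits that landed on the base branch after the PR branched off are not counted as PR changes. A minimal sketch in a throwaway repository, assuming git is installed; the branch names, file names, and the g wrapper are made up for illustration:

```shell
# Throwaway repository; every name here is hypothetical.
repo=$(mktemp -d)
g() {
  git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
      -c commit.gpgsign=false "$@"
}

g init -q
echo base > "$repo/a.txt";  g add a.txt; g commit -q -m "base"
g branch -M main                 # plays the role of origin/$GITHUB_BASE_REF
g checkout -q -b feature         # plays the role of the PR head
echo new > "$repo/b.txt";   g add b.txt; g commit -q -m "PR change"
g checkout -q main
echo other > "$repo/c.txt"; g add c.txt; g commit -q -m "base moves on"

three_dot=$(g diff --name-only main...feature)   # merge base vs. feature
two_dot=$(g diff --name-only main..feature)      # tip of main vs. tip of feature

echo "three-dot: $three_dot"
echo "two-dot:   $(echo $two_dot)"
```

Here the three-dot form lists only b.txt, while the two-dot form also lists c.txt, which the PR never touched.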

cjyetman commented 3 years ago

And this will make it work with filenames that contain spaces...

name: check file encodings in PR
on: [pull_request]
jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: run the checkout action
        uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - name: list all changed files
        run: |
          files=$(git diff --name-only origin/$GITHUB_BASE_REF...origin/$GITHUB_HEAD_REF)
          IFS=$'\n'; files=($files); unset IFS;  # split the string into an array
          file --mime "${files[@]}"
      - name: list all changed files with the wrong encoding
        run: |
          files=$(git diff --name-only origin/$GITHUB_BASE_REF...origin/$GITHUB_HEAD_REF)
          IFS=$'\n'; files=($files); unset IFS;  # split the string into an array
          ! file --mime "${files[@]}" | grep -v "charset=utf-8\|charset=us-ascii\|charset=binary"
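
To see why the array step matters, the sketch below counts how many arguments a command receives with and without it. It assumes bash (as on the ubuntu-latest runner); count_args and the file names are hypothetical.

```shell
# Hypothetical helper: report how many arguments it was given.
count_args() { echo "$#"; }

# Stand-in for `git diff --name-only` output: two paths, one with a space.
names=$'docs/read me.md\nsrc/main.R'

# Unquoted expansion: the shell splits on spaces as well as newlines.
unquoted=$(count_args $names)

# Split on newlines only, as the workflow does, then expand the quoted array.
IFS=$'\n'; files=($names); unset IFS
quoted=$(count_args "${files[@]}")

echo "unquoted: $unquoted, quoted: $quoted"
```

The unquoted form hands over three words (docs/read, me.md, src/main.R), so file would look for two paths that do not exist; the array form hands over the two real paths.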

cjyetman commented 3 years ago

replacing origin/$GITHUB_HEAD_REF with ${{ github.sha }} enables this action to determine the diff even if the PR originates from a fork (versus a feature branch in the same repo)

name: check file encodings in PR
on: [pull_request]
jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: run the checkout action
        uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - name: list all changed files
        run: |
          files=$(git diff --name-only origin/$GITHUB_BASE_REF...${{ github.sha }})
          IFS=$'\n'; files=($files); unset IFS;  # split the string into an array
          file --mime "${files[@]}"
      - name: list all changed files with the wrong encoding
        run: |
          files=$(git diff --name-only origin/$GITHUB_BASE_REF...${{ github.sha }})
          IFS=$'\n'; files=($files); unset IFS;  # split the string into an array
          ! file --mime "${files[@]}" | grep -v "charset=utf-8\|charset=us-ascii\|charset=binary"

cjyetman commented 3 years ago

added a new fix to ignore files that have been completely removed in a PR...

name: check file encodings in PR
on: [pull_request]
jobs:
  file-encoding:
    name: file encoding check
    runs-on: ubuntu-latest

    steps:
      - name: run the checkout action
        uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - name: list all changed files
        run: |
          files=$(git diff --name-only origin/$GITHUB_BASE_REF...${{ github.sha }})
          IFS=$'\n'; files=($files); unset IFS;  # split the string into an array
          file --mime "${files[@]}"
      - name: list all changed files with the wrong encoding
        run: |
          files=$(git diff --name-only origin/$GITHUB_BASE_REF...${{ github.sha }})
          IFS=$'\n'; files=($files); unset IFS;  # split the string into an array
          ! file --mime "${files[@]}" | grep -v "charset=utf-8\|charset=us-ascii\|charset=binary\| (No such file or directory)$"
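
A different way to handle removed files, not the one used in this thread, would be to drop paths that no longer exist before calling file, instead of filtering the error text out of its output. A minimal sketch, where existing_only is a hypothetical helper and the paths are made up:

```shell
# Hypothetical helper: print only those arguments that exist on disk,
# one per line, so paths with spaces survive intact.
existing_only() {
  for path in "$@"; do
    [ -e "$path" ] && printf '%s\n' "$path"
  done
  return 0
}

# One file that exists and one that was "deleted" (never created here).
tmp=$(mktemp -d)
: > "$tmp/kept file.txt"

result=$(existing_only "$tmp/kept file.txt" "$tmp/deleted.txt")
echo "$result"
```

Only the existing path is printed. git diff also accepts --diff-filter=d to leave deleted paths out of the list in the first place, which may be simpler if changing the diff command is acceptable.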