alan-turing-institute / CleverCSV

CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the builtin CSV module with improved dialect detection, and comes with a handy command line application for working with CSV files.
https://clevercsv.readthedocs.io
MIT License
1.25k stars 72 forks source link

Make CleverCSV a pre-commit hook #24

Closed lcnittl closed 3 years ago

lcnittl commented 4 years ago

Would you consider making CleverCSV a pre-commit hook, by adding a .pre-commit-hooks.yaml file to your repository? This would allow people to have csv files in their repositories to automatically check them whenever they want to commit them.

The only thing missing to make it work is, as I understand:

I can file a PR for the .pre-commit-hooks.yaml, if you agree (after the above points are implemented).

Thanks for your consideration!

GjjvdBurg commented 4 years ago

Hi @lcnittl, thanks for the suggestion, this sounds like an interesting application of CleverCSV!

I'd be happy to add a pre-commit hook to the repo. If I understand your suggestion correctly, it would be to use the clevercsv standardize functionality to ensure CSV files are correctly formatted, right?

lcnittl commented 4 years ago

@GjjvdBurg

If I understand your suggestion correctly, it would be to use the clevercsv standardize functionality to ensure CSV files are correctly formatted, right?

Yes, exactly! This would then function similar to black or prettier, just for CSV files, to eg minimize diffs.

GjjvdBurg commented 4 years ago

Cool, alright that sounds good, I'll start a PR soon with the necessary additions to the standardize command

lcnittl commented 4 years ago

Nice - Thanks for working on it!

GjjvdBurg commented 3 years ago

Hi @lcnittl

I've just merged #26, which adds support for standardizing multiple files and adds the --in-place flag. The return code of clevercsv standardize is 2 when any of the input files is changed and 0 if none of them are. Hopefully this is sufficient for implementing the pre-commit hook!

lcnittl commented 3 years ago

Cool, thanks a lot! Shall I make a PR for a .pre-commit-hooks.yaml file?

GjjvdBurg commented 3 years ago

Yeah, that would be great, thanks!

lcnittl commented 3 years ago

I have created a fork and added the .pre-commit-hooks.yaml file there. However, I am having problems installing CleverCSV with pip and the git+https URI scheme (To my understanding, this is the way pre-commit would install the package too):

(.venv) PS C:\Users\lcnittl\playground> pip install --upgrade git+https://github.com/lcnittl/CleverCSV@patch-1
Collecting git+https://github.com/lcnittl/CleverCSV@patch-1
  Cloning https://github.com/lcnittl/CleverCSV (to revision patch-1) to c:\users\lcnittl\appdata\local\temp\pip-req-build-j5a594ud
Collecting chardet>=3.0
  Using cached chardet-3.0.4-py2.py3-none-any.whl (133 kB)
Collecting regex>=2018.11
  Using cached regex-2020.10.15-cp38-cp38-win_amd64.whl (268 kB)
Building wheels for collected packages: clevercsv
  Building wheel for clevercsv (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: 'C:\Users\lcnittl\playground\.venv\Scripts\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\lcnittl\\AppData\\Local\\Temp\\pip-req-build-j5a594ud\\setup.py'"'"'; __file__='"'"'C:\\Users\\lcnittl\\AppData\\Local\\Temp\\pip-req-build-j5a594ud\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\lcnittl\AppData\Local\Temp\pip-wheel-p580azgi'
       cwd: C:\Users\lcnittl\AppData\Local\Temp\pip-req-build-j5a594ud\
  Complete output (70 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-3.8
  creating build\lib.win-amd64-3.8\clevercsv
  copying clevercsv\break_ties.py -> build\lib.win-amd64-3.8\clevercsv
  copying clevercsv\consistency.py -> build\lib.win-amd64-3.8\clevercsv
  copying clevercsv\cparser_util.py -> build\lib.win-amd64-3.8\clevercsv
  copying clevercsv\detect.py -> build\lib.win-amd64-3.8\clevercsv
  copying clevercsv\detect_pattern.py -> build\lib.win-amd64-3.8\clevercsv
  copying clevercsv\detect_type.py -> build\lib.win-amd64-3.8\clevercsv
  copying clevercsv\dialect.py -> build\lib.win-amd64-3.8\clevercsv
  copying clevercsv\dict_read_write.py -> build\lib.win-amd64-3.8\clevercsv
  copying clevercsv\escape.py -> build\lib.win-amd64-3.8\clevercsv
  copying clevercsv\exceptions.py -> build\lib.win-amd64-3.8\clevercsv
  copying clevercsv\normal_form.py -> build\lib.win-amd64-3.8\clevercsv
  copying clevercsv\potential_dialects.py -> build\lib.win-amd64-3.8\clevercsv
  copying clevercsv\read.py -> build\lib.win-amd64-3.8\clevercsv
  copying clevercsv\utils.py -> build\lib.win-amd64-3.8\clevercsv
  copying clevercsv\wrappers.py -> build\lib.win-amd64-3.8\clevercsv
  copying clevercsv\write.py -> build\lib.win-amd64-3.8\clevercsv
  copying clevercsv\_optional.py -> build\lib.win-amd64-3.8\clevercsv
  copying clevercsv\__init__.py -> build\lib.win-amd64-3.8\clevercsv
  copying clevercsv\__main__.py -> build\lib.win-amd64-3.8\clevercsv
  copying clevercsv\__version__.py -> build\lib.win-amd64-3.8\clevercsv
  creating build\lib.win-amd64-3.8\clevercsv\console
  copying clevercsv\console\application.py -> build\lib.win-amd64-3.8\clevercsv\console
  copying clevercsv\console\config.py -> build\lib.win-amd64-3.8\clevercsv\console
  copying clevercsv\console\__init__.py -> build\lib.win-amd64-3.8\clevercsv\console
  creating build\lib.win-amd64-3.8\clevercsv\console\commands
  copying clevercsv\console\commands\code.py -> build\lib.win-amd64-3.8\clevercsv\console\commands
  copying clevercsv\console\commands\detect.py -> build\lib.win-amd64-3.8\clevercsv\console\commands
  copying clevercsv\console\commands\explore.py -> build\lib.win-amd64-3.8\clevercsv\console\commands
  copying clevercsv\console\commands\standardize.py -> build\lib.win-amd64-3.8\clevercsv\console\commands
  copying clevercsv\console\commands\view.py -> build\lib.win-amd64-3.8\clevercsv\console\commands
  copying clevercsv\console\commands\_utils.py -> build\lib.win-amd64-3.8\clevercsv\console\commands
  copying clevercsv\console\commands\__init__.py -> build\lib.win-amd64-3.8\clevercsv\console\commands
  running egg_info
  creating clevercsv.egg-info
  writing clevercsv.egg-info\PKG-INFO
  writing dependency_links to clevercsv.egg-info\dependency_links.txt
  writing entry points to clevercsv.egg-info\entry_points.txt
  writing requirements to clevercsv.egg-info\requires.txt
  writing top-level names to clevercsv.egg-info\top_level.txt
  writing manifest file 'clevercsv.egg-info\SOURCES.txt'
  reading manifest file 'clevercsv.egg-info\SOURCES.txt'
  reading manifest template 'MANIFEST.in'
  warning: no files found matching 'requirements.txt'
  warning: no files found matching '*' under directory 'bin'
  no previously-included directories found matching 'tests\test_integration'
  warning: no previously-included files found matching 'Makefile'
  warning: no previously-included files found matching '.gitignore'
  warning: no previously-included files found matching '.travis.yml'
  warning: no previously-included files found matching '.readthedocs.yml'
  warning: no previously-included files found matching 'make_release.py'
  warning: no previously-included files found matching 'cgrep'
  warning: no previously-included files found matching 'vgrep'
  no previously-included directories found matching 'notes'
  no previously-included directories found matching 'auxiliary'
  writing manifest file 'clevercsv.egg-info\SOURCES.txt'
  running build_ext
  building 'clevercsv.cparser' extension
  creating build\temp.win-amd64-3.8
  creating build\temp.win-amd64-3.8\Release
  creating build\temp.win-amd64-3.8\Release\src
  C:\Program Files (x86)\Microsoft Visual Studio\2017\BuildTools\VC\Tools\MSVC\14.16.27023\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\lcnittl\playground\.venv\include -Ic:\python\python38\include -Ic:\python\python38\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2017\BuildTools\VC\Tools\MSVC\14.16.27023\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /Tcsrc/cparser.c /Fobuild\temp.win-amd64-3.8\Release\src/cparser.obj
  cparser.c
  c:\python\python38\include\pyconfig.h(206): fatal error C1083: Cannot open include file: 'basetsd.h': No such file or directory
  error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2017\\BuildTools\\VC\\Tools\\MSVC\\14.16.27023\\bin\\HostX86\\x64\\cl.exe' failed with exit status 2
  ----------------------------------------
  ERROR: Failed building wheel for clevercsv
  Running setup.py clean for clevercsv
Failed to build clevercsv
Installing collected packages: chardet, regex, clevercsv
    Running setup.py install for clevercsv ... error
    ERROR: Command errored out with exit status 1:
     command: 'C:\Users\lcnittl\playground\.venv\Scripts\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\lcnittl\\AppData\\Local\\Temp\\pip-req-build-j5a594ud\\setup.py'"'"'; __file__='"'"'C:\\Users\\lcnittl\\AppData\\Local\\Temp\\pip-req-build-j5a594ud\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\lcnittl\AppData\Local\Temp\pip-record-prlevfmk\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\lcnittl\playground\.venv\include\site\python3.8\clevercsv'
         cwd: C:\Users\lcnittl\AppData\Local\Temp\pip-req-build-j5a594ud\
    Complete output (68 lines):
    running install
    running build
    running build_py
    creating build
    creating build\lib.win-amd64-3.8
    creating build\lib.win-amd64-3.8\clevercsv
    copying clevercsv\break_ties.py -> build\lib.win-amd64-3.8\clevercsv
    copying clevercsv\consistency.py -> build\lib.win-amd64-3.8\clevercsv
    copying clevercsv\cparser_util.py -> build\lib.win-amd64-3.8\clevercsv
    copying clevercsv\detect.py -> build\lib.win-amd64-3.8\clevercsv
    copying clevercsv\detect_pattern.py -> build\lib.win-amd64-3.8\clevercsv
    copying clevercsv\detect_type.py -> build\lib.win-amd64-3.8\clevercsv
    copying clevercsv\dialect.py -> build\lib.win-amd64-3.8\clevercsv
    copying clevercsv\dict_read_write.py -> build\lib.win-amd64-3.8\clevercsv
    copying clevercsv\escape.py -> build\lib.win-amd64-3.8\clevercsv
    copying clevercsv\exceptions.py -> build\lib.win-amd64-3.8\clevercsv
    copying clevercsv\normal_form.py -> build\lib.win-amd64-3.8\clevercsv
    copying clevercsv\potential_dialects.py -> build\lib.win-amd64-3.8\clevercsv
    copying clevercsv\read.py -> build\lib.win-amd64-3.8\clevercsv
    copying clevercsv\utils.py -> build\lib.win-amd64-3.8\clevercsv
    copying clevercsv\wrappers.py -> build\lib.win-amd64-3.8\clevercsv
    copying clevercsv\write.py -> build\lib.win-amd64-3.8\clevercsv
    copying clevercsv\_optional.py -> build\lib.win-amd64-3.8\clevercsv
    copying clevercsv\__init__.py -> build\lib.win-amd64-3.8\clevercsv
    copying clevercsv\__main__.py -> build\lib.win-amd64-3.8\clevercsv
    copying clevercsv\__version__.py -> build\lib.win-amd64-3.8\clevercsv
    creating build\lib.win-amd64-3.8\clevercsv\console
    copying clevercsv\console\application.py -> build\lib.win-amd64-3.8\clevercsv\console
    copying clevercsv\console\config.py -> build\lib.win-amd64-3.8\clevercsv\console
    copying clevercsv\console\__init__.py -> build\lib.win-amd64-3.8\clevercsv\console
    creating build\lib.win-amd64-3.8\clevercsv\console\commands
    copying clevercsv\console\commands\code.py -> build\lib.win-amd64-3.8\clevercsv\console\commands
    copying clevercsv\console\commands\detect.py -> build\lib.win-amd64-3.8\clevercsv\console\commands
    copying clevercsv\console\commands\explore.py -> build\lib.win-amd64-3.8\clevercsv\console\commands
    copying clevercsv\console\commands\standardize.py -> build\lib.win-amd64-3.8\clevercsv\console\commands
    copying clevercsv\console\commands\view.py -> build\lib.win-amd64-3.8\clevercsv\console\commands
    copying clevercsv\console\commands\_utils.py -> build\lib.win-amd64-3.8\clevercsv\console\commands
    copying clevercsv\console\commands\__init__.py -> build\lib.win-amd64-3.8\clevercsv\console\commands
    running egg_info
    writing clevercsv.egg-info\PKG-INFO
    writing dependency_links to clevercsv.egg-info\dependency_links.txt
    writing entry points to clevercsv.egg-info\entry_points.txt
    writing requirements to clevercsv.egg-info\requires.txt
    writing top-level names to clevercsv.egg-info\top_level.txt
    reading manifest file 'clevercsv.egg-info\SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    warning: no files found matching 'requirements.txt'
    warning: no files found matching '*' under directory 'bin'
    no previously-included directories found matching 'tests\test_integration'
    warning: no previously-included files found matching 'Makefile'
    warning: no previously-included files found matching '.gitignore'
    warning: no previously-included files found matching '.travis.yml'
    warning: no previously-included files found matching '.readthedocs.yml'
    warning: no previously-included files found matching 'make_release.py'
    warning: no previously-included files found matching 'cgrep'
    warning: no previously-included files found matching 'vgrep'
    no previously-included directories found matching 'notes'
    no previously-included directories found matching 'auxiliary'
    writing manifest file 'clevercsv.egg-info\SOURCES.txt'
    running build_ext
    building 'clevercsv.cparser' extension
    creating build\temp.win-amd64-3.8
    creating build\temp.win-amd64-3.8\Release
    creating build\temp.win-amd64-3.8\Release\src
    C:\Program Files (x86)\Microsoft Visual Studio\2017\BuildTools\VC\Tools\MSVC\14.16.27023\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\lcnittl\playground\.venv\include -Ic:\python\python38\include -Ic:\python\python38\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2017\BuildTools\VC\Tools\MSVC\14.16.27023\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /Tcsrc/cparser.c /Fobuild\temp.win-amd64-3.8\Release\src/cparser.obj
    cparser.c
    c:\python\python38\include\pyconfig.h(206): fatal error C1083: Cannot open include file: 'basetsd.h': No such file or directory
    error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2017\\BuildTools\\VC\\Tools\\MSVC\\14.16.27023\\bin\\HostX86\\x64\\cl.exe' failed with exit status 2
    ----------------------------------------
ERROR: Command errored out with exit status 1: 'C:\Users\lcnittl\playground\.venv\Scripts\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\lcnittl\\AppData\\Local\\Temp\\pip-req-build-j5a594ud\\setup.py'"'"'; __file__='"'"'C:\\Users\\lcnittl\\AppData\\Local\\Temp\\pip-req-build-j5a594ud\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\lcnittl\AppData\Local\Temp\pip-record-prlevfmk\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\lcnittl\playground\.venv\include\site\python3.8\clevercsv' Check the logs for full command output.
(.venv) PS C:\Users\lcnittl\playground>

Would you have a (quick) solution for this?

Thanks, Christian

GjjvdBurg commented 3 years ago

Hi Christian,

From what I can see there's something going wrong in the compilation step, with a missing file called basetsd.h. A quick search suggests that this may have to do with your visual studio installation.

In general, it wouldn't be great if the pre-commit hook would require compiling CleverCSV from scratch, as this would probably limit the number of users (we distribute wheels for CleverCSV to avoid the need for compilation). However to test your fork it does make sense to install from Github and thus you'd need to compile.

Let me know if you can't get the compilation step to work. An alternative could be that I try to write the pre-commit hook myself and perhaps you could give feedback.

GjjvdBurg commented 3 years ago

@lcnittl I've just opened a PR with the instructions on adding a pre-commit hook for CleverCSV (#30). Would you mind checking whether this is what you had in mind?

lcnittl commented 3 years ago

@GjjvdBurg Thanks for the work on it and sorry for my nearly 3 weeks of absence!

Trying to find a workaround for the install issue was on my list but I never managed to get to it. And I did not want to create the PR with that still missing.

lcnittl commented 3 years ago

Would you mind checking whether this is what you had in mind?

Kind of, but I think there is something mixed up with the hooks and config file (both are needed). I will test if it works with the "local repo" solution you found and adjust my hooks file, and then open a PR.

The pre-commit-config .yaml file of #30 needs a slightly different layout I think, I will comment in your PR as soon as I have it up and working.

GjjvdBurg commented 3 years ago

Thanks for the quick response. The reason we don't have a .pre-commit-hooks.yaml file is because we use the repo: local setting (https://pre-commit.com/#repository-local-hooks). With that approach the hooks file is no longer needed.

Here's how I tested it:

$ mkdir /tmp/testhook
$ cp example/* /tmp/testhook        # copy examples from CleverCSV
$ cd /tmp/testhook
$ git init
# add .pre-commit-config.yaml as in the README of #30 
$ pre-commit install
pre-commit installed at .git/hooks/pre-commit
$ git add *.csv
$ git commit -am "attempt to commit bad csvs"
CleverCSV Standardize....................................................Failed
- hook id: clevercsv-standardize
- exit code: 2
- files were modified by this hook
lcnittl commented 3 years ago

Yes, indeed, I realized this now. The "problem" with this approach is, that the basic configuration of the hook is now decentralized - so every user would have to update their config file if there are changes to the API (any new keyword needed or the like).

PS: Having a potential alternative. Need to test it though

GjjvdBurg commented 3 years ago

That's a good point. I think we may have the same alternative, how about using the following hook file:

- id: clevercsv-standardize
  name: CleverCSV Standardize
  entry: clevercsv standardize --in-place
  language: python
  language_version: python3
  types: ['csv']
  additional_dependencies: ['clevercsv[full]']

Seems to work with pre-commit try-repo

lcnittl commented 3 years ago

I got some input from the maintainer of pre-commit (very, very nice guy!!). The problem with having the hook file is that in any case pip install . would be run. I just tested this on my machine, and it again failed on compilation.

The alternative would be to have a "mirror-repo" that solely consists of a hooks file and a setup.py, which install clevercsv from PyPI. There is a tool that auto-creates the 2 files ( https://github.com/pre-commit/pre-commit-mirror-maker ).

This is what I used:

pre-commit-mirror mirrors-CleverCSV --language python --package-name clevercsv --entry "clevercsv standardize --in-place" --types csv

I just have to upload that to a new repo and test if it really works :D

lcnittl commented 3 years ago

Some comments from the maintainer, stating why we should not use additional_dependencies in this very case:

  • language_version: python3 ~generally shouldn't be needed, pre-commit only supports python3 now and will default the current interpreter version
  • the additional_dependencies should be pinned to a version for repeatability, this is what pre-commit-mirror-maker does when it iterates through versions
  • it's ~slightly better to have the pypi dependency in a trivial setup.py (as pre-commit-mirror-maker does) such that the consumer can freely use additional_dependencies to add plugins and such) -- this isn't possible in all languages (for example js) so sometimes additional_dependencies is used by pre-commit-mirror-maker in this way
GjjvdBurg commented 3 years ago

That's great, thanks for asking him and figuring this out! So if I understand correctly the user would point their .pre-commit-config.yaml to the mirror repo, which contains a minimal setup.py that depends on CleverCSV?

I'll try the pre-commit-mirror-maker myself as well but let me know what you find.

One caveat is that the versions seem to go all the way back to v0.1.2, which didn't have a command line interface yet. So maybe we don't need to use mirror-maker but can just apply the same idea to a separate repo.

lcnittl commented 3 years ago

One caveat is that the versions seem to go all the way back to v0.1.2, which didn't have a command line interface yet. So maybe we don't need to use mirror-maker but can just apply the same idea to a separate repo.

I think the idea behind this is, that you can use it to then update the mirror in the future (It will add all new tags).

However, so far it seems to be not possible to use clevercsv[full] as package name (just clevercsv) - so manual adjustment would be needed. I will file an issue there.

You can test it with this:

repos:
  - repo: https://github.com/lcnittl/mirros-CleverCSV
    rev: v0.6.6
    hooks:
      - id: clevercsv
GjjvdBurg commented 3 years ago

Cool, I can confirm that your repo works!

I'm trying to think about how to set this up so it isn't a lot of effort to maintain going forward. I think the setup.py of the mirror repo can be simplified to pull the version info from the .version file (similar to what we do in this repo), and then I can add a bash script or so to update that .version file and do the git tag whenever I do a new release of CleverCSV.

I'll pick this back up tomorrow, thanks for your help!

lcnittl commented 3 years ago

If the extras install works in the future, I think the easiest might be to have a batch/shell script that executes the mirroring command, together with git push:

pre-commit-mirror mirrors-CleverCSV --language python --package-name "clevercsv[full]" --entry "clevercsv standardize" --args="--in-place" --types csv
git push
git push --tags

Then updating would be a single "click", and could even run as a cron job/scheduled task or similar. (And would save the hassle to make a separate script - besides the above one 🤣).

PS: The pre-commit maintainer suggests to have --in-place as argument rather than entry. This is certainly a possibility and would allow users to turn that off by resetting args, so I changed it above now.

GjjvdBurg commented 3 years ago

Hi @lcnittl, I've created a dummy package repo here and I've updated the instructions in the readme in #30. I've chosen to go for a bash script to bump the version as it is simpler and therefore hopefully easier to maintain in the long run.

The dummy package has a test suite to verify the hook works, but I'd appreciate it if you could double check :)

lcnittl commented 3 years ago

The dummy package has a test suite to verify the hook works, but I'd appreciate it if you could double check :)

Can confirm that it works for me without problems.

Thanks a lot for all your effort!

(Not closing as merging of #30 will do so)

lcnittl commented 3 years ago

PS: I asked to include the CleverCSV hooks in the official pre-commit list of hooks: https://github.com/pre-commit/pre-commit.com/pull/413 (got already merged) to enhance visibility. It is the first CSV hook there :smile:

GjjvdBurg commented 3 years ago

Ah that's cool, thanks for that! :smile: