alan-turing-institute / CleverCSV

CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the builtin CSV module with improved dialect detection, and comes with a handy command line application for working with CSV files.
https://clevercsv.readthedocs.io
MIT License

Header Detection Improvement #45

Open ben-bitdotio opened 3 years ago

ben-bitdotio commented 3 years ago

Summary: Added type resolution between float and int, so that a column mixing the two is no longer treated as having incompatible types during header detection.

Tests: Verified that Detector.has_header() correctly predicts a header for the following file.

col1,col2,col3
hello,"hello world", 1.2
world,"hello world", 1.2
test,"hello world 您", 1
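To make the idea concrete, here is a minimal, standard-library-only sketch of int/float resolution in a header heuristic. It is not CleverCSV's actual implementation: the helper names (cell_type, resolve, column_type, looks_like_header) and the simple voting rule are assumptions made for illustration only.

```python
import csv
import io

# Sample file from this PR description.
SAMPLE = (
    "col1,col2,col3\n"
    'hello,"hello world", 1.2\n'
    'world,"hello world", 1.2\n'
    'test,"hello world 您", 1\n'
)


def cell_type(cell):
    """Return int, float, or str for one cell (hypothetical helper)."""
    text = cell.strip()
    for candidate in (int, float):
        try:
            candidate(text)
            return candidate
        except ValueError:
            continue
    return str


def resolve(a, b):
    """Merge two cell types: int and float resolve to float,
    any other mismatch is incompatible (None)."""
    if a is b:
        return a
    if {a, b} == {int, float}:
        return float
    return None


def column_type(cells):
    """Return the type shared by all cells, or None if they conflict."""
    result = None
    for i, cell in enumerate(cells):
        kind = cell_type(cell)
        result = kind if i == 0 else resolve(result, kind)
        if result is None:
            return None
    return result


def looks_like_header(text):
    """Simplified vote (an assumption, not the library's rule): a numeric
    column whose first-row cell is not numeric counts towards a header."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    votes = 0
    for col, name in enumerate(header):
        cells = [row[col] for row in body if len(row) > col]
        if column_type(cells) in (int, float):
            votes += 1 if cell_type(name) is str else -1
    return votes > 0


print(looks_like_header(SAMPLE))
```

Without the int/float branch in resolve, the third column (1.2, 1.2, 1) would be classed as inconsistent and contribute no vote, so this sketch would find no evidence of a header in the sample.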

Update: I will not be able to contribute to this discussion under this account after today. It appears that I can't modify the assignees list, but @ellie-bitio should be able to follow up if necessary.

GjjvdBurg commented 3 years ago

Thanks for opening an issue on this and creating a PR @ben-bitdotio! The header detection code could definitely be improved, but I've been waiting until I have a dataset to evaluate the accuracy of different algorithms. This fix seems pretty harmless though, so I think we can merge it for now.

Would you be able to add a unit test to tests/test_unit/test_detect.py that fails without your fix but passes with it? That would be a nice confirmation that it works as expected (the example you give above could serve as the test case). Thank you!
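A rough sketch of what such a regression test might look like, using the sample file from the PR description. The import path clevercsv.detect, the class and method names below, and the use of unittest are assumptions; the conventions already used in tests/test_unit/test_detect.py should take precedence.

```python
import unittest

try:
    # Assumed import path; adjust to match the existing test module.
    from clevercsv.detect import Detector
except ImportError:
    Detector = None

# Sample from the PR description: the third column mixes floats and an int.
HEADER_SAMPLE = (
    "col1,col2,col3\n"
    'hello,"hello world", 1.2\n'
    'world,"hello world", 1.2\n'
    'test,"hello world 您", 1\n'
)


@unittest.skipIf(Detector is None, "clevercsv is not installed")
class HeaderDetectionTest(unittest.TestCase):
    # Hypothetical test name; intended to fail without the fix.
    def test_has_header_mixed_int_float(self):
        self.assertTrue(Detector().has_header(HEADER_SAMPLE))
```

This can be run with python -m unittest on the test module. Checking it against the code both with and without the change confirms that it really exercises the fix.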

(cc-ing @ellie-bitio as suggested)