alan-turing-institute / CleverCSV

CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the builtin CSV module with improved dialect detection, and comes with a handy command line application for working with CSV files.
https://clevercsv.readthedocs.io
MIT License

Reduce median dialect detection time by ~64% #96

Closed by GjjvdBurg 1 year ago

GjjvdBurg commented 1 year ago

This PR redesigns the consistency score calculation so that the results of type detection can be cached. This reduces the median runtime by 64% compared to the current master branch (computed in the same way as in #92). The average runtime on our test set is reduced by ~32% compared to the current master branch. Further performance improvements are likely possible.
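To illustrate why caching type detection can help, here is a minimal sketch. It is not the code in this PR: `has_known_type`, `type_score`, and `split_cells` are hypothetical names, the regexes are a toy stand-in for the real type detection, and counting empty cells as typed is an assumption. The idea is that many candidate dialects produce the same cell strings, so the per-cell result can be memoized and reused.

```python
import re
from functools import lru_cache

# Toy patterns; the real type detection covers many more types (assumption).
_INT = re.compile(r"[+-]?\d+")
_FLOAT = re.compile(r"[+-]?(\d+\.\d*|\.\d+)([eE][+-]?\d+)?")


@lru_cache(maxsize=None)
def has_known_type(cell: str) -> bool:
    """Return True if the cell matches a known type; memoized per string."""
    cell = cell.strip()
    if cell == "":
        return True  # assumption: an empty cell counts as a known type
    return bool(_INT.fullmatch(cell) or _FLOAT.fullmatch(cell))


def split_cells(text: str, delimiter: str) -> list:
    """Naive split into rows of cells (no quoting or escaping)."""
    return [line.split(delimiter) for line in text.splitlines()]


def type_score(rows: list) -> float:
    """Fraction of cells with a known type."""
    cells = [cell for row in rows for cell in row]
    if not cells:
        return 0.0
    return sum(has_known_type(cell) for cell in cells) / len(cells)


if __name__ == "__main__":
    rows = split_cells("1,2\n3,x", ",")
    print(type_score(rows))             # first pass fills the cache
    print(type_score(rows))             # second pass is served from the cache
    print(has_known_type.cache_info())  # hit/miss counters
```

In this sketch the cache is keyed on the cell string alone, which only works if the type of a cell does not depend on the dialect being evaluated.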

Compared to v0.7.6, CleverCSV is now ~52% faster on average, and median runtime is reduced by 68%.
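For readers wondering why the median and average figures differ so much, the sketch below shows one way such reductions could be computed, assuming each percentage is the relative drop in a statistic over per-file runtimes. `relative_reduction` is a hypothetical helper, and the runtimes are invented purely for illustration; they are not measurements from the test set.

```python
from statistics import mean, median


def relative_reduction(before, after, stat):
    """Relative drop in stat(runtimes), e.g. 0.5 means a 50% reduction."""
    b = stat(before)
    a = stat(after)
    return (b - a) / b


if __name__ == "__main__":
    # Invented runtimes in seconds: three fast files and one slow outlier.
    before = [1.0, 1.0, 1.0, 10.0]
    after = [0.5, 0.5, 0.5, 9.0]

    print(f"median: {relative_reduction(before, after, median):.0%}")
    print(f"mean:   {relative_reduction(before, after, mean):.0%}")
```

A few slow files dominate the mean, so a speedup that mostly benefits typical files shows up more strongly in the median than in the average.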