alan-turing-institute / CleverCSV

CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the builtin CSV module with improved dialect detection, and comes with a handy command line application for working with CSV files.
https://clevercsv.readthedocs.io
MIT License

a standardize that fixes? #119

Closed · rkiddy closed this issue 3 months ago

rkiddy commented 7 months ago

I was hoping that clevercsv would help me with this. I get these files from the California Secretary of State's office, and the data is full of junk. The fields are tab separated, but people can apparently insert unescaped tabs into their data. There are sometimes newlines put into the data, so you get what looks like most of a line followed by a small fragment of another, plus other noise of that kind.

An example file is here:

https://opencalaccess.org/misc/CVR_CAMPAIGN_DISCLOSURE_CD.TSV (600k lines)

and just the top of it:

https://opencalaccess.org/misc/CVR_CAMPAIGN_DISCLOSURE_top1k_CD.TSV

I get this:

 $ clevercsv standardize --output CVR_CAMPAIGN_DISCLOSURE_std_CD.TSV CVR_CAMPAIGN_DISCLOSURE_CD.TSV
 Traceback (most recent call last):
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/cparser_util.py", line 67, in _parse_data
     for row in parser:
 cparser.Error: line contains NULL byte

 During handling of the above exception, another exception occurred:

 Traceback (most recent call last):
   File "/home/ray/.local/bin/clevercsv", line 8, in <module>
     sys.exit(main())
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/__main__.py", line 20, in main
     sys.exit(realmain())
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/console/__init__.py", line 8, in main
     return app.run()
   File "/home/ray/.local/lib/python3.10/site-packages/wilderness/application.py", line 383, in run
     return self.run_command(command)
   File "/home/ray/.local/lib/python3.10/site-packages/wilderness/application.py", line 400, in run_command
     return command.handle()
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/console/commands/standardize.py", line 151, in handle
     retval = self.handle_path(
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/console/commands/standardize.py", line 173, in handle_path
     dialect = detect_dialect(
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/wrappers.py", line 398, in detect_dialect
     dialect = Detector().detect(
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/detect.py", line 127, in detect
     return consistency_detector.detect(sample, delimiters=delimiters)
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/consistency.py", line 121, in detect
     scores = self.compute_consistency_scores(data, dialects)
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/consistency.py", line 170, in compute_consistency_scores
     T = self.compute_type_score(data, dialect)
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/consistency.py", line 199, in compute_type_score
     for row in parse_string(data, dialect, return_quoted=True):
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/cparser_util.py", line 132, in parse_data
     yield from _parse_data(
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/cparser_util.py", line 70, in _parse_data
     raise Error(str(e))
 clevercsv.exceptions.Error: line contains NULL byte
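
As an aside, the "line contains NULL byte" failure can usually be sidestepped by scrubbing \x00 from a sample of the file before running detection. A minimal sketch, using the Detector class that the traceback passes through (the sample size is arbitrary, and this is a workaround, not a documented CleverCSV recipe):

 # Workaround sketch, not a documented CleverCSV recipe: the parser is
 # choking on NULL bytes, so scrub them from the sample before detection.
 from clevercsv.detect import Detector  # same class the traceback goes through

 with open("CVR_CAMPAIGN_DISCLOSURE_CD.TSV", encoding="utf-8", errors="replace") as f:
     sample = f.read(1024 * 1024).replace("\x00", "")  # sample size is arbitrary

 dialect = Detector().detect(sample)
 print(dialect)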

An example of the kind of fix I have to do is this. Almost all of the rows in this file have 86 fields. But:

 line_num: 47045, fields # 81
 line_num: 47046, fields # 6

So these two lines need to be joined.
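
A minimal sketch of that repair, assuming tab-delimited records and the 86-field count mentioned above (this is hand-rolled preprocessing, not a CleverCSV feature; the file name comes from this comment):

 # Sketch: re-join records that a stray newline split in two, assuming
 # every complete record has exactly 86 tab-separated fields.
 EXPECTED_FIELDS = 86

 def join_split_rows(lines, expected=EXPECTED_FIELDS):
     buf = ""
     for line in lines:
         # Also scrub NULL bytes, since they break dialect detection.
         buf += line.rstrip("\n").replace("\x00", "")
         # A complete record contains expected - 1 tabs; keep appending
         # fragments until we get there. (A line with *embedded* tabs will
         # overshoot and needs a field-level repair instead.)
         if buf.count("\t") >= expected - 1:
             yield buf
             buf = ""
     if buf:
         yield buf  # trailing fragment; flag for manual review

 with open("CVR_CAMPAIGN_DISCLOSURE_CD.TSV", encoding="utf-8", errors="replace") as f:
     rows = [row.split("\t") for row in join_split_rows(f)]

Any row that comes out with more than 86 fields after joining has embedded tabs, which is a separate, field-level problem.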

eode commented 3 months ago

Sometimes, you just have to send the data back.

eode commented 3 months ago

I know this is old, but if you really want to do this, it'll take a lot of manual munging and custom coding.

I'm mainly posting this because, if you've been going down that road, you're probably still working on that same dataset four months later. A dataset just isn't valid if it's that problematic.

rkiddy commented 3 months ago

Point taken. And you are correct that I am not done fixing the data.

But I work with data from several government agencies. Unfortunately, when the government is required to publish data, they are not usually required to publish correct data. So, problems with this data are common.

If you ever want to suggest something, I am curious to hear which approaches to this you think would work best. I have dynamic fixes (code that evaluates the errors), template-based fixes, and data-item-specific fixes, and I despair of ever figuring out the best way to do it.

eode commented 3 months ago

One thought is that you might be able to get the data as JSON instead of CSV. JSON is a much more reliable format.

But, assuming you can't, some of it will probably just be manual. Building tools to help improve the manual workflow is one possibility.

Aside from that: the goal is to narrow down the amount and location of the bad data, identify the good data, and use the good data to locate the bad fields. This involves a lot of coding, and a lot of defining each field very exactly. Sometimes you'll get lucky and your bad data will be immediately bounded by good data. So, something like this:

 blog post,b27795fe-b8b1-4247-9bd1-995a313bf7bb,My cat purrs incessantly and I love it,11032023
 email,2bf21a0a-e773-4622-9e26-d444640f228b,And I was like, OMG sandy, wtf?
 so *SHE* was like "Oh no, that's some bs, you're not putting that on me LOLOLOL.

 anyways, my brother died and I'm sad, can you come over?,11032023
 text message,e788ddde-e737-11ee-a9e0-15ac9046e6b9,I think that you shouldn't make light of others' sorrows.,11042023

Let's parse that real quick. You load the whole file as one string, or use seeks on a file object. Parsing goes roughly as such: the first field has to be one of the known record types, the second has to be a UUID, and the record has to end with an eight-digit date followed by a newline. Everything between the UUID and the date is the free-text field, however many commas and newlines it contains, so you scan forward until you hit the next date-plus-newline boundary (or the end of the file).
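
A minimal sketch of that anchoring approach, assuming the record types, UUID format, and date format shown in the toy data above (messages.csv and all names here are illustrative, not anything from CleverCSV):

 import re

 # Anchor on the good fields: a known record type, a UUID, and a trailing
 # eight-digit date. The free-text field in between is matched non-greedily
 # and may contain commas and newlines (hence re.DOTALL).
 RECORD = re.compile(
     r"(?P<kind>blog post|email|text message),"
     r"(?P<id>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}),"
     r"(?P<body>.*?),"
     r"(?P<date>\d{8})(?:\n|$)",
     re.DOTALL,
 )

 with open("messages.csv", encoding="utf-8") as f:  # hypothetical file
     text = f.read()

 records, bad_spans, pos = [], [], 0
 for m in RECORD.finditer(text):
     if m.start() != pos:
         bad_spans.append((pos, text[pos:m.start()]))  # bad data, located exactly
     records.append(m.groupdict())
     pos = m.end()

Whatever the pattern skips over lands in bad_spans with its exact position, which is the "narrow down the amount and location" part.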

The same process applies to all three records in the example. Obviously this is a simplistic example, but it's enough to illustrate the point, and possibly to be a useful tool in your arsenal. Hopefully this helps.

eode commented 3 months ago

Not too bad. Less than 1% bad lines, even though it's also pretty incomplete.