a standardize that fixes?

rkiddy commented 7 months ago

I was hoping that clevercsv would help me with this. I get these files from the California Secretary of State's office. The data is full of junk. The fields are tab separated but people can apparently insert tabs into their data, where they are not escaped. There are sometimes newlines put into the data so that you get what looks like most of a line and then a small fragment of another. Any other such noise.

An example file is here:

https://opencalaccess.org/misc/CVR_CAMPAIGN_DISCLOSURE_CD.TSV (600k lines)

and just the top of it:

https://opencalaccess.org/misc/CVR_CAMPAIGN_DISCLOSURE_top1k_CD.TSV

I get this:

 $ clevercsv standardize --output CVR_CAMPAIGN_DISCLOSURE_std_CD.TSV CVR_CAMPAIGN_DISCLOSURE_CD.TSV
 Traceback (most recent call last):
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/cparser_util.py", line 67, in _parse_data
     for row in parser:
 cparser.Error: line contains NULL byte

 During handling of the above exception, another exception occurred:

 Traceback (most recent call last):
   File "/home/ray/.local/bin/clevercsv", line 8, in <module>
     sys.exit(main())
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/__main__.py", line 20, in main
     sys.exit(realmain())
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/console/__init__.py", line 8, in main
     return app.run()
   File "/home/ray/.local/lib/python3.10/site-packages/wilderness/application.py", line 383, in run
     return self.run_command(command)
   File "/home/ray/.local/lib/python3.10/site-packages/wilderness/application.py", line 400, in run_command
     return command.handle()
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/console/commands/standardize.py", line 151, in handle
     retval = self.handle_path(
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/console/commands/standardize.py", line 173, in handle_path
     dialect = detect_dialect(
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/wrappers.py", line 398, in detect_dialect
     dialect = Detector().detect(
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/detect.py", line 127, in detect
     return consistency_detector.detect(sample, delimiters=delimiters)
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/consistency.py", line 121, in detect
     scores = self.compute_consistency_scores(data, dialects)
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/consistency.py", line 170, in compute_consistency_scores
     T = self.compute_type_score(data, dialect)
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/consistency.py", line 199, in compute_type_score
     for row in parse_string(data, dialect, return_quoted=True):
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/cparser_util.py", line 132, in parse_data
     yield from _parse_data(
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/cparser_util.py", line 70, in _parse_data
     raise Error(str(e))
 clevercsv.exceptions.Error: line contains NULL byte

An example of the kind of fix I have to do is this. Almost all of the rows in this file have 86 fields. But:

 line_num: 47045, fields # 81
 line_num: 47046, fields # 6

So these two lines need to be joined.

eode commented 3 months ago

Sometimes, you just have to send the data back.

eode commented 3 months ago

I know this is old, but if you really want to do this, it'll take a lot of manual munging and custom coding. Doing things like:

identifiying the exact type of each field, beyond just int and str, so that you know precisely what each row should look like
standardizing as much as you can of the data into a viable format
ensuring each line matches the known types -- when it doesn't, evaluate each problem instance to see if it's a common occurrence, and if it can be coded around. If it can't be, drop that line/line set, or edit it manually into the viable format, and move on to the next set of errors.

I'm mainly posting this because, if you've been doing that, you're still probably working on that same dataset 4 months later. The dataset just isn't valid if it's that problematic.

rkiddy commented 3 months ago

Point taken. And you are correct that I am not done fixing the data.

But I work with data from several government agencies. Unfortunately, when the government is required to publish data, they are not usually required to publish correct data. So, problems with this data are common.

I am curious to hear, if you ever want to suggest something, about what approaches to this you think would work best. I have dynamic fixes (evaluating the errors with code) and template-based fixes, and data-item-specific fixes. I despair of ever figuring out the best way to do this.

eode commented 3 months ago

One thing is that you might be able to get the data as JSON instead of CSV. JSON is a very reliable format.

But, assuming you can't, some of it will probably just be manual. Building tools to help improve the manual workflow is one possibility.

Aside from that: The goal is to narrow down the amount and location of bad data, and discover good data, and use it to find the bad fields. This involves a lot of coding, and a lot of defining each field very exactly. Sometimes, you'll get lucky and your bad data will be immediately bounded by good data. So, something like this:

str, post type category, one of "email", "blog post", "text message"
- always comes after a newline or as first field in file
- always followed by a comma
- matches one of the expected strings
string: uuid
- always hexadeximal fields of length 8, 4, 4, 4, 12
- always dash-separated fields of the expected length
- always overall length of 36
- always preceded by a comma
- always followed by a comma
text field -- undefined length, can contain enters, tabs, quotes
- screw this chaos field, don't even bother parsing it
- always preceded by a comma, always followed by a comma
date, of the format mmddyyyy
- this must be parseable into an actual date
- length 8, always
- always followed by a newline

blog post,b27795fe-b8b1-4247-9bd1-995a313bf7bb,My cat purrs incessantly and I love it,11032023
email,2bf21a0a-e773-4622-9e26-d444640f228b,And I was like, OMG sandy, wtf?
so *SHE* was like "Oh no, that's some bs, you're not putting that on me LOLOLOL.

anyways, my brother died and I'm sad, can you come over?,11032023
text message,e788ddde-e737-11ee-a9e0-15ac9046e6b9,I think that you shouldn't make light of others' sorrows.,11042023

Let's parse that real quick. You load the whole file as one string, or use seeks on a file object. Parsing goes as such:

We are expecting <newline or start of file><"blog post" or "email" or "text message">,
- we got it.
We are expecting a UUID
- we got it
We are expecting a bunch of potentially bad text next. Scan for good data following that.
- scan forward, looking char-by-char for a date of the specific format mmddyy, preceded by a comma and followed by newline or end of file
- scan forward until you have parsed the rest of the expected fields -- in this case, just ,<date><newline or eof>
- if you find another complete line without finishing the previous one, then the previous line is junk and should be logged and discarded, or at least separated out..
- otherwise, take the intervening text between the two good fields, and that's your chaos text field.
  replace commas with "{comma}" and newlines with {newline}, or use filestring[end_of_last_known_good:beginning_of_next_known_good]
- now go back and parse that (now modified) text field as though it's,<text>,
We are expecting ,<date><newline or eof -- we got it.

Same process for all three lines of the example. Obviously, simplistic example, but enough to illustrate the point, and possibly be a useful tool in your arsenal. Hopefully this helps.

eode commented 3 months ago

Not too bad. Less than 1% bad lines, even though it's also pretty incomplete.

alan-turing-institute / CleverCSV

a standardize that fixes? #119