faradayio / scrubcsv

Remove bad records from a CSV file and normalize

Be more verbose when normalizing #21

Open ngirard opened 3 years ago

ngirard commented 3 years ago

In addition to #13, it would be useful if scrubcsv could output details about which data was normalized and for what reason. This could be enabled with a dedicated --verbose command-line option, for instance.

ngirard commented 3 years ago

As food for thought, see Csvlint.

emk commented 3 years ago

Thank you for the suggestion!

For the most part, the first pass of normalization in scrubcsv is actually performed by the csv parser. This handles things like normalization of quoting.

But there's no easy way to keep track of the decisions that csv makes under the hood.
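
To make that concrete, here is a minimal sketch (not scrubcsv's actual code; it assumes the Rust `csv` crate as a dependency) of why re-serializing already normalizes quoting, and why the intermediate decisions are invisible: by the time the reader hands back a `ByteRecord`, the quoting and escaping have already been interpreted, and nothing reports what was changed.

```rust
// Minimal sketch: piping records through the `csv` crate's Reader and Writer
// normalizes quoting as a side effect, without reporting what changed.
use std::io;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut rdr = csv::ReaderBuilder::new()
        .has_headers(false)
        .flexible(true) // tolerate rows whose field counts vary
        .from_reader(io::stdin());
    let mut wtr = csv::WriterBuilder::new().from_writer(io::stdout());

    for result in rdr.byte_records() {
        // By the time we see a ByteRecord, the parser has already decided how
        // to interpret quotes and escapes; those decisions aren't reported.
        let record = result?;
        // Writing the record back out re-quotes fields using standard rules,
        // so the output is normalized as a side effect.
        wtr.write_byte_record(&record)?;
    }
    wtr.flush()?;
    Ok(())
}
```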

One of the underlying issues here is that CSV is a poorly-defined format. (There's a spec. Actually, there are multiple specs and they don't always agree, as far as I know. And many real-world implementations have issues you wouldn't expect from the specs.)

scrubcsv was designed to be run on hundreds of millions of rows, or occasionally tens of billions of rows. And the input files may come from a wide variety of different sources that produce subtly corrupt CSV files. At that scale, corrupt input is pretty much a given. The primary goal of parsing is to produce standards-compliant output, and to fail if too many rows are corrupt. So the underlying goals, in order of importance, look something like:

  1. Speed.
  2. Valid output.
  3. Detection of large-scale systemic errors, as opposed to 1-in-a-million scattered errors (sketched below).
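
Here's a rough sketch of what goal 3 means in practice (illustrative only, not scrubcsv's actual logic or thresholds): drop individual bad rows as you go, but fail the whole run if the bad-row ratio crosses a cutoff.

```rust
// Rough sketch: skip rows with the wrong field count, but fail the run
// if too large a fraction of the input is bad (systemic breakage).
use std::{io, process};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Illustrative threshold only; the real cutoff may differ.
    const MAX_BAD_RATIO: f64 = 0.01;

    let mut rdr = csv::ReaderBuilder::new()
        .has_headers(false)
        .flexible(true)
        .from_reader(io::stdin());
    let mut wtr = csv::WriterBuilder::new().from_writer(io::stdout());

    let mut expected_len: Option<usize> = None;
    let (mut total, mut bad) = (0u64, 0u64);

    for result in rdr.byte_records() {
        total += 1;
        match result {
            Ok(record) => {
                // Treat the first row's field count as the expected width.
                let len = *expected_len.get_or_insert(record.len());
                if record.len() == len {
                    wtr.write_byte_record(&record)?;
                } else {
                    // A scattered bad row: drop it and keep going.
                    bad += 1;
                }
            }
            Err(_) => bad += 1,
        }
    }
    wtr.flush()?;

    eprintln!("{} rows read, {} bad", total, bad);
    if total > 0 && (bad as f64) / (total as f64) > MAX_BAD_RATIO {
        // Too many bad rows suggests systemic breakage, not random noise.
        process::exit(1);
    }
    Ok(())
}
```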

I'm not necessarily opposed to adding more detailed reporting of errors, but not at the cost of performance on mostly-valid data. scrubcsv is often run in distributed batch jobs spread across dozens of servers, and performance is important.
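
One possible compromise, sketched below purely hypothetically (this is not an existing scrubcsv feature, and the `NormalizationStats` name and its fields are made up), would be to keep cheap per-category counters on the hot path and print a single aggregate summary at the end under a --verbose flag, rather than emitting a message per row. Even that has a catch: knowing that quoting changed at all may require comparing input bytes against the re-serialized output, which itself has a cost.

```rust
// Hypothetical compromise: the hot path only bumps counters; one summary
// line is printed at the end when --verbose is passed.
#[derive(Default)]
struct NormalizationStats {
    rows_requoted: u64,
    rows_dropped: u64,
}

impl NormalizationStats {
    /// Emit one aggregate report after all rows are processed.
    fn report(&self) {
        eprintln!(
            "requoted {} rows, dropped {} rows",
            self.rows_requoted, self.rows_dropped
        );
    }
}

fn main() {
    let verbose = std::env::args().any(|a| a == "--verbose");
    let mut stats = NormalizationStats::default();

    // Per-row work would go here; counters are incremented as issues are
    // detected, e.g.:
    stats.rows_requoted += 1;

    if verbose {
        stats.report();
    }
}
```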