ngirard opened this issue 3 years ago

In addition to #13, it would be useful if `scrubcsv` could output details about which data was normalized and for what reason. This could be enabled by a dedicated `--verbose` command-line option, for instance.

Thank you for the suggestion!
For the most part, the first pass of normalization in `scrubcsv` is actually performed by the `csv` parser. This handles things like normalization of quoting. But there's no easy way to keep track of the decisions that `csv` makes under the hood.
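To make that concrete, here is a minimal sketch (not `scrubcsv`'s actual code) of what "let the `csv` crate do it" looks like: records are parsed and immediately re-serialized, the writer applies one consistent quoting policy, and the inconsistently quoted input comes out uniform. Notice that the reader never reports what it had to fix.

```rust
// Requires the `csv` crate. Parse records and immediately re-serialize them;
// the writer applies a single quoting policy, so oddly quoted input comes out
// uniform, but the reader never tells us which fields it had to repair.
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Input with inconsistent quoting: some fields quoted, some not.
    let input = "name,comment\n\"Alice\",hello\nBob,\"says \"\"hi\"\"\"\n";

    let mut reader = csv::Reader::from_reader(input.as_bytes());
    let mut writer = csv::Writer::from_writer(std::io::stdout());

    // Header row first, then each data record; all quoting decisions happen
    // inside the csv crate, invisible to this code.
    writer.write_byte_record(reader.byte_headers()?)?;
    for record in reader.byte_records() {
        writer.write_byte_record(&record?)?;
    }
    writer.flush()?;
    Ok(())
}
```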
One of the underlying issues here is that CSV is a poorly-defined format. (There's a spec. Actually, there are multiple specs and they don't always agree, as far as I know. And many real-world implementations have issues you wouldn't expect from the specs.)
`scrubcsv` was designed to be run on hundreds of millions of rows, or occasionally tens of billions of rows. And the input files may come from a wide variety of different sources that produce subtly corrupt CSV files. At that scale, corrupt input is pretty much a given. The primary goal of parsing is to produce standards-compliant output, and to fail if too many rows are corrupt. So the underlying goals, in order of importance, look something like:

1. Produce standards-compliant CSV output.
2. Fail if too many input rows are corrupt.
3. Stay fast, even at that scale.
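As a rough illustration of the "fail if too many rows are corrupt" policy (a sketch, not `scrubcsv`'s implementation; the 10% threshold and the exit code are made up): stream records, drop anything the parser rejects or that has the wrong column count, and abort if the bad-row ratio gets too high.

```rust
// Requires the `csv` crate. Reads CSV from stdin, writes cleaned CSV to
// stdout, and exits with an error if too large a fraction of rows was bad.
use std::{error::Error, process};

fn main() -> Result<(), Box<dyn Error>> {
    // Illustrative threshold; the real policy may differ.
    let max_bad_ratio = 0.10;

    let mut reader = csv::ReaderBuilder::new()
        .flexible(true) // accept rows with a different field count...
        .from_reader(std::io::stdin());
    let mut writer = csv::Writer::from_writer(std::io::stdout());

    let headers = reader.byte_headers()?.clone();
    let expected_fields = headers.len();
    writer.write_byte_record(&headers)?;

    let (mut good, mut bad) = (0u64, 0u64);
    for record in reader.byte_records() {
        match record {
            // ...but only keep the ones that match the header width.
            Ok(rec) if rec.len() == expected_fields => {
                writer.write_byte_record(&rec)?;
                good += 1;
            }
            // Parse error or wrong column count: drop the row and count it.
            _ => bad += 1,
        }
    }
    writer.flush()?;

    let total = good + bad;
    if total > 0 && bad as f64 / total as f64 > max_bad_ratio {
        eprintln!("too many corrupt rows: {} of {}", bad, total);
        process::exit(1);
    }
    Ok(())
}
```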
I'm not necessarily opposed to adding more detailed reporting of errors, but not at the cost of performance on mostly-valid data. `scrubcsv` is often run in distributed batch jobs spread across dozens of servers, and performance is important.
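Purely as an illustration of how the requested reporting could avoid slowing down the common path (nothing like this exists in `scrubcsv` today; the type, field, and flag names here are hypothetical): aggregate a per-reason counter instead of logging every row, and only print a summary when a `--verbose` flag is set, so rows that need no fixing cost nothing extra.

```rust
// Hypothetical sketch: count normalization reasons instead of logging each
// row, and print a summary only when verbose output was requested.
use std::collections::HashMap;

#[derive(Default)]
struct Report {
    fixes: HashMap<&'static str, u64>,
}

impl Report {
    // Called only for rows that were actually changed or dropped, so the
    // common all-good path never touches the map.
    fn note(&mut self, reason: &'static str) {
        *self.fixes.entry(reason).or_insert(0) += 1;
    }

    fn print(&self) {
        for (reason, count) in &self.fixes {
            eprintln!("{:>10}  {}", count, reason);
        }
    }
}

fn main() {
    let verbose = true; // stand-in for a hypothetical --verbose flag
    let mut report = Report::default();

    // In a real run these calls would come from the row-scrubbing loop.
    report.note("row dropped: wrong number of columns");
    report.note("field rewritten: requoted");
    report.note("row dropped: wrong number of columns");

    if verbose {
        report.print();
    }
}
```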