Closed AlexTate closed 8 months ago
Converting to draft to add a couple of other sanity checks to CSVReader:
In addition to the items above, this evening's updates contain some future-proofing package restructuring for CSVReader and more reliability improvements.
Delimiter heuristics fail when the column count is inconsistent between rows, and the resulting delimiter-related exception isn't helpful to users. We want to make a best effort at parsing the header so that the proper exception can be raised. To accomplish this, I've added another routine that guesses the delimiter by removing expected header substrings from the first line and choosing the most common remaining character. This approach can still fail if the wrong CSV is provided AND it has malformed rows.
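For reference, the fallback idea can be sketched roughly as follows. This is a minimal illustration and not the actual CSVReader code: the function name `guess_delimiter`, its parameters, and the header names in the usage note are all hypothetical, and the matching is case-sensitive for simplicity.

```python
from collections import Counter


def guess_delimiter(first_line, expected_headers, default=","):
    """Hypothetical sketch: guess the delimiter from a header line.

    Removes every expected header name from the line, then returns the
    most common character that remains. Falls back to `default` when
    nothing is left to count.
    """
    remainder = first_line.strip("\r\n")

    # Remove longer names first so a name that is a substring of another
    # name doesn't leave fragments behind.
    for name in sorted(expected_headers, key=len, reverse=True):
        remainder = remainder.replace(name, "")

    # Ignore spaces and quote characters. Tabs are kept on purpose,
    # since a tab is a plausible delimiter.
    ignored = set(" \"'")
    leftovers = [ch for ch in remainder if ch not in ignored]

    if not leftovers:
        return default
    return Counter(leftovers).most_common(1)[0][0]
```

With this sketch, a header line such as `name;source;strand` and the expected headers `name`, `source`, and `strand` leaves only semicolons to count. As noted above, the guess is only as good as the expected header list, so a wrong file with malformed rows can still defeat it.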
Additionally, the RuleCounts class now uses CSVReader. This marks the last "rogue" use of csv.DictReader(), which is a win for maintainability.
I wasn't able to reproduce the issue this pull request addresses, but I confirmed that the end-to-end pipeline still works with standard CSV files and generates the expected error messages.
CSV delimiter is now determined on a per-file basis using heuristics provided by the `csv` module. RFC 4180 isn't strictly followed in the real world, and the delimiter character can vary depending on the locale where the document was last edited. For example, locales that use a comma as a decimal separator will commonly use a semicolon delimiter.

The `csv` module uses the term "dialect" to describe the group of formatting attributes that are likely to vary, which includes the delimiter. The heuristics that determine the delimiter can also determine these other attributes on a per-document basis, which sounds great on paper. However, during testing I found that the quote heuristics in particular can produce false results for common situations, for example a Features Sheet with one row in which one field is wrapped in embedded quotes. I tested the CSV in this example by writing it with default dialect in `csv.DictWriter` and reading it with a fully heuristic dialect in our `CSVReader`, which resulted in parsing errors due to the heuristic's false assumptions about quotes. For this reason, I've decided to use the default dialect in `CSVReader` with only the delimiter determined by heuristics.

Closes #328