MontgomeryLab / tinyRNA

tinyRNA provides an all-in-one solution for precision analysis of sRNA-seq data. At the core of tinyRNA is a highly flexible counting utility, tiny-count, that allows for hierarchical assignment of reads to features based on positional information, extent of feature overlap, 5’ nucleotide, length, and strandedness.
GNU General Public License v3.0

tiny-config: CSVReader now determines delimiter through heuristics #329

Closed AlexTate closed 8 months ago

AlexTate commented 8 months ago

CSV delimiter is now determined on a per-file basis using heuristics provided by the csv module. RFC 4180 isn't strictly followed in the real world, and the delimiter character can vary depending on the locale where the document was last edited. For example, locales that use a comma as a decimal separator will commonly use a semicolon delimiter.
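As a rough illustration of per-file delimiter detection, the sketch below uses `csv.Sniffer` from the standard library. The helper name `sniff_delimiter`, the candidate set, and the sample documents are assumptions made for this example and are not taken from tinyRNA's CSVReader.

```python
import csv


def sniff_delimiter(sample: str, candidates: str = ",;\t|") -> str:
    """Guess the delimiter of a CSV sample (hypothetical helper).

    csv.Sniffer raises csv.Error if it cannot settle on a delimiter.
    """
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


# A comma-delimited document, and one saved in a locale that uses
# a comma as the decimal separator and a semicolon as the delimiter.
comma_doc = "name,value\nalpha,1\nbeta,2\n"
semicolon_doc = "name;value\nalpha;1,5\nbeta;2,5\n"

print(repr(sniff_delimiter(comma_doc)))
print(repr(sniff_delimiter(semicolon_doc)))
```

Restricting the candidates keeps the heuristic from settling on a letter or digit that happens to occur once per row.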

The csv module uses the term "dialect" to describe the group of formatting attributes that are likely to vary, which includes the delimiter. The heuristics that determine the delimiter can also determine these other attributes on a per-document basis, which sounds great on paper. However, during testing I found that the quote heuristics in particular can produce incorrect results in common situations, for example a Features Sheet with one row in which one field is wrapped in embedded quotes. I tested this case by writing the CSV with csv.DictWriter using the default dialect and reading it back with a fully heuristic dialect in our CSVReader; parsing failed because of the heuristic's incorrect assumptions about quotes. For this reason, CSVReader uses the default dialect, with only the delimiter determined by heuristics.
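The choice described above, keeping the default dialect and taking only the delimiter from the heuristics, might look roughly like the following. This is a minimal sketch and not the actual CSVReader code: `read_with_sniffed_delimiter` and the column names are hypothetical.

```python
import csv
import io


def read_with_sniffed_delimiter(text: str, candidates: str = ",;\t") -> list:
    """Parse CSV text with the default dialect, sniffing only the delimiter.

    Hypothetical helper for illustration; not tinyRNA's CSVReader.
    """
    sniffed = csv.Sniffer().sniff(text, delimiters=candidates)
    # Discard the sniffed quoting attributes; keep the "excel" defaults
    # and override the delimiter alone.
    reader = csv.DictReader(
        io.StringIO(text), dialect="excel", delimiter=sniffed.delimiter
    )
    return list(reader)


# Semicolon-delimited sheet in which one quoted field contains a comma.
sheet = (
    "Key;Value;Rank\n"
    'Class;"ERGO-1, dependent";1\n'
    "Class;miRNA;2\n"
)
rows = read_with_sniffed_delimiter(sheet)
print(rows[0]["Value"])
```

Because only `delimiter` is overridden, quoting is handled by the same rules that `csv.DictWriter` applies by default, so a file written with the default dialect should read back consistently.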

Closes #328

AlexTate commented 8 months ago

Converting to draft to add a couple of other sanity checks to CSVReader:

AlexTate commented 8 months ago

In addition to the items above, this evening's updates contain some future-proofing package restructuring for CSVReader and more reliability improvements.

The delimiter heuristics fail when the column count is inconsistent between rows, and the resulting delimiter-related exception isn't helpful to users. We want to make a best effort at parsing the header so that the proper exception can be raised instead. To accomplish this, I've added another routine that guesses the delimiter by removing the expected header substrings from the first line and choosing the most common remaining character. This approach can still fail if the wrong CSV is provided AND it has malformed rows.
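The fallback routine could be sketched as below. The function name, the header names, and the decision to ignore quote and space characters are assumptions for illustration. The actual implementation may differ.

```python
from collections import Counter


def guess_delimiter_from_header(first_line: str, expected_headers: list) -> str:
    """Guess the delimiter from what remains of the header line after
    the expected column names are removed (hypothetical helper)."""
    remainder = first_line.rstrip("\r\n")
    # Remove longer names first so that a short name contained in a
    # longer one cannot split it.
    for name in sorted(expected_headers, key=len, reverse=True):
        remainder = remainder.replace(name, "")
    # Assumption: quote characters and spaces are never the delimiter.
    counts = Counter(ch for ch in remainder if ch not in "\"' ")
    if not counts:
        raise ValueError("No candidate delimiter found in header line")
    return counts.most_common(1)[0][0]


expected = ["Key", "Value", "Rank"]
print(repr(guess_delimiter_from_header("Key;Value;Rank\n", expected)))
# One misspelled header still leaves the delimiter as the most common character.
print(repr(guess_delimiter_from_header("Key;Value;Rnk\n", expected)))
```

Because this reads only the first line, inconsistent column counts in later rows do not affect the guess, which lets header validation proceed and raise a more specific error.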

Additionally, the RuleCounts class now uses CSVReader. This removes the last "rogue" use of csv.DictReader(), which is a win for maintainability.

taimontgomery commented 8 months ago

I wasn't able to reproduce the issue this pull request addresses, but I confirmed that the end-to-end pipeline still works with standard CSV files and generates the expected error messages.