dieterich-lab / scimodom

Sci-ModoM: A quantitative database of transcriptome-wide high-throughput RNA modification sites
https://dieterich-lab.github.io/scimodom/
GNU Affero General Public License v3.0

How does the importer handle extra header lines? #142

Closed eboileau closed 2 weeks ago

eboileau commented 3 months ago


The question is, in fact, how the importer handles valid extra header lines.

By mistake, a file was uploaded that contained the same header twice, differing only in the organism (or assembly). The "second" header was wrong, and the upload failed, as if information from the second header had also been used. In this case, I would have expected all comment lines after the "true" header to be ignored.

This is an unlikely scenario, but I believe we should go through the importer tests to make sure we cover general variations in header format and length. In particular, we need to decide whether (and how) to handle the newly added tag #internal_source, which is not meant to be part of the EUF specification.

eboileau commented 2 weeks ago

The importer reads every line starting with #, and treats it as a valid header line if it matches a given regular expression. The importer doesn't care about order, repetition, etc. Because the header is stored in a dictionary, in the unlikely case that header lines are repeated, a later line would overwrite an earlier one and only the last entry would be recorded. But we want to make sure that what comes first is what is used, so the dictionary is updated only if the key is not already present.
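For illustration, here is a minimal sketch of the "first occurrence wins" behaviour described above. It is not the actual importer code: the function name `read_header`, the `#key=value` pattern, and the sample values are assumptions made for this example, and the real regular expression may differ.

```python
import re
from typing import Dict, Iterable

# Hypothetical pattern for a "#key=value" header line.
# The regular expression used by the real importer may differ.
HEADER_PATTERN = re.compile(r"^#([A-Za-z_]+)=(.*)$")


def read_header(lines: Iterable[str]) -> Dict[str, str]:
    """Collect header tags, keeping the first value seen for each key."""
    header: Dict[str, str] = {}
    for line in lines:
        # Only lines starting with '#' are header candidates.
        if not line.startswith("#"):
            continue
        match = HEADER_PATTERN.match(line.strip())
        if match is None:
            # A comment line that is not a valid header line is skipped.
            continue
        key, value = match.group(1), match.group(2).strip()
        # setdefault only writes the value if the key is not already present,
        # so a repeated header line cannot overwrite the first one.
        header.setdefault(key, value)
    return header


# Made-up example input: the organism tag is repeated with a different value.
example_lines = [
    "#organism=9606",
    "#assembly=GRCh38",
    "#organism=10090",
    "chr1\t100\t101",
]
example_header = read_header(example_lines)
```

With this input, `example_header["organism"]` keeps the first value, `"9606"`. A plain assignment (`header[key] = value`) would instead keep the last value, which matches the failure reported in this issue.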