Long fasta headers with white spaces may interfere with downstream processing

Werner0 commented 1 year ago

I downloaded a multi-FASTA file from GenBank and passed it as input to geNomad. I only get the expected results when I rename the FASTA headers in the file.

Rename command: awk '/^>/{print ">Seq"++i; next}{print}' input.fasta > output.fasta

Original header:

>gi|29366675|ref|NC_000866.4| Enterobacteria phage T4, complete genome

New header:

>Seq1

If I don't rename the FASTA headers before running geNomad, the *taxonomy.tsv file created by `genomad end-to-end` is empty. FASTA headers that are longer than 30 characters or contain whitespace can cause bugs in downstream processing: some tools limit the length of the header lines they can handle, and others use whitespace as a delimiter when parsing the header to extract specific information. Headers that exceed these limits or contain whitespace may therefore cause errors or unexpected behavior in downstream tools.
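For anyone applying the same workaround, here is a minimal Python sketch of the renaming step that also keeps a table of new and original headers, so results can be mapped back afterwards. The function name `rename_headers` and the `prefix` parameter are made up for this illustration; this is not part of geNomad.

```python
def rename_headers(lines, prefix="Seq"):
    """Replace every FASTA header with >{prefix}{n}, numbering from 1.

    Returns the rewritten lines and a dict mapping each new ID to the
    original header text (without the leading '>').
    """
    renamed = []
    mapping = {}
    count = 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            new_id = f"{prefix}{count}"
            mapping[new_id] = line[1:].rstrip("\n")
            renamed.append(f">{new_id}\n")
        else:
            # Sequence lines pass through unchanged.
            renamed.append(line)
    return renamed, mapping
```

To use it on a file, read the input with `open("input.fasta").readlines()`, pass the list to `rename_headers`, and write the returned lines to `output.fasta`.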

apcamargo commented 1 year ago

Hi @Werner0.

The length of the FASTA header shouldn't be an issue. That said, geNomad (and most tools) expect that the IDs (that is, the name of the sequence before the first whitespace) are unique.
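A quick way to check the uniqueness requirement before running anything is sketched below. It assumes the ID is the text between `>` and the first whitespace, as described above; the helper name `duplicate_ids` is hypothetical and not part of geNomad.

```python
from collections import Counter


def duplicate_ids(lines):
    """Return the sorted list of sequence IDs that occur more than once.

    The ID is taken as the header text before the first whitespace.
    Headers that are empty after '>' are skipped.
    """
    ids = [
        line[1:].split(None, 1)[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    ]
    counts = Counter(ids)
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)
```

An empty list means every ID is unique under that definition.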

I've executed geNomad on a good chunk of GenBank sequences and never had an issue. Can you provide an example?

Werner0 commented 1 year ago

Hi @apcamargo

Here's a link to the fasta file: https://zenodo.org/record/7922781

apcamargo commented 1 year ago

I found the issue. It is due to the way MMseqs2 parses the headers, which removes everything after the first | when it detects that the header is from RefSeq. This conflicts with the way headers are parsed by the other components.
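To illustrate the kind of mismatch described above, here is a simplified sketch. It is not MMseqs2's or geNomad's actual parsing code: `id_pipe_truncated` is a hypothetical stand-in that just drops everything after the first `|`, following the description in this comment, and `id_whitespace` takes the text before the first whitespace.

```python
def id_whitespace(header):
    """ID as the text between '>' and the first whitespace."""
    return header.lstrip(">").split(None, 1)[0]


def id_pipe_truncated(header):
    """Hypothetical stand-in: drop everything after the first '|'."""
    return id_whitespace(header).split("|", 1)[0]


header = ">gi|29366675|ref|NC_000866.4| Enterobacteria phage T4, complete genome"
a = id_whitespace(header)      # "gi|29366675|ref|NC_000866.4|"
b = id_pipe_truncated(header)  # "gi"
```

When two components derive different IDs for the same sequence like this, results keyed on one ID cannot be joined with results keyed on the other. A header without `|`, such as `>Seq1`, yields the same ID from both functions, which is consistent with the renaming workaround helping.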

I'll release a fix later today or tomorrow. Thank you for your report!

apcamargo commented 1 year ago

I just pushed the new release to PyPI. It should also be available in Bioconda soon.

apcamargo / genomad

Long fasta headers with white spaces may interfere with downstream processing #20