Closed by Werner0 1 year ago
Hi @Werner0.
The length of the FASTA header shouldn't be an issue. That said, geNomad (and most tools) expect that the IDs (that is, the name of the sequence before the first whitespace) are unique.
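A quick way to check the uniqueness expectation before running anything is to count the IDs yourself. This is only a minimal sketch: the helper names `sequence_ids` and `duplicated_ids` are made up for illustration and are not part of geNomad, and the records are invented.

```python
from collections import Counter
import io


def sequence_ids(handle):
    """Yield the ID of each FASTA record: the header text before the first whitespace."""
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split(None, 1)
            # An empty header (">" alone) yields an empty ID.
            yield fields[0] if fields else ""


def duplicated_ids(handle):
    """Return a dict of IDs that occur more than once, with their counts."""
    counts = Counter(sequence_ids(handle))
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


# Invented records: the first and third share an ID but differ in description.
example = io.StringIO(
    ">seq1 first record\nACGT\n"
    ">seq2 second record\nGGCC\n"
    ">seq1 same ID, different description\nTTAA\n"
)
print(duplicated_ids(example))  # {'seq1': 2}
```

To run it on a real file, pass an open file handle instead of the `io.StringIO` object.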
I've executed geNomad on a good chunk of GenBank sequences and never had an issue. Can you provide an example?
Hi @apcamargo
Here's a link to the fasta file: https://zenodo.org/record/7922781
I found the issue. It is due to the way MMseqs2 parses the headers, which removes everything after the first `|` when it detects that the header is from RefSeq. This conflicts with the way headers are parsed by the other components.
I'll release a fix later today or tomorrow. Thank you for your report!
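For anyone curious what that kind of mismatch looks like, here is a simplified illustration. It is not the actual MMseqs2 or geNomad code: both functions are hypothetical stand-ins, the headers are invented, and the sketch truncates every header at `|` rather than trying to reproduce the RefSeq detection.

```python
def id_up_to_whitespace(header):
    """ID as most components read it: the text before the first whitespace."""
    return header.lstrip(">").split(None, 1)[0]


def id_up_to_pipe(header):
    """Simplified stand-in for the truncating parser: keep only the text before the first '|'."""
    return id_up_to_whitespace(header).split("|", 1)[0]


# Hypothetical headers, not taken from the reported file.
headers = [
    ">prefix|recordA some description",
    ">prefix|recordB another description",
]

for header in headers:
    # The two parsers disagree on the ID of the same record.
    print(id_up_to_whitespace(header), "vs", id_up_to_pipe(header))
```

When two parts of a pipeline disagree on the ID like this, results keyed by one form cannot be joined back to sequences keyed by the other, and distinct records can even collapse onto the same truncated ID.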
I just pushed the new release to PyPI. It should also be available in Bioconda soon.
I downloaded a multi-FASTA file from GenBank and passed it as input to geNomad. I only get the expected results when I rename the FASTA headers in the file first.
Rename command:
awk '/^>/{print ">Seq"++i; next}{print}' input.fasta > output.fasta
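One drawback of the awk one-liner is that the original headers are lost. A possible alternative, sketched here in Python, applies the same `Seq1`, `Seq2`, ... naming but also writes a table that maps each new ID back to its original header. The function name `rename_headers` and the tab-separated mapping format are my own choices for illustration, not something geNomad requires.

```python
import io


def rename_headers(src, dst, mapping, prefix="Seq"):
    """Copy FASTA text from src to dst, replacing each header with prefix + counter.

    For every record, a line 'new_id<TAB>original header' is written to mapping.
    Returns the number of records renamed.
    """
    count = 0
    for line in src:
        if line.startswith(">"):
            count += 1
            new_id = f"{prefix}{count}"
            # Keep the full original header (without '>') so it can be restored later.
            mapping.write(f"{new_id}\t{line[1:].rstrip()}\n")
            dst.write(f">{new_id}\n")
        else:
            # Sequence lines are copied unchanged.
            dst.write(line)
    return count


# Small in-memory demonstration with invented records.
src = io.StringIO(">first record with a long header\nACGT\n>second record\nGGCC\n")
dst, mapping = io.StringIO(), io.StringIO()
rename_headers(src, dst, mapping)
print(dst.getvalue(), end="")
print(mapping.getvalue(), end="")
```

With real files, the three arguments would be open file handles, for example `input.fasta` for reading and `output.fasta` plus a mapping file for writing.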
Original header:
New header:
If I don't rename the FASTA headers before running geNomad, the `*taxonomy.tsv` file created by `genomad end-to-end` is empty. FASTA headers that are longer than 30 characters or contain whitespace can cause bugs in downstream processing: some tools limit the length of header lines they can handle, and others use whitespace as a delimiter to parse the header line and extract specific information.