bcgsc / btllib

Bioinformatics Technology Lab common code library
Other
21 stars 5 forks source link

Unexpected SeqReader behaviour with multi-line fasta files #114

Closed lcoombe closed 1 year ago

lcoombe commented 1 year ago

bedtools maskfasta can create multi-line fasta files with a very large maximum number of bases per line. When I try to use SeqReader with these files, I sometimes get this error:

[ERROR] SeqReader: Unexpected character in a FASTA file.

I played around with it a bit, and it seems like SeqReader can miss identifying a fasta file as multi-line depending on the length of the first line:

(synteny) [lcoombe@hpce706 tmp]$ indexlr -k24 -w1000 celegans.fa > test.tsv

(synteny) [lcoombe@hpce706 tmp]$ seqtk seq -l10000 celegans.fa > celegans.split.fa
(synteny) [lcoombe@hpce706 tmp]$ indexlr -k24 -w1000 celegans.split.fa > test.tsv

(synteny) [lcoombe@hpce706 tmp]$ seqtk seq -l10000000 celegans.fa > celegans.split.fa
(synteny) [lcoombe@hpce706 tmp]$ indexlr -k24 -w1000 celegans.split.fa > test.tsv
[2023-09-08 15:33:41][ERROR] SeqReader: Unexpected character in a FASTA file.

Sequence lengths of my test file:

(synteny) [lcoombe@hpce706 tmp]$ cat celegans.fa |bioawk -c fastx '{print $name,length($seq)}'
CHROMOSOME_I    15072421
CHROMOSOME_II   15279323
CHROMOSOME_III  13783681
CHROMOSOME_IV   17493785
CHROMOSOME_V    20919568
CHROMOSOME_X    17718854

So a couple things I'm wondering:

vlad0x00 commented 1 year ago

As far as I remember, this line: https://github.com/bcgsc/btllib/blob/master/include/btllib/seq_reader.hpp#L170 determines how many characters are read to determine the input format type (FASTA, multiline FASTA, FASTQ, ...) You can increase the constant, it just increases memory consumption (per SeqReader object)

lcoombe commented 1 year ago

Ah OK thanks Vlad! I guess we'll have to decide if we want to make that a lot bigger, or if I just re-format the fasta files prior to using btllib... Since there isn't anyway to change that at runtime, right?

vlad0x00 commented 1 year ago

I think the code could be modified to make the number changeable at runtime, but you'd still need to decide what the number would be when you use SeqReader. Implementing the logic where it automatically decides how large the number should be, e.g. by reading at least the full first line is also possible, but might be a tad complicated to implement.

vlad0x00 commented 1 year ago

Aren't multiline FASTAs supposed to have fairly short lines? The point of them is to make it easier to read and so lines longer than something like 120 characters aren't sensible for multiline FASTAs.

lcoombe commented 1 year ago

Yes for sure I can see the complications - especially since I'm accessing SeqReader via indexlr, so it would presumably need to touch that code as well..

Yeah, the issue was super unexpected for me too - I have no idea why bedtools maskfasta has that behaviour. As far as I could tell from their code, they have the max number of bases per line equal to the length of the first sequence. It's odd, and there doesn't seem to be a way to change that.

But yeah if the solution would be too complicated on the btllib side, I can deal with it upstream (ie. adding seqtk as a dependency to make them single-line fastas)