Closed lcoombe closed 1 year ago
As far as I remember, this line: https://github.com/bcgsc/btllib/blob/master/include/btllib/seq_reader.hpp#L170 sets how many characters are read to detect the input format (FASTA, multiline FASTA, FASTQ, ...). You can increase the constant; it just increases memory consumption (per SeqReader object).
Ah OK thanks Vlad! I guess we'll have to decide if we want to make that a lot bigger, or if I just re-format the fasta files prior to using btllib... Since there isn't any way to change that at runtime, right?
I think the code could be modified to make the number changeable at runtime, but you'd still need to decide what the number would be when you use SeqReader. Implementing logic that automatically decides how large the number should be, e.g. by reading at least the full first line, is also possible, but might be a tad complicated to implement.
Aren't multiline FASTAs supposed to have fairly short lines? The point of them is to make it easier to read and so lines longer than something like 120 characters aren't sensible for multiline FASTAs.
Yes, for sure I can see the complications, especially since I'm accessing SeqReader via indexlr, so it would presumably need to touch that code as well.
Yeah, the issue was super unexpected for me too - I have no idea why bedtools maskfasta has that behaviour. As far as I could tell from their code, they have the max number of bases per line equal to the length of the first sequence. It's odd, and there doesn't seem to be a way to change that.
But yeah, if the solution would be too complicated on the btllib side, I can deal with it upstream (i.e. adding seqtk as a dependency to make them single-line fastas).
bedtools maskfasta can create multi-line fasta files with a very large maximum number of bases per line. When I try to use SeqReader with these files, I sometimes get this error:
I played around with it a bit, and it seems like SeqReader can miss identifying a fasta file as multi-line depending on the length of the first line:
Sequence lengths of my test file:
So a couple things I'm wondering: