If a sequence file has a record that has equal bytes/residues on each
line except for the final one, then it will be flagged for fast
subsequence retrieval. This is by design because the final line can
be, and usually is, shorter than all the rest, but if the final line has
more characters than all previous lines it can lead to a
problem. Specifically if you try to fetch a subseq for which all
nucleotides occur past the final rpl'th character of the final line
for that record, you get an error.
I propose a fix to not flag such sequence files for fast subsequence
retrieval if the final line has greater than rpl/bpl residues/bytes.
Admittedly I'm not sure of all the impacts of this change, since
seebuf() is so central, but the tests all pass, and the change
does look innocuous to me.
Example files to reproduce (I had to add the .txt suffices to enable upload...)
ok.fa.txt:
sequence file with 11 nt sequence AAACCCGGGGG named 'seq' with
3 nt on line 1 and 4 on lines 2 and 3.
bug.fa.txt:
sequence file with same sequence as ok.fa with 3 nt on lines 1
and 2 and 5 on line 3.
repro.sh.txt:
script to reproduce the problem that indexes ok.fa and
bug.fa and then tries to fetch the subsequence 11..11. This works for
ok.fa but throws an error for bug.fa, demonstrating the bug.
If a sequence file has a record that has equal bytes/residues on each line except for the final one, then it will be flagged for fast subsequence retrieval. This is by design because the final line can be, and usually is, shorter than all the rest, but if the final line has more characters than all previous lines it can lead to a problem. Specifically if you try to fetch a subseq for which all nucleotides occur past the final rpl'th character of the final line for that record, you get an error.
I propose a fix to not flag such sequence files for fast subsequence retrieval if the final line has greater than rpl/bpl residues/bytes.
Admittedly I'm not sure of all the impacts of this change, since seebuf() is so central, but the tests all pass, and the change does look innocuous to me.
Example files to reproduce (I had to add the .txt suffices to enable upload...)
ok.fa.txt: sequence file with 11 nt sequence AAACCCGGGGG named 'seq' with 3 nt on line 1 and 4 on lines 2 and 3.
bug.fa.txt: sequence file with same sequence as ok.fa with 3 nt on lines 1 and 2 and 5 on line 3.
repro.sh.txt: script to reproduce the problem that indexes ok.fa and bug.fa and then tries to fetch the subsequence 11..11. This works for ok.fa but throws an error for bug.fa, demonstrating the bug.