EddyRivasLab / easel

Sequence analysis library used by Eddy/Rivas lab code
Other
46 stars 26 forks source link

esl_sqio_ascii rare bug: final record line longer than all previous does not invalidate fast subseq retrieval #47

Closed nawrockie closed 4 years ago

nawrockie commented 4 years ago

If a sequence file has a record that has equal bytes/residues on each line except for the final one, then it will be flagged for fast subsequence retrieval. This is by design because the final line can be, and usually is, shorter than all the rest, but if the final line has more characters than all previous lines it can lead to a problem. Specifically if you try to fetch a subseq for which all nucleotides occur past the final rpl'th character of the final line for that record, you get an error.

I propose a fix to not flag such sequence files for fast subsequence retrieval if the final line has greater than rpl/bpl residues/bytes.

Admittedly I'm not sure of all the impacts of this change, since seebuf() is so central, but the tests all pass, and the change does look innocuous to me.

Example files to reproduce (I had to add the .txt suffices to enable upload...)

ok.fa.txt: sequence file with 11 nt sequence AAACCCGGGGG named 'seq' with 3 nt on line 1 and 4 on lines 2 and 3.

bug.fa.txt: sequence file with same sequence as ok.fa with 3 nt on lines 1 and 2 and 5 on line 3.

repro.sh.txt: script to reproduce the problem that indexes ok.fa and bug.fa and then tries to fetch the subsequence 11..11. This works for ok.fa but throws an error for bug.fa, demonstrating the bug.

nawrockie commented 4 years ago

Also, let me know if you want me to add a script to the testsuite that reproduces the bug.fa/ok.fa example.

cryptogenomicon commented 4 years ago

Looks right. No need to add a test to the testsuite for this one, I think.