EddyRivasLab / easel

Sequence analysis library used by Eddy/Rivas lab code
Other
46 stars 26 forks source link

Fixes bug in esl_sqio_ascii.c::loadbuf() related to ESL_SQASCII->is_recording #19

Closed nawrockie closed 7 years ago

nawrockie commented 7 years ago

Ioanna Kalvari of Rfam reported a esl-seqstat seg fault to me on the attached file (orig-fail.fa.txt). The seg fault goes away if --rna or --dna is used. The bug was in esl_sqio_ascii.c::sqfile_GuessAlphabet(), and was a fun one to track down. The bug only causes a problem if the ESL_SQASCII object has 'is_recording' on and 'is_linebased' off, which only occurs in one place in all of Easel/HMMER/Infernal as far as I can tell, in sqfile_GuessAlphabet().

Further, the bug only shows itself if the input file is a single sequence, and in a file of more than 4096 bytes and less than 4000 nucleotides/residues. I've also attached fail.fa.txt and pass.fa.txt which differ by a single nucleotide, but fail.fa.txt causes the seg fault and pass.fa.txt does not.

The problem occurs when a second memory/buffer load is necessary when 'is_recording' is on. That is only necessary if the first memory load of 4096 bytes (eslREADBUFSIZE) does not include the entire first (and only) sequence and that sequence is less than 4000 nucleotides. That is somewhat rare because it will only happen if there's >= 96 non-nucleotide characters in the sequence file (e.g. long name and description), and the file meets the special size/length requirements.

I think the one line fix I'm proposing is the correct one. I've convinced myself that it makes sense: the position of the next buffer to load ('mpos') should not be incremented by 'mn' (number of characters in 'mem') but rather by 'nc' (number of characters in the buffer). 'mn' and 'nc' will be different if the previous fread() hit EOF.

Note that if 'is_recording' is off, the loadmem() call a few lines above the bug fix will have set 'mpos' to 0 and 'mn' to n (number characters read by fread()), so in loadbuf() with the bug fix 'nc' becomes n ('mn' - 'mpos'), and so the bug fix code change has no effect because 'mn' and 'nc' are equal to the same value (n). (This only makes sense if staring at both the loadmem() code and loadbuf() proposed bug fix, and even then it took me several minutes to write.)

All that said, I feel a little light headed after tracing down this problem, so I wouldn't be too surprised if some of my reasoning isn't quite right. Please take a look at let me know what you think.

orig-fail.fa.txt pass.fa.txt fail.fa.txt

nawrockie commented 7 years ago

I did not add anything to unit tests or exercises because I'm not sure what the convention is for adding tests for fixed bugs in Easel. There's no BUGTRAX, right? Let me know what you want and I'll make up a test.

cryptogenomicon commented 7 years ago

Wow.

On Apr 5, 2017, at 9:45 PM, Eric Nawrocki notifications@github.com wrote:

Ioanna Kalvari of Rfam reported a esl-seqstat seg fault to me on the attached file (orig-fail.fa.txt). The seg fault goes away if --rna or --dna is used. The bug was in esl_sqio_ascii.c::sqfile_GuessAlphabet(), and was a fun one to track down. The bug only causes a problem if the ESL_SQASCII object has on and off, which only occurs in one place in all of Easel/HMMER/Infernal as far as I can tell, in sqfile_GuessAlphabet().

Further, the bug only shows itself if the input file is a single sequence, and in a file of more than 4096 bytes and less than 4000 nucleotides/residues. I've also attached fail.fa.txt and pass.fa.txt which differ by a single nucleotide, but fail.fa.txt causes the seg fault and pass.fa.txt does not.

The problem occurs when a second memory/buffer load is necessary when is on. That is only necessary if the first memory load of 4096 bytes (eslREADBUFSIZE) does not include the entire first (and only) sequence and that sequence is less than 4000 nucleotides. That is somewhat rare because it will only happen if there's >= 96 non-nucleotide characters in the sequence file (e.g. long name and description), and the file meets the special size/length requirements.

I think the one line fix I'm proposing is the correct one. I've convinced myself that it makes sense: the position of the next buffer to load () should not be incremented by (number of characters in ) but rather by (number of characters in the buffer). and will be different if the previous fread() hit EOF.

Note that if 'is_recording' is off, the loadmem() call a few lines above the bug fix will have set to 0 and to n (number characters read by fread()), so in loadbuf() with the bug fix becomes n ( - ), and so the bug fix code change has no effect because and are equal to the same value (n). (This only makes sense if staring at both the loadmem() code and loadbuf() proposed bug fix, and even then it took me several minutes to write.)

All that said, I feel a little light headed after tracing down this problem, so I wouldn't be too surprised if some of my reasoning isn't quite right. Please take a look at let me know what you think.

orig-fail.fa.txt pass.fa.txt fail.fa.txt

You can view, comment on, or merge this pull request online at:

https://github.com/EddyRivasLab/easel/pull/19

Commit Summary

• Fixes bug in esl_sqio_ascii.c::loadbuf() that only occurs if ascii->is_recording is on and not line-based, which only occurs in esl_sqio_ascii.c::sqascii_GuessAlphabet. Further input file being read must be a single sequence of more than 4096 bytes but with less than 4000 residues. 4096 is buffer read size, and 4000 is window size for guessing the alphabet. File Changes

• M esl_sqio_ascii.c (2) Patch Links:

https://github.com/EddyRivasLab/easel/pull/19.patchhttps://github.com/EddyRivasLab/easel/pull/19.diff — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

cryptogenomicon commented 7 years ago

Confirmed, and your fix looks correct. I've added a unit test that exercises this bug.