EddyRivasLab / easel

Sequence analysis library used by Eddy/Rivas lab code
Other
46 stars 26 forks source link

Fix faulty `xr_tag` allocation in `esl_sq_Copy` #75

Closed althonos closed 5 months ago

althonos commented 5 months ago

Hi!

After getting some segfaults with PyHMMER for unknown reasons I started a round of memory check with Valgrind, and found some invalid writes in the HMMER and Easel code. I'll try to address them one at a time if I can trace them back.

==268312== Invalid write of size 1
==268312==    at 0x484C3DE: strcpy (vg_replace_strmem.c:561)
==268312==    by 0x6B1CCBD: UnknownInlinedFun (string_fortified.h:79)
==268312==    by 0x6B1CCBD: esl_sq_Copy (esl_sq.c:339)
==268312==    by 0x6B4CDFF: sqascii_Read (esl_sqio_ascii.c:772)
==268312==    by 0x6AD7A22: __pyx_f_7pyhmmer_5easel_12SequenceFile_readinto.lto_priv.0 (easel.c:85494)
==268312==    by 0x6AD0BF4: __pyx_f_7pyhmmer_5easel_12SequenceFile_read.lto_priv.0 (easel.c:85169)
==268312==    by 0x6AC5C43: UnknownInlinedFun (easel.c:84134)
==268312==    by 0x6AC5C43: __pyx_pw_7pyhmmer_5easel_12SequenceFile_17__next__.lto_priv.0 (easel.c:84112)
==268312==    by 0x4AB0D14: list_extend (listobject.c:966)
==268312==    by 0x4AE4151: UnknownInlinedFun (listobject.c:2790)
==268312==    by 0x4AE4151: list_vectorcall (listobject.c:2815)
==268312==    by 0x4A7B236: UnknownInlinedFun (pycore_call.h:92)
==268312==    by 0x4A7B236: PyObject_Vectorcall (call.c:299)
==268312==    by 0x4A6D6E0: _PyEval_EvalFrameDefault (ceval.c:4769)
==268312==    by 0x4AB3E9E: UnknownInlinedFun (pycore_ceval.h:73)
==268312==    by 0x4AB3E9E: UnknownInlinedFun (ceval.c:6434)
==268312==    by 0x4AB3E9E: UnknownInlinedFun (call.c:393)
==268312==    by 0x4AB3E9E: UnknownInlinedFun (pycore_call.h:92)
==268312==    by 0x4AB3E9E: method_vectorcall (classobject.c:89)
==268312==    by 0x4A716A2: UnknownInlinedFun (ceval.c:7352)
==268312==    by 0x4A716A2: _PyEval_EvalFrameDefault (ceval.c:5376)

Here I got some invalid writes in esl_sq_Copy while parsing a file. It turns out that the allocation of extra residue tags is using the nalloc even though the tags may contain free text. This will lead to underallocation in the case were an xr_tag is longer than the name stored in the sequence. This can be seen for instance by attempting to parse stockholm.good.1: there is a sequence named seq6 with markups Invented_tag and Another_tag.

To fix this, I changed the xr_tag allocation to use strlen and allocate with the true string length.

cryptogenomicon commented 5 months ago

Thanks, I agree!