Ensembl / Bio-DB-HTS

Git repo for Bio::DB::HTS module on CPAN, providing Perl links into HTSlib
Apache License 2.0

Multiple, conflicting, index memory release attempts may lead to segfault #67

Closed. avullo closed this issue 6 years ago.

avullo commented 6 years ago

Some glitches in the way index memory is released.

From @jmarshall:

"Running the t/03cram.t tests under valgrind shows a double-free bug (the cram index is freed via both hts_close and hts_idx_destroy), present also on the master branch and with any version of HTSlib, that occasionally leads to a segfault (only once in my attempts to reproduce this). I suspect it is this that caused the failed travis run, which ran the t/03cram.t tests twice: once successfully (via ./Build test) and once with a segfault (via travisci/harness.sh)."

avullo commented 6 years ago

@jmarshall I'm trying to reproduce with:

$ valgrind --leak-check=yes prove -v t/03cram.t

and get:

    ...
    ==28053== 7,199 bytes in 98 blocks are definitely lost in loss record 2,905 of 3,101
    ==28053==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==28053==    by 0x476074: Perl_safesysmalloc (in /home/avullo/perl5/perlbrew/perls/perl-5.16.0/bin/perl)
    ==28053==    by 0x476ADE: Perl_savepv (in /home/avullo/perl5/perlbrew/perls/perl-5.16.0/bin/perl)
    ==28053==    by 0x42ED4A: Perl_newXS_len_flags (in /home/avullo/perl5/perlbrew/perls/perl-5.16.0/bin/perl)
    ==28053==    by 0x42F020: Perl_newCONSTSUB_flags (in /home/avullo/perl5/perlbrew/perls/perl-5.16.0/bin/perl)
    ==28053==    by 0x43B466: Perl_gv_init_pvn (in /home/avullo/perl5/perlbrew/perls/perl-5.16.0/bin/perl)
    ==28053==    by 0x43CB20: Perl_gv_fetchpvn_flags (in /home/avullo/perl5/perlbrew/perls/perl-5.16.0/bin/perl)
    ==28053==    by 0x43F715: Perl_gv_fetchsv (in /home/avullo/perl5/perlbrew/perls/perl-5.16.0/bin/perl)
    ==28053==    by 0x420E26: Perl_ck_rvconst (in /home/avullo/perl5/perlbrew/perls/perl-5.16.0/bin/perl)
    ==28053==    by 0x42359B: Perl_newUNOP (in /home/avullo/perl5/perlbrew/perls/perl-5.16.0/bin/perl)
    ==28053==    by 0x45005A: Perl_yylex (in /home/avullo/perl5/perlbrew/perls/perl-5.16.0/bin/perl)
    ==28053==    by 0x457724: Perl_yyparse (in /home/avullo/perl5/perlbrew/perls/perl-5.16.0/bin/perl)
    ==28053==
    ==28053== LEAK SUMMARY:
    ==28053==    definitely lost: 8,130 bytes in 115 blocks
    ==28053==    indirectly lost: 0 bytes in 0 blocks
    ==28053==    possibly lost: 0 bytes in 0 blocks
    ==28053==    still reachable: 7,222,301 bytes in 95,206 blocks
    ==28053==    suppressed: 0 bytes in 0 blocks
    ==28053== Reachable blocks (those to which a pointer was found) are not shown.
    ==28053== To see them, rerun with: --leak-check=full --show-leak-kinds=all
    ==28053==
    ==28053== For counts of detected and suppressed errors, rerun with: -v
    ==28053== ERROR SUMMARY: 6 errors from 6 contexts (suppressed: 0 from 0)

From this, I cannot trace back to the double free bug you've mentioned.

jmarshall commented 6 years ago

Scroll back and you will see Invalid read of size 8 after test number 54. It's a lot easier if you hide all the Perl-related leaks with valgrind --suppressions=perl.supp where that file is something like the attached perl.supp.

avullo commented 6 years ago

Thanks for the hint, that's very useful, but I still cannot see the "Invalid ... " string. Am I running it with the correct arguments?

jmarshall commented 6 years ago

The problem here is that, for historical reasons (i.e., 'cos that's how it was written (outwith htslib)), CRAM files store their indices inside the cram_fd rather than in a separate hts_idx_t. So for CRAM files, the hts_idx_t underlying the Bio::DB::HTS::Index is a little fake struct that just points to the associated cram_fd. See samtools/htslib@655edabb50d6da6a6be1e474e371ab325be19461 and samtools/htslib@81b173e6b0c1801d7330118041799dbe29fdf8e8.

It is understood that in C code you need to be careful to destroy the index before you close the htsFile. This is more difficult when there is a garbage collector involved, rather than the programmer explicitly destroying things…

It might be possible to add $something = undef to t/03cram.t to ensure the garbage collector destroys the Bio::DB::HTS::Index first. However the appropriate something isn't really exposed to this level of the code, and this only fixes it for this one instance.

It would probably be possible to add a reference to the Bio::DB::HTSfile to the Bio::DB::HTS::Index object to get the objects destroyed in the right order. Either in the XS code (non-trivial) or in the Perl wrappers (hopefully easier).

Or this could probably be fixed in HTSlib by having the cram_fd contain a linked list of associated hts_cram_idx_t structs, maintained in sam_index_load and hts_idx_destroy, that cram_close() would use to zero out all the fake indices (with fake_idx->cram = NULL) when the file is closed. This would mean that C code didn't have to ensure the ordering of destruction either. @jkbonfield might comment on whether this would be worthwhile for the C level of things.

jmarshall commented 6 years ago

Hmmm… I've added a summary of what I see with valgrind to #66.

jkbonfield commented 6 years ago

The alternative is perhaps to punt cram_index out of cram_fd and into its own top-level structure (cram_index + index_sz fields). This makes it compatible with the way that bam works and it seems less of a hack than keeping track via linked lists.

It does mean a few functions need changing though; for example, cram_seek_to_refpos would need an additional index parameter. Rather oddly, I see that function has the same prototype declared twice, along with others! Some borked squash and merge, maybe. It's a bit unclear, but they all turned up at the same time in one of the mammoth overhauls in the early days.

I don't think there was ever a specific reason why it differed other than parentage.

jmarshall commented 6 years ago

@jkbonfield: If htslib wanted to make that bigger change, that'd certainly be the way to go rather than compounding the hts_cram_idx_t hack by adding a linked list.

In the meantime, us Perl suckers should look further into telling the Perl reference counting about this link between the objects.

jkbonfield commented 6 years ago

Agreed. Both really need fixing: htslib for future sanity and this perl wrapper for compatibility with existing and old htslib releases. Switching the order around shouldn't be too hard I hope.

avullo commented 6 years ago

@jmarshall I'm naively experimenting with the simplest approach: in XS before calling hts_close, sam_index_load the index and then hts_idx_destroy on it. I haven't seen any problem so far, could you try that on your side as well? issue-67.patch.txt

The only thing is a warning emitted by the htslib layer during 08bamwrite.t, since we are reloading the index after the original data file has been updated.

jmarshall commented 6 years ago

@avullo Isn't that going to fail horribly if the file is not indexed?

avullo commented 6 years ago

And so it should, whenever the main Perl module accesses the hts_index attribute: https://github.com/Ensembl/Bio-DB-HTS/blob/master/lib/Bio/DB/HTS.pm#L2051

which happens in the fetch/pileup/coverage methods, shouldn't it?!

jmarshall commented 6 years ago

That DESTROY function is still called if you don't use fetch/pileup/coverage, e.g. if you write your own code like that in #54 — that code doesn't use/need an index. However you are in luck: hts_idx_destroy(NULL) works and does nothing.

But anyway the patch doesn't help: I still get the same Invalid read error in t/03cram.t test 54/55 in Perl 5.10.1.

avullo commented 6 years ago

Exactly. I assumed that passing a null pointer to hts_idx_destroy would simply be a no-op, as it should be, so that would cover the SAM case as well.

Given that this doesn't help anyway, I wonder whether the extra burden of dealing with indexes in HTSfile is worth it.

jmarshall commented 6 years ago

It would be possible to fold the index into Bio::DB::HTSfile rather than having a separate Perl object, but it would be a big change. Happily the burden of fixing the refcounting turns out to be lower than I feared — see PR #68.