Ensembl / Bio-DB-HTS

Git repo for Bio::DB::HTS module on CPAN, providing Perl links into HTSlib
Apache License 2.0
24 stars 16 forks source link

htslib 1.3.1 (and older) non-compatible header_read for SAM #62

Closed avullo closed 6 years ago

avullo commented 6 years ago

I've recently addressed #54, which involved seeking at the beginning of the buffer not only for BAM/CRAM but for SAM files as well, using a dedicated function call (hseek).

This works well for htslib >= 1.5, but generates segmentation faults for v1.3.1 and probably before.

I could see the hseek code in htslib has changed since then: there are some guards acting in special circumstances in the newer versions using hFILE attributes which are not defined in the older versions, but cannot say this is the cause for sure.

This is a problem which is preventing deployment of the upcoming v2.10 in the ensembl web environment, as they rely on htslib 1.3.1 (to support GRCh37 with samtools): the SAM file tests exit unexpectedly so they are deemed to fail. Obviously, this is affecting any installation which ties to htslib < 1.5.

Am I calling hseek correctly with htslib < 1.5.? https://github.com/Ensembl/Bio-DB-HTS/blob/master/lib/Bio/DB/HTS.xs#L496 I cannot see why not.

The current workaround has just the goal of avoiding seg faults and make the tests pass with htslib 1.3.1 and older. Unfortunately, this exposes #54 again in these cases.

keiranmraine commented 6 years ago

@avullo possibly worth asking on htslib to see if they can help. The repo follows/stars suggest you may catch someones attention here but it's more likely to get noticed by a bigger audience with relevant skills.

avullo commented 6 years ago

Thanks @keiranmraine you're right, asking directly htslib might be much more effective.

jmarshall commented 6 years ago

The problem here is that using hseek() or bgzf_seek() when you have a htsFile* is not something that HTSlib supports. In particular, third-party code should not be using code like htsfile->fp.hfile to try to extract the underlying hFILE* — the note on typedef … htsFile in htslib/hts.h is supposed to tell you this, but I guess it's not very explicit. Prior to samtools/htslib@71c03b88edf892a93eae7bb7932a6da84858480a, htsfile-> fp.hfile was not merely unsupported, it was an incorrect way to get at the hFILE* for a SAM file, which is why your code is crashing with HTSlib ≤ 1.3.1.

See also the discussion in this mailing list posting.

The sad fact is that the only supported way of using hts_open() and seeking around in a SAM/BAM/CRAM file is to use the iterator you get from sam_itr_queryi() et al. Until HTSlib gets an hts_seek() function as discussed in that mailing list thread, you can only do sequential access or use iterators. Extracting the underlying hFILE* yourself and using hseek() might seem to work under limited circumstances, but it's unsupported and might screw up the htsFile's buffering at any time.

What this code should be doing is limiting itself to reading sequentially from the htsFile*.

You need to ensure that sam_hdr_read() gets called exactly once. Basically everything needs to use Bio::DB::HTS::header() (or the header stashed in $self->{header}) rather than Bio::DB::HTS::header_read(), and the function that actually calls sam_hdr_read() needs to become a private internal-only function. Then the _check format and do bgzfseek or hseek code in hts_header_read and bami_coverage can be removed.

avullo commented 6 years ago

Many thanks @jmarshall for the enlightening and exhaustive explanation. Pardon I didn't pay attention to notes in the htsFile declaration; I was just trying to mirror the BAM behaviour and I had assumed this was a legitimate way to go in the other case.

This is now sufficiently clear and I can take appropriate action. I've put a KNOWN BUGS section in the main POD warning the user about this case and another one.

cheers John

avullo commented 6 years ago

This cannot be addressed without an entire rewrite of the logic with which to access headers. This, alongside a possible solution, is captured by #65