blachlylab / dhtslib

D bindings and OOP wrappers for htslib
MIT License
7 stars 1 forks source link

AllRecordsRange misses first record #59

Closed jblachly closed 3 years ago

jblachly commented 3 years ago

First record somehow gets burned and never returned wiht foreach

Demonstrated with an integration test, push incoming

jblachly commented 3 years ago

Is this the problem?

https://github.com/blachlylab/dhtslib/blob/428266a6b5516493a9a9770be689b260d80a67f9/source/dhtslib/sam/package.d#L1092

If so I am having a hard time following why

jblachly commented 3 years ago

Yep, that's the problem. However, fixing it makes 3 of 8 unittests fail ...

jblachly commented 3 years ago

This may also be a samtools/htslib sam_read1 bug ...

[I::test.samreader.main] Test AllRecordsRange iterator
empty
front

popFront
empty
front
HS18_09653:4:1308:11522:27107
popFront
empty
front
HS18_09653:4:2314:14991:85680
jblachly commented 3 years ago

So the use of SAMFile in the test script prior to AllRecordsRange to do queries causes the first call to sam_read1 to return the second record ...

charlesgregory commented 3 years ago

Might need to call an htslib function to make sure file iteration is reset when calling allRecords.

On Mar 8, 2021, at 9:04 PM, James S Blachly, MD notifications@github.com wrote:



So the use of SAMFile in the test script prior to AllRecordsRange to do queries causes the first call to sam_read1 to return the second record ...

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHubhttps://github.com/blachlylab/dhtslib/issues/59#issuecomment-793266239, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACH5L7MDPGLRTNBZXNFYKJDTCV64HANCNFSM4Y2JPFRQ.

jblachly commented 3 years ago

Looking for it now, but doesn't appear to be antyhing along the lines of "seek" etc.

jblachly commented 3 years ago

@charlesgregory found it.

It is actually RecordRange with naughty behavior (calling the iterator upon range initialization in the ctor without first checking that the iterator has results) coupled with the immediately preceding test being a test for query with zero results. The extra call to sam_itr_next when itr.finished is actually true, but hasn't been checked, consumes the first read of the bamfile and advances the fp.

https://github.com/blachlylab/dhtslib/blob/428266a6b5516493a9a9770be689b260d80a67f9/source/dhtslib/sam/package.d#L1154-L1163

i'm fixing this in a series of commits

jblachly commented 3 years ago

Further note to self, in fact the last read of a query iterator always consumes an read (apparently) when it returns -1 (i.e., consumes a read that's not returned in the results set), and sam_read1 always picks up after where the iterator left off.

will need probably to directly manipulate the fp to seek(0) in case of AllRecordsRange

jblachly commented 3 years ago

Fixed in fcb0047b73e1b8c8eef7e9631891272be7597c2d but loses ability to call allRecords successfully on uncompressed .sam