MaciekAber / pysam

Automatically exported from code.google.com/p/pysam
0 stars 0 forks source link

Fetch Iterators are not independent #35

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Create two iterators on the same bamfile and region
2. fetch them intermingled
3. compare results

What is the expected output? What do you see instead?
They should deliver the same output.
They read positions fetched differ.

What version of the product are you using? On what operating system?
81ec15fe64a3 Ubuntu 64 bit 10.4

Please provide any additional information below.

Original issue reported on code.google.com by finkerna...@mathematik.uni-marburg.de on 23 Jun 2010 at 3:17

Attachments:

GoogleCodeExporter commented 8 years ago
Workaround is of course to use list(samfile.fetch(...)), but that negates the 
benefit of having an iterator in the first place.

Please note that I haven't formally checked what happens if you fetch different 
regions - but the discrepancies that led me to this bug fetched a lot of 
different regions (potentially overlapping) multi threaded, so I except the bug 
to be present as well.

Original comment by finkerna...@mathematik.uni-marburg.de on 23 Jun 2010 at 3:30

GoogleCodeExporter commented 8 years ago
Wild guess without actually inspecting the code:
I use multiprocessing and fork after creating the Samfile and see inconsistent 
reading of reads. 
That leads me to predict that there's a missing seek in the code since I guess 
the filehandle's what's actually shared between the processes.

Original comment by finkerna...@mathematik.uni-marburg.de on 23 Jun 2010 at 5:02

GoogleCodeExporter commented 8 years ago
Hi, 

this sounds like valid bug, but I can't reproduce it. I added the unit test 
below, but all seems fine. Could you help me reproducing the bug?

Thanks,
Andreas

class TestDoubleFetch(unittest.TestCase):
    '''check if two iterators on the same bamfile are independent.'''

    def testDoubleFetch( self ):

        samfile1 = pysam.Samfile('ex1.bam', 'rb')
        samfile2 = pysam.Samfile('ex1.bam', 'rb')

        for a,b in zip(samfile1.fetch(), samfile2.fetch()):
            self.assertEqual( a, b)

    def testDoubleFetchWithRegion( self ):

        samfile1 = pysam.Samfile('ex1.bam', 'rb')
        samfile2 = pysam.Samfile('ex1.bam', 'rb')

        for a,b in zip(samfile1.fetch( "chr1", 200, 300), samfile2.fetch( "chr1", 200, 300)):
            self.assertEqual( a, b)

Original comment by andreas....@gmail.com on 18 Jul 2010 at 6:23

GoogleCodeExporter commented 8 years ago
Sorry Andreas, I'm afraid I could have worded my problem better.

Concise:
a) Do not create two SamFile objects.
b) Do create two iterators on the same SamFile.

In code (had to extend the region for my example bam file):
class TestDoubleFetch(unittest.TestCase):
    '''check if two iterators on the same bamfile are independent.'''

    def testDoubleFetch( self ):

        samfile1 = pysam.Samfile('ex1.bam', 'rb')

        for a,b in zip(samfile1.fetch(), samfile1.fetch()):
            self.assertEqual( a, b)

    def testDoubleFetchWithRegion( self ):

        samfile1 = pysam.Samfile('ex1.bam', 'rb')
        chr, start, stop = 'chr1', 200, 3000000
        self.assertTrue(len(list(samfile1.fetch ( chr, start, stop))) > 0) #just making sure the test has something to catch

        for a,b in zip(samfile1.fetch( chr, start, stop), samfile1.fetch( chr, start, stop)):
            self.assertEqual( a, b )

Original comment by finkerna...@mathematik.uni-marburg.de on 19 Jul 2010 at 7:49

GoogleCodeExporter commented 8 years ago
actually you were perfectly clear, I misunderstood.

I will check how to fix this.

Thanks,
Andreas

Original comment by andreas....@gmail.com on 19 Jul 2010 at 7:53

GoogleCodeExporter commented 8 years ago
actually you were perfectly clear, I misunderstood.

I will check how to fix this.

Thanks,
Andreas

Original comment by andreas....@gmail.com on 19 Jul 2010 at 7:53

GoogleCodeExporter commented 8 years ago
Hi,

The iterator currently assumes that the file position between calls 
remains unchanged. I see the following solutions:

1. I could do a tell/seek in the iterator. Also, I would need to 
implement some locking, because the seek, read, tell sequence has
to be atomic. This will need to be done properly (threading?) and 
possibly slow down iteration.

2. When creating an iterator, give it its own file handle. I.e., the file
gets opened multiple times. Also, there might be some overhead in reading 
indices,
etc.

3. Do nothing and leave this as a gotcha in the library.

I am tending towards solution 2.

Cheers,
Andreas

Original comment by andreas....@gmail.com on 23 Jul 2010 at 8:15

GoogleCodeExporter commented 8 years ago
I would lean to number 2 as well. Number 1 is complicated to do right (with 
multiprocessing and threading...),
number 3 is not an option either, because it silently fails on the api's 
contract.

Having a file handle for each iterator doesn't seem to be much overhead - they 
are pretty cheap, but then, not knowing the code, I don't see which indices 
you're talking about that wouldn't be reparsed already?

Original comment by finkerna...@mathematik.uni-marburg.de on 23 Jul 2010 at 12:45

GoogleCodeExporter commented 8 years ago
Hi,

fixed (fingers crossed) using option 2.

Please get the latest version 0.3-dev. Note that I have also switched
to samtools 0.1.8

Best wishes,
Andreas

Original comment by andreas....@gmail.com on 23 Jul 2010 at 3:31