Alter code to use pysam's bam indexing to speed up parsing.

MikeDacre commented 8 years ago

Per @carloartieri:

pysam's value comes from it being able to handle indexed BAM files. You can restrict the analysis directly to reads overlapping SNPs (if you want) and then use the pysam.read object to extract the variant calls directly and much more rapidly.

Possible issue:

indexed bams are sorted by coordinate, and mate pairs are not guaranteed to be adjacent in the file. Is it possible to split the file for multiplexing despite this?

MikeDacre commented 8 years ago

It turns out that you can use the mate method of Samfile to get the mate of any read:

sfile = pysam.Samfile('LD0001_star_nodups.bam')
read = next(sfile)
mate = sfile.mate(mate)

If read is unpaired, it throws a ValueError

carloartieri commented 8 years ago

A few things:

Yes the indexing makes handling mates troublesome. One possible way to handle it is to have a pre-filtering step that randomly picks one of two mapped mates, but I'd have to think about how to implement this efficiently. Another possibility would be to keep a running tally of mates 'touched' and pick one at random - this wouldn't necessarily take up a lot of RAM as you would only have to keep an element in this list until you'd 'seen' both mates.
Apparently the 'mate' method is somewhat erratic as it can change the current position of the index in the file. It's not ideal for use if you're iterating through reads (rather it's intended to get the mate for a specific read). I haven't tested whether this actually creates problems.
Alternatively, it's possible to just iterate through an unindexed BAM file (sorted by mates) in order by doing: for read in pysam.fetch(until_eof=True):. This would allow the script to operate as it currently does, but on BAM files, and still allow access to all of the 'mapped segment' functions.

petercombs commented 8 years ago

Am I wrong that the only step that really cares about which way the input file is sorted is the splitting step? If so, it could be enough of a speed savings to use the indexed file to avoid the need for multiplexing at all.

Alternatively, it seems like using the indexed bam file doesn't actually lead to any speedups—and in fact slows it down by something like 60 fold (see commit 659c4736fe ). Maybe I'm doing something really bad with the implementation, but based on the profiling, it's either gonna be really easy to fix or not at all.

TheFraserLab / ASEr

Alter code to use pysam's bam indexing to speed up parsing. #2