TheFraserLab / ASEr

Get ASE counts from BAMs or raw fastq data -- repackage of pipeline by Carlo Artieri
MIT License
6 stars 3 forks source link

Alter code to use pysam's bam indexing to speed up parsing. #2

Open MikeDacre opened 8 years ago

MikeDacre commented 8 years ago

Per @carloartieri:

pysam's value comes from it being able to handle indexed BAM files. You can restrict the analysis directly to reads overlapping SNPs (if you want) and then use the pysam.read object to extract the variant calls directly and much more rapidly.

Possible issue:

MikeDacre commented 8 years ago

It turns out that you can use the mate method of Samfile to get the mate of any read:

sfile = pysam.Samfile('LD0001_star_nodups.bam')
read = next(sfile)
mate = sfile.mate(mate)

If read is unpaired, it throws a ValueError

carloartieri commented 8 years ago

A few things:

  1. Yes the indexing makes handling mates troublesome. One possible way to handle it is to have a pre-filtering step that randomly picks one of two mapped mates, but I'd have to think about how to implement this efficiently. Another possibility would be to keep a running tally of mates 'touched' and pick one at random - this wouldn't necessarily take up a lot of RAM as you would only have to keep an element in this list until you'd 'seen' both mates.
  2. Apparently the 'mate' method is somewhat erratic as it can change the current position of the index in the file. It's not ideal for use if you're iterating through reads (rather it's intended to get the mate for a specific read). I haven't tested whether this actually creates problems.
  3. Alternatively, it's possible to just iterate through an unindexed BAM file (sorted by mates) in order by doing: for read in pysam.fetch(until_eof=True):. This would allow the script to operate as it currently does, but on BAM files, and still allow access to all of the 'mapped segment' functions.
petercombs commented 8 years ago

Am I wrong that the only step that really cares about which way the input file is sorted is the splitting step? If so, it could be enough of a speed savings to use the indexed file to avoid the need for multiplexing at all.

Alternatively, it seems like using the indexed bam file doesn't actually lead to any speedups—and in fact slows it down by something like 60 fold (see commit 659c4736fe ). Maybe I'm doing something really bad with the implementation, but based on the profiling, it's either gonna be really easy to fix or not at all.