Fetch over entire genes...

I was thinking about it a little more and the idea of 'fetching' reads by SNP isn't going to work. The reason is that SNPs are frequently near enough to one another that reads can span more than one. While iterating over SNPs, this will re-fetch the same reads and lead to double-counting unless you store a list of already-counted SNPs in memory. Regardless, it will be relatively inefficient.

Therefore it makes more sense to accept the list of SNPs and a GTF file, then parse the GTF to fetch only over whole genic regions at a time. Going exon-by-exon would work, as long as each fetch stipulated that reads overlapping the previous exon would be excluded. For example, it's straighforward to create a column in a pandas dataframe, sorted by chrom+position, that notes the end position of the previous element in the table.

TheFraserLab / ASEr

Fetch over entire genes... #13