TheFraserLab / ASEr

Get ASE counts from BAMs or raw fastq data -- repackage of pipeline by Carlo Artieri
MIT License
6 stars 3 forks source link

Fetch over entire genes... #13

Open carloartieri opened 8 years ago

carloartieri commented 8 years ago

I was thinking about it a little more and the idea of 'fetching' reads by SNP isn't going to work. The reason is that SNPs are frequently near enough to one another that reads can span more than one. While iterating over SNPs, this will re-fetch the same reads and lead to double-counting unless you store a list of already-counted SNPs in memory. Regardless, it will be relatively inefficient.

Therefore it makes more sense to accept the list of SNPs and a GTF file, then parse the GTF to fetch only over whole genic regions at a time. Going exon-by-exon would work, as long as each fetch stipulated that reads overlapping the previous exon would be excluded. For example, it's straighforward to create a column in a pandas dataframe, sorted by chrom+position, that notes the end position of the previous element in the table.