TheFraserLab / ASEr

Get ASE counts from BAMs or raw fastq data -- repackage of pipeline by Carlo Artieri
MIT License
6 stars 3 forks source link

Skipping of Indels in CountSNPASE #15

Open petercombs opened 8 years ago

petercombs commented 8 years ago

So I get why we want to skip over indels—they're much more likely to have trouble mapping correctly. But the way it's working now, it actually seems more likely to have a bias, since it's looking in the cigar string for indels, but that would only show up in the cigar string for one of the alleles, leaving the other alone. The SNP itself doesn't get used because presumably we were smart enough to not include indels in the bed file, but if there's an indel next to a SNP, that might introduce bias.

In this toy example, the red is reference, and the blue has a 1bp deletion. In the "truth" the expression is unbiased, but ASEr would call it 2:1, since the blue read that falls across both the SNP and the indel gets thrown out for having a deletion, but the red one doesn't.

screen shot 2016-06-20 at 3 40 16 pm

My intuition would be that the way to fix this is to come up with a list of indels, and then check to see if each read falls across one, and throw it out if so. However, some quick thought while procrastinating on writing a fellowship proposal did not suggest to me any obvious ways to do this efficiently.

rmagoglia commented 8 years ago

The easiest way to avoid this issue is to filter your list of SNPs to remove those that are within one read length of an indel. If you have a bed file of your SNPs and a separate bed file of your indels (snps.bed and indels.bed), a simple way is this:

For each SNP find the closest indel: bedtools closest -t first -a snps.bed -b indels.bed -d > snps.closest

If the closest INDEL is closer than read length (say 125bp), throw it out: awk '{if ($11 > 125) print $1 "\t" $2 "\t" $3 "\t" $4 "|" $5}' snps.closest > snps.indelsPurged.bed

This also formats the resulting snp file as "Chrom 0-Position 1-Position Ref|Alt"