aryarm / as_analysis

A complete Snakemake pipeline for detecting allele specific expression in RNA-seq
MIT License

modify our use of WASP snakemake pipeline to save space #34

Closed aryarm closed 6 years ago

aryarm commented 6 years ago

In the find_intersecting_snps step, WASP separates reads that overlap a SNP from those that don't. Later, the reads that didn't overlap a SNP are merged back into a final BAM file along with the filtered reads. However, allele-specific analysis only cares about the reads that overlap a SNP. Could we drop the merging step so that reads that don't overlap a SNP are discarded instead? This would save space and speed up downstream analysis, since only a small percentage of reads actually overlap a SNP in most samples.
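To make the idea concrete, here is a minimal sketch of the partition in plain Python. It is not WASP's implementation: the tuple layout for reads, the `snps_by_chrom` dictionary, and both function names are hypothetical, and it treats each read as a single half-open interval (ignoring spliced alignments and read pairs).

```python
from bisect import bisect_left


def overlaps_snp(snp_positions, start, end):
    """Return True if any sorted SNP position p satisfies start <= p < end."""
    i = bisect_left(snp_positions, start)
    return i < len(snp_positions) and snp_positions[i] < end


def split_reads(reads, snps_by_chrom):
    """Partition read names into (overlapping, non_overlapping) lists."""
    keep, discard = [], []
    for name, chrom, start, end in reads:
        positions = snps_by_chrom.get(chrom, [])
        if overlaps_snp(positions, start, end):
            keep.append(name)
        else:
            discard.append(name)
    return keep, discard


# Hypothetical toy data: SNP positions must be sorted per chromosome.
snps_by_chrom = {"chr1": [100, 250, 900]}
reads = [
    ("r1", "chr1", 90, 140),   # covers the SNP at 100
    ("r2", "chr1", 300, 350),  # no SNP in range
    ("r3", "chr1", 850, 901),  # covers the SNP at 900
    ("r4", "chr2", 100, 150),  # chromosome has no SNPs
]
keep, discard = split_reads(reads, snps_by_chrom)
print(keep, discard)
```

Under the proposal, only the `keep` set would be carried forward; the `discard` set is what the merging step currently adds back.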

aryarm commented 6 years ago

Attached is a diff between counts files generated by the new method (from a BAM file that contains only reads that overlap SNPs) and the old method (from a BAM file that contains all of the reads). Both were generated from Jurkat test data.

Interestingly, the counts file generated from the larger BAM file has smaller read counts for some SNPs. Perhaps this has something to do with how rmdup_pe randomly chooses which read to keep? I'd have to repeat this test a few more times before drawing any conclusions.
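As a rough illustration of why random duplicate selection could shift counts between runs, here is a toy simulation. It is only a sketch of the hypothesis, not rmdup_pe's code: the `dedup_allele_counts` function, the "ref"/"alt" labels, and the duplicate groups are all made up for the example.

```python
import random


def dedup_allele_counts(duplicate_groups, seed):
    """Keep one randomly chosen read per duplicate group and tally its allele."""
    rng = random.Random(seed)
    counts = {"ref": 0, "alt": 0}
    for group in duplicate_groups:
        counts[rng.choice(group)] += 1
    return counts


# Hypothetical duplicate groups at a single SNP; each entry is the allele
# carried by one read in the group.
groups = [["ref", "alt"], ["ref", "ref", "alt"], ["alt"], ["ref"]]

# Different seeds stand in for separate runs of the deduplication step.
results = [dedup_allele_counts(groups, seed) for seed in range(20)]
for counts in results[:5]:
    print(counts)
```

The total number of kept reads is always one per group, but the split between alleles depends on which read happens to be picked, so two runs on different inputs need not agree exactly.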

Overall, however, I think the counts file from the new method is a good approximation of the old one.
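For repeating the comparison, something like the following could summarize how two counts files differ. This is a sketch under assumptions: the counts are taken to be already loaded into dictionaries mapping a SNP identifier to a `(ref_count, alt_count)` tuple, which may not match the pipeline's actual file format, and `compare_counts` and the numbers below are invented for illustration.

```python
def compare_counts(old, new):
    """Summarize per-SNP differences (new minus old) between two count tables."""
    shared = sorted(old.keys() & new.keys())
    changed = {}
    for snp in shared:
        delta = (new[snp][0] - old[snp][0], new[snp][1] - old[snp][1])
        if delta != (0, 0):
            changed[snp] = delta
    return {
        "shared": len(shared),
        "only_old": sorted(old.keys() - new.keys()),
        "only_new": sorted(new.keys() - old.keys()),
        "changed": changed,
    }


# Made-up example values, not results from the Jurkat test data.
old = {"chr1:100": (10, 8), "chr1:250": (5, 5), "chr1:900": (3, 7)}
new = {"chr1:100": (10, 8), "chr1:250": (6, 5), "chr1:900": (3, 7)}

summary = compare_counts(old, new)
print(summary)
```

Running this over several repeated runs would show whether the same SNPs differ each time or whether the differences move around.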