aryarm / as_analysis

A complete Snakemake pipeline for detecting allele specific expression in RNA-seq
MIT License

prepare_counts uses too much memory #38

Closed aryarm closed 6 years ago

aryarm commented 6 years ago

New strategy: create a separate prepare_counts that uses less memory by performing its operations per sample instead of on all samples at once. The new prepare_counts will require the gq_file to be split per sample, so let's do that too.
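A rough sketch of the per-sample split, written in Python for illustration rather than in the pipeline's R. The layout of the gq_file is not described in this thread, so this assumes a tab-delimited table whose first column is the rsID and whose remaining columns hold one value per sample. The function name, the `gq` column name, and the `.gq.tsv` suffix are all made up here. It streams the input row by row, so memory use stays flat regardless of file size.

```python
import csv
import os


def split_gq_by_sample(gq_path, out_dir):
    """Write one two-column (ID, value) file per sample column.

    ASSUMPTION: gq_path is a tab-delimited table with a header row,
    the variant ID in column 0 and one column per sample after it.
    """
    os.makedirs(out_dir, exist_ok=True)
    handles = {}
    try:
        with open(gq_path, newline="") as src:
            reader = csv.reader(src, delimiter="\t")
            header = next(reader)
            samples = header[1:]
            writers = []
            for sample in samples:
                path = os.path.join(out_dir, sample + ".gq.tsv")
                fh = open(path, "w", newline="")
                handles[sample] = fh
                writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
                writer.writerow([header[0], "gq"])
                writers.append(writer)
            # Stream the input: only one row is held in memory at a time.
            for row in reader:
                for writer, value in zip(writers, row[1:]):
                    writer.writerow([row[0], value])
    finally:
        for fh in handles.values():
            fh.close()
    return {s: os.path.join(out_dir, s + ".gq.tsv") for s in samples}
```

One open file handle per sample is fine for tens or hundreds of samples. For many more than that, the operating system's open-file limit would become the constraint.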

aryarm commented 6 years ago

Before splitting, consider removing indels from the gq_file. This will probably make everything run a lot faster.
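A minimal sketch of what dropping indels could look like, again in Python for illustration. It assumes each row carries `ref` and `alt` allele fields (hypothetical names, with multiple alternates comma-separated as in VCF) and keeps a row only when every allele is a single base.

```python
BASES = {"A", "C", "G", "T"}


def is_snv(ref, alt):
    """True when the reference and every alternate allele is one base."""
    return ref.upper() in BASES and all(
        allele.upper() in BASES for allele in alt.split(",")
    )


def drop_indels(rows):
    """Lazily yield only the rows that describe single-base variants.

    ASSUMPTION: each row is a dict with 'ref' and 'alt' keys.
    """
    for row in rows:
        if is_snv(row["ref"], row["alt"]):
            yield row
```

Because `drop_indels` is a generator, it can sit in front of the split step without loading the whole file.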

aryarm commented 6 years ago

One problem with that approach: proc_counts() should really only be run once, since its result will be the same for every sample, i.e. you should only need to find which genes overlap SNPs once.

Perhaps this could be its own script that writes its output to a file with rsID as the identifier. Then prepare_counts_parallel.r could read that file for each sample and merge along rsID?
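To make the idea concrete, here is a small Python sketch of the "compute once, merge per sample" shape. It is not what proc_counts() does internally (that isn't shown in this thread). The function names, tuple layouts, and 1-based inclusive gene coordinates are all assumptions, and the overlap is a naive linear scan where a real script would likely use an interval-based method.

```python
def genes_by_rsid(snps, genes):
    """Build the rsID -> overlapping genes table once, for all samples.

    ASSUMPTION: snps yields (rsid, chrom, pos) and genes yields
    (gene, chrom, start, end) with 1-based inclusive coordinates.
    """
    by_chrom = {}
    for gene, chrom, start, end in genes:
        by_chrom.setdefault(chrom, []).append((start, end, gene))
    table = {}
    for rsid, chrom, pos in snps:
        hits = [g for s, e, g in by_chrom.get(chrom, []) if s <= pos <= e]
        if hits:
            table[rsid] = hits
    return table


def merge_sample(sample_counts, table):
    """Inner-join one sample's counts with the shared gene table on rsID.

    ASSUMPTION: sample_counts maps rsid -> (ref_count, alt_count).
    """
    return [
        (rsid, gene, ref, alt)
        for rsid, (ref, alt) in sample_counts.items()
        for gene in table.get(rsid, [])
    ]
```

The table from `genes_by_rsid` is the part that would be written to a file keyed by rsID. Each per-sample job then only needs the cheap `merge_sample` step.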

aryarm commented 6 years ago

Also, things would be faster if we could filter the GQ files by SNPs.

aryarm commented 6 years ago

Update: this may be a good idea, but it isn't strictly necessary right now and would probably require rewriting quite a bit of the pipeline.

You may want to come back to this, but it isn't a huge deal right now. For now, just overlap genes with SNPs for every sample.