Closed by aryarm 6 years ago
Before splitting, consider removing indels from the gq_file. Dropping them early should shrink the input and will probably make every later step run a lot faster.
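A rough sketch of what "removing indels" could look like, written in Python purely for illustration (the pipeline itself is R). It assumes each gq_file record exposes `ref` and `alt` allele fields, which is a guess about the file layout; `is_snp` and `drop_indels` are hypothetical helper names.

```python
def is_snp(ref, alt):
    """True when ref and every comma-separated alt allele is a single base."""
    alleles = [ref] + alt.split(",")
    return all(len(a) == 1 and a.upper() in "ACGT" for a in alleles)


def drop_indels(rows):
    """Keep only rows whose alleles describe a SNP.

    `rows` is an iterable of dicts with assumed keys "ref" and "alt".
    """
    return [row for row in rows if is_snp(row["ref"], row["alt"])]
```

Anything with a multi-base allele (an insertion or deletion) is discarded, so only SNP rows reach the splitting step.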
One problem with that approach: proc_counts() should really only be run once, since it will be the same for every sample, i.e. you should only need to find which genes overlap SNPs once. Perhaps this could be its own script, and then it could write output to a file with rsID as the identifier. Then prepare_counts_parallel.r can read that file for each sample and perform a merge along rsID?
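To make the idea concrete, here is a minimal sketch of the "compute once, merge per sample" pattern. It is Python for illustration only, not the pipeline's R code. The function names, the two-column `rsID`/`gene` file, and the shape of the per-sample rows are all assumptions.

```python
import csv


def write_snp_gene_map(pairs, path):
    """Write (rsID, gene) pairs to a tab-separated file. Run this once."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(["rsID", "gene"])
        writer.writerows(pairs)


def read_snp_gene_map(path):
    """Load the shared file into a dict: rsID -> list of overlapping genes."""
    genes = {}
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            genes.setdefault(row["rsID"], []).append(row["gene"])
    return genes


def merge_sample(sample_rows, snp_genes):
    """Inner join of one sample's rows with the SNP-gene map on rsID."""
    merged = []
    for row in sample_rows:
        for gene in snp_genes.get(row["rsID"], []):
            merged.append({**row, "gene": gene})
    return merged
```

The overlap step runs once and each sample only pays for a lookup by rsID. SNPs that overlap no gene are dropped by the join, and a SNP overlapping several genes yields one row per gene.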
Also, things would be faster if we could filter GQ files by SNPs.
Update: this may be a good idea, but it isn't really necessary right now and would probably require rewriting a large part of the pipeline.
You may want to come back to this, but it isn't a huge deal right now. Just overlap genes with SNPs for every sample for now.
New strategy: create a separate prepare_counts that uses less memory and performs its operations per sample instead of all at once. The new prepare_counts will require the gq_file to be split per sample, so let's do that too.
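One possible way to split the gq_file per sample without loading it all into memory, sketched in Python for illustration (the pipeline itself is R). It assumes the gq_file is tab-separated with a header row and a column naming the sample. The column name `sample` and the function `split_by_sample` are hypothetical.

```python
import csv
import os


def split_by_sample(gq_path, out_dir, sample_col="sample"):
    """Stream gq_path row by row and write one TSV per sample into out_dir.

    Returns the sorted list of sample names that were seen.
    """
    os.makedirs(out_dir, exist_ok=True)
    handles, writers = {}, {}
    try:
        with open(gq_path, newline="") as fh:
            reader = csv.DictReader(fh, delimiter="\t")
            for row in reader:
                sample = row[sample_col]
                if sample not in writers:
                    # First row for this sample: open its file, write the header.
                    out = open(os.path.join(out_dir, f"{sample}.tsv"), "w", newline="")
                    handles[sample] = out
                    writers[sample] = csv.DictWriter(
                        out, fieldnames=reader.fieldnames, delimiter="\t"
                    )
                    writers[sample].writeheader()
                writers[sample].writerow(row)
    finally:
        for out in handles.values():
            out.close()
    return sorted(handles)
```

Only one row is held in memory at a time, though one file handle stays open per sample, so a very large number of samples could hit the OS limit on open files.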