chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0

memory usage #7

Closed lurebgi closed 6 years ago

lurebgi commented 7 years ago

Hi,

I am analyzing polymorphism data (VCF files) from a bird species (genome size ~1.2 Gb). I split the genome into 10 parts, but the memory usage still reached ~30 GB. Do you have any idea how to split or process the input files to reduce the memory usage?

Thanks a lot!

Best, Luohao

singing-scientist commented 7 years ago

Dear Luohao,

Thanks so much for using SNPGenie! Unfortunately, I do not have plans to speed up the actual algorithm at this time. Assuming that your input data are in the form described—a VCF file with SNP data for a single reference FASTA—I think one good approach is to split up the genome, as you have done, probably by chromosome. If this does not work, then you could try smaller subsections; this is actually quite easy to do and to automate, since you can extract a range (a-b) of sites from the FASTA and then pull out variants from the VCF for only those sites. Another approach would be to target specific genomic regions of interest.
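As a rough illustration of the subsection approach, here is a minimal Python sketch (not part of SNPGenie) that extracts a site range from a single-sequence FASTA and keeps only the VCF records falling inside that range, shifting their POS values into the subregion's coordinates. The function name and the single-sequence assumption are my own; adapt as needed for multi-chromosome files.

```python
def extract_region(fasta_lines, vcf_lines, start, end):
    """Extract sites start..end (1-based, inclusive) from a single-sequence
    FASTA and the VCF records that fall inside that range.

    Returns (fasta_out, vcf_out) as lists of lines (no trailing newlines).
    """
    # Rebuild the sequence from the FASTA body lines.
    header = fasta_lines[0].strip()
    seq = "".join(line.strip() for line in fasta_lines[1:])
    sub = seq[start - 1:end]

    # Tag the header with the range and re-wrap the subsequence at 60 cols.
    fasta_out = [f"{header}_{start}_{end}"]
    fasta_out += [sub[i:i + 60] for i in range(0, len(sub), 60)]

    vcf_out = []
    for line in vcf_lines:
        if line.startswith("#"):
            vcf_out.append(line.rstrip("\n"))  # keep all VCF header lines
            continue
        fields = line.rstrip("\n").split("\t")
        pos = int(fields[1])
        if start <= pos <= end:
            # Shift POS so it is relative to the extracted subregion.
            fields[1] = str(pos - start + 1)
            vcf_out.append("\t".join(fields))
    return fasta_out, vcf_out
```

Running SNPGenie on each (sub-FASTA, sub-VCF) pair then bounds memory by the subregion size rather than the whole chromosome. Note that if you split mid-gene, codons spanning the boundary will be affected, so splitting at gene or chromosome boundaries is safest.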

Another program to try is PoPoolation, which might be faster if you turn OFF its corrections; however, it makes various approximations and essentially assumes all variant positions in the raw reads are bona fide SNPs, i.e., it does not take advantage of SNP calling. Also, if your input data are not deep (pooled) sequencing of a single sample but rather a summary of many genomes, PoPoolation is not applicable.

Please let me know if any of this is helpful! Apologies for the memory difficulties. Chase