BoPeng / simuPOP

A general-purpose forward-time population genetics simulation environment.
http://bopeng.github.io/simuPOP/
GNU General Public License v2.0

Filter Monomorphic Sites #39

Status: Closed (snunzi closed this issue 7 years ago)

snunzi commented 7 years ago

Hi Bo, I am currently filtering monomorphic sites out of my simulated sequences before export, and it is extremely slow because of the large number of individuals and the very large amount of sequence data. I have been using the code below, which works for small data sets but not for my larger ones. Do you have any suggestions for filtering out monomorphic sites more efficiently? Many thanks.

-Schyler

    thresh_hi = 0.999999
    thresh_lo = 0.000001
    lociToRemove = [l for l in xrange(pop.totNumLoci())
                    if pop.dvars().alleleFreq[l][0] > thresh_hi
                    or pop.dvars().alleleFreq[l][0] < thresh_lo]
    pop.removeLoci(lociToRemove)
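For illustration only, here is a minimal Python 3 sketch of the same threshold test written as a standalone function that reads each locus's frequency once per iteration. The function name `monomorphic_loci` and the toy dictionary are hypothetical; the dictionary merely stands in for `pop.dvars().alleleFreq`, and treating a missing allele-0 entry as frequency 0 is an assumption of the sketch. Whether the repeated lookups are the actual bottleneck for a large population is not established in this thread.

```python
def monomorphic_loci(allele_freq, num_loci, lo=0.000001, hi=0.999999):
    """Return indices of loci whose allele-0 frequency lies outside [lo, hi].

    allele_freq maps locus index -> {allele: frequency}; it is a stand-in
    for pop.dvars().alleleFreq (an assumption of this sketch).
    """
    remove = []
    for locus in range(num_loci):
        # Look the frequency up once per locus; a missing entry counts as 0.
        freq0 = allele_freq.get(locus, {}).get(0, 0.0)
        if freq0 > hi or freq0 < lo:
            remove.append(locus)
    return remove


# Toy data: loci 0 and 2 are fixed for one allele, loci 1 and 3 are polymorphic.
toy_freq = {
    0: {0: 1.0},
    1: {0: 0.4, 1: 0.6},
    2: {1: 1.0},
    3: {0: 0.75, 1: 0.25},
}
to_remove = monomorphic_loci(toy_freq, 4)
print(to_remove)  # [0, 2]
```

With a real population, the idea would be to fetch `pop.dvars().alleleFreq` into a local variable once, pass it in together with `pop.totNumLoci()`, and hand the resulting list to `pop.removeLoci()`.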

BoPeng commented 7 years ago

Could you give me some details, such as which allele module you are using (short, long, etc.), the estimated numbers of individuals, chromosomes, and loci, and the fraction of monomorphic sites? If you only have two alleles, the binary module could help; if you have mostly rare variants, the mutant module could help. To switch between modules, you can save the population and load it in another module.

BoPeng commented 7 years ago

Also, you could potentially skip the filtering step altogether by exporting only the genotypes at specified loci...

snunzi commented 7 years ago

Binary mode helped a lot, thank you!