BoPeng / simuPOP

A general-purpose forward-time population genetics simulation environment.
http://bopeng.github.io/simuPOP/
GNU General Public License v2.0
31 stars 12 forks source link

mutation rate #40

Closed Amelia-uoft closed 7 years ago

Amelia-uoft commented 7 years ago

Hi, I have a question about the mutation rate. I used simuPOP to simulate rare variants on 50kb, 100kb, and 300kb region separately. From our simulation results, we found that the mutation rate (probability to observe minor allele at single locus) increases when the size of the region increases. That is to say, 50kb region has lower mutation rate than 100kb region, and 100kb region has lower mutation rate than 300kb region. I am wondering whether such difference is designed/implied by the simulation model used in simuPOP. Thank you!

BoPeng commented 7 years ago

What mutation operator do you use?

Amelia-uoft commented 7 years ago

Sorry. I am not sure what is mutation operator. But my code to simulate the data is as below: python srv.py --gui=batch --selDist=gamma3 --regions=chr1:1..100000 --popFile=s_g3.pop --markerFile=s_g3.map --mutantFile=s_g3.mut

BoPeng commented 7 years ago

I see, are you using srv from here? This script by default uses a diallelic mutation model which should yield even mutations across the region. Anyway,

  1. By default a finite sited mutation model is used in which a mutant could be mutated back to wildtype allele. However, what you observed could potentially be explained by the alternative infinite-site model where repeated mutations would be ignored, thus causing lower observed mutation rates at smaller regions.

  2. How do you measure mutation rate? Do you simply count number of mutations in the mut file? In this case the mutations are final result of the evolutionary process, not the mutations that have been introduced to the population during evolution. According to this line of code, if you use option --verbose 2, you could get a mutations.lst file with all the mutations.

Amelia-uoft commented 7 years ago

My 100kb region was simulated 2years ago, 50kb and 300kb were simulated in Nov 2016. Do you think this problem is because of the changes of "srv" that has been made in the last two years?

After obtaining my simulated data, I only keep variants with MAF less than 1% and greater than 0.05%. Also, I coded each locus as 0 or 1, 0 means two major alleles, 1 means at least one minor allele (we call this as one mutation). The mutation rate is calculated based on each gene, which is number of mutations divided by number of sites in the region.

I am not sure whether our observed difference in the mutation rate is related to your model setting or not.

BoPeng commented 7 years ago

I can only say that in theory the mutation rates should not be correlated with length of regions. If you really would like to resolve this, you can

  1. Use --verbose 2 to re-run the simulation to get the raw mutations from mutations.lst. This will let us know if the mutations are generated correctly. You could use finite and infinite-site models to see if there is a difference.

  2. If mutations are generated correctly and you still observe the differences that you described. You can try to disable natural selection and see if gene length and/or genetic drift is causing the observation, and then restore natural selection. You might find something that can explain the results.

Amelia-uoft commented 7 years ago

Thanks for your suggestions!