Theta0, popt, and divergence Times

btmartin721 commented 4 years ago

Hello,

I am having an issue plotting the GADMA results. Mainly, the divergence times are pretty far off where I expected them to be. I am not 100% sure that I am doing it correctly, so I have explained what I did below. Could you please let me know if I am doing something incorrectly?

I ran GADMA with the default theta0 (1.0). Per the GADMA manual, I then tried to re-scale popt based on the new theta0. However, theta0 seems really small compared to the examples in the manual. My calculated theta0 (4 u L) is:

theta0 = 0.00041795, where u = 6.93972E-10 and L = 150564.2833
I calculated u per the GADMA manual as well, where the estimated divergence time is ~18.6 million years, generation time is 10 years, and the sequence divergence (Dxy, averaged among all ddRAD loci, calculated using DNAsp) = 0.002581689.
The parameters for the best GADMA model (3 populations) are:

s0, nu21, nu22, nu31, nu32, s1, nu41, nu42, nu43, t1, t2, t3, m1_12, m1_21, m2_12, m2_21, m3_12, m3_13, m3_21, m3_23, m3_31, m3_32 = params

I have tried using the popt parameters as found in the best_aic_model_moments_code.py file (with theta0 not changed and = 1.0), and I got a divergence time that is too long (~39 million years) when using popt with theta0 = 1.0.
However, if I try to re-scale the popt parameters based on theta0 = ~0.0004, I get impossible divergence times. I tried re-scaling popt like so:

s0 and s1 stayed the same  
nu and t parameters got divided by 0.0004 (alpha = 0.0004 / 1.0 = 0.0004)
m parameters got multiplied by 0.0004 (alpha = 0.0004 / 1.0 = 0.0004)

The divergence time was then >800 million years, which is not possible.

When plotting the results, I calculated Nref by doing:
- theta = moments.Inference.optimal_sfs_scaling(model, data)
- Nref = theta / theta0
- I then plotted using moments.ModelPlot.plot_model() with the above Nref (=693039.355637993).

Is there anything that I am doing wrong or missing?

My params, extra_params, GADMA.log, python, and input FS files are attached in case you need them.

output_files.zip

DS-MX-ON.zip

Thanks for your time.

-Bradley

noscode commented 4 years ago

Dear Bradley,

Thank you for your interest to GADMA, I hope we will solve your problem. You have tried a lot of things and it looks that you have done everything right. I have some doubts about your L, why it is not an integer? L is the length of sequence that was used for AFS build. It is very small in your case so check that it is in base pairs and it is length of sequence (not the number of SNP's).

I have also one idea that model may be a little misspecified. Have you tried 2,1,1 structure? Have you tried to infer two populations? It is faster and it is possible to estimate Nanc. I tried to run for two populations with 1,1 structure and I could not say that there was a great change. So I would probably check out value of L and u.

btmartin721 commented 4 years ago

Hi Ekaterina,

I haven't tried 2,1,1 structure. I can give that a try. Would it alternatively be reasonable to try adding an upper bound on the splits? E.g., adding an upper bound of 18.6 million years for split1?

I calculated L in the following way:

L ~ effective sequence length = (total sequence analyzed) x (SNPs retained for use in moments) / (total SNPs in analyzed sequence)

where total sequence analyzed was calculated as:

average length of each ddRAD locus  * number of ddRAD loci (14809) = 87.14639746 bp * 14809 loci = 1290551 bp

SNPs retained for use in moments was 1722 after filtering out non-biallelic SNPs, filtering out sites with >50% missing data per population, and filtering out sites with >50% missing data overall.

The total SNPs analyzed in the sequence was 14760 (~the number of ddRAD loci because I took one SNP per locus)

So, this adds up to:

L ~ effective sequence length ~ (total sequence analyzed) x (SNPs retained for use in moments) / (total SNPs in analyzed sequence) = 1290551 * 1722 / 14760 = 150564.2833 bp

u was calculated per the GADMA manual, as:

u =   Dxy * generation time / (2 * divergence time in years) = 0.002581689 * 10 years / (2 * 18600800) = 6.93972E-10

With the divergence time being estimated using IQ-TREE2 (calibrated with fossils) as ~18.6 million years, the generation time being 10 years, and the sequence divergence (Dxy) = 0.002581689, estimated using DNAsp.

Does all this sound reasonable? Or am I miscalculating something?

Thanks again for your time and help.

-Bradley

noscode commented 4 years ago

Dear Bradley,

It sounds quite perfect. I am afraid that it maybe a problem with estimations of sequence length in ddRADseq and divergence time. Have you tried to find existing mutation rate or you have a non model organism? Or maybe there are other estimators of divergence time in addition to IQ-TREE2?

It looks like you have a not a big number of SNP's, it is not bad but it may influence on the inference so try 2,1,1 structure. If you have time you could try (2,1) structure for two populations as well.

I have one more suggestion about translation of the parameters: you could estimate size of an ancestral population (Nanc) or other known parameter (like your time of divergence) and rescale parameters according to it. We have done it for some research especially when it is RADseq and no known mutation rate. In GADMA paper it was done for butterflies: parameters were scaled according to size of one population that was known.

Ekaterina

btmartin721 commented 4 years ago

I do have a non-model organism, but I could look into what may be the closest mutation rate. There are full genomes published for a closely related genus that may help. I can also try 2,1,1 and 2,1 structure to see if that will help. And I'll look at the GADMA paper for rescaling based on Nanc. Hopefully one or more of these solutions will help.

One final thought: One of the taxa in the FS file I sent you only has 3 individuals (6 alleles) and I down-projected the other taxa to 8 alleles [6,8,8], so I'm wondering if this low ns may limit the capabilities of GADMA to accurately estimate the split time. I actually just had a run finish where the ns were much larger [36,34,8]. In this case, only the first taxon was different. The divergence time should also have been 18.6 mya, and GADMA estimated it to be 13 mya. So perhaps the better ns helps.

In any case, I'll give the solutions you mentioned a try. I greatly appreciate your time and help!

-Bradley

noscode commented 4 years ago

Dear Bradley,

You are totally right, the bigger size of spectrum is the better inference will be. So if there is bigger ns then use it, it will be more reliable indeed. On the other hand the accuracy of method depends not on the size of spectrum but on the number of segregating sites. easySFS could help to find tradeoff between size and number of segregating sites. GADMA will probably find different divergence time to the one that you have got from IQ-TREE2. If you have got 13 mya then it I think it looks fantastic! Keep me updated.

Best regards, Ekaterina

btmartin721 commented 4 years ago

Ok great. I really appreciate it! I did use easySFS to find the optimal down-projections, but the problem with the taxon that has only 3 individuals is that it is an exceptionally rare subspecies. So 3 samples is all I had :)

Thanks again, and many thanks for responding to my questions so quickly!

-Bradley

noscode commented 4 years ago

You are always welcome! 3 samples is okay if it is only for one population. Good luck and feel free to ask further questions.

Best regards, Ekaterina

ctlab / GADMA

Theta0, popt, and divergence Times #9