bpp / bpp

Bayesian analysis of genomic sequence data under the multispecies coalescent model
GNU Affero General Public License v3.0
92 stars 27 forks source link

question on msci format #166

Open svedwards opened 2 years ago

svedwards commented 2 years ago

Dear Tomas,

I am curious if the introgression probability phi can be estimated or if one must specify it in the msci treefile.

I made a file that ran successfully: (((CP,T[&phi=0.200000])H,CP_We)X, ((CM,H[&phi=0.100000])T,CM_Ne)Y)R;

Does this mean that phi in both directions is specified beforehand? Is there a way to estimate phi from the sequence data?

I noticed the posterior distributions of the two phi in the output were quite flat - image attached, also with control file. Is this expected or is there not enough information in my data?

Thanks,

Scot Cpic_mel_4pop_bpp_mcmc_summary_gene_flow.pdf Cpic_mel_4pop_bpp_HKY_locvar_msci1_ctl.txt t

xflouris commented 2 years ago

Hello Scott, just to double-check, you are have the species tree:

((CP,CP_We)X,(CM,CM_Ne)Y)R;

with bidirectional introgression between edges X-CP and Y-CM. The bidirection is labeled as H<->T (where H is parent to CP and T is parent to CM).

The two phi values you specified in the newick notation are just initial values. Phis are estimated from sequence data using mcmc. They are introgression probabilities indicating the proportion of introgressed genes, and so the estimates will be between 0 and 1.

Do multiple runs converge? Your dataset has many sequences which is good, but you are only using 14 loci. To accurately estimate introgression probabilities we observed that typically a large number of loci is required.

Best regards, Tomas

svedwards commented 2 years ago

Hi Tomas -

Many thanks for responding and sorry for the long delay (I still haven't figure out how to have Github notify me of a response!).

Ok, your response makes perfect sense and you interpreted the tree and my goals correctly. Here are two runs that I believe are from this analysis. In general this data set of 14 loci has delivered fairly stable estimates of the parameters across runs:

msci1 run1 run 2
mean lower upper mean lower upper theta_1CP 0.005322829 0.003455 0.007392 0.00532967 0.003468 0.007389 theta_2CP_We 0.006414966 0.004107 0.008901 0.006439514 0.004172 0.008943 theta_3CM 0.010305196 0.007466 0.013311 0.010347639 0.007477 0.013333 theta_4CM_Ne 0.001468425 0.000859 0.002133 0.00148852 0.000855 0.002155 theta_5R 0.001074296 0.000347 0.001901 0.001098919 0.00036 0.001927 theta_6X 0.003341582 0.002105 0.00469 0.003318373 0.00209 0.004637 theta_7H 0.001600508 0.000193 0.0034 0.001527911 0.000131 0.003264 theta_8Y 0.004167763 0.00272 0.005742 0.004118664 0.002669 0.005685 theta_9T 0.001654701 0.000181 0.003495 0.001709837 0.000153 0.003629 tau_5R 0.002436824 0.001832 0.003083 0.002403857 0.001797 0.003015 tau_6X 0.00038903 0.000277 0.000502 0.000392602 0.000281 0.000512 tau_7H 0.000366112 0.000254 0.000474 0.000372717 0.000261 0.000489 tau_8Y 0.0003748 0.000262 0.000486 0.000383054 0.00027 0.000502 tau_9T 0.000366112 0.000254 0.000474 0.000372717 0.000261 0.000489 phi_H 0.065159104 0.015703 0.121852 0.065716812 0.01569 0.122828 phi_T 0.032622804 0.006042 0.064495 0.032789748 0.005992 0.065125

Thanks again for your advice and clarification.

Scott

svedwards commented 2 years ago

I think the result I sent you first was an aberrant run. In other analyses the posterior of phi is quite normal (attached; red is phiH, green phiT). I think I was confused by the specification of starting values for phi in the ctl tree file. Your response confirms that phi is indeed estimated. I will run again just to make sure since two other runs I have look fine with no flat posterior on phi as I sent earlier. Thanks again, Scott PhiH_T_run2_posterior.pdf