Sendrowski / fastDFE

Fast and flexible inference of the distribution of fitness effects (DFE), VCF-SFS parsing with ancestral allele and site-degeneracy annotation.
https://fastdfe.readthedocs.io
GNU General Public License v3.0

The parameters of joint inference #8

Open fatlonggg opened 6 days ago

fatlonggg commented 6 days ago

Hi Janek, when I use joint inference, I need to specify which parameters are shared between populations. However, I can't find guidance on all the available parameters. I can only find 5 parameters in the API of the DFE parametrizations: S_d, b, p_b, S_b, eps. Could you please list all the available parameters and their meanings? By the way, in the API I found two contradictory explanations for the same parameter abbreviation. For example, for "S_d" I found both "Mean selection coefficient for deleterious mutations" and "Mean of the DFE for S >= 0". Does the same abbreviation have different meanings in different contexts, or is it just a typo?

fatlonggg commented 6 days ago

Hi Janek, it's me again. I have another question. Which does fastDFE infer: the distribution of fitness effects of mutation sites, or the distribution of fitness effects of mutations? For example, assume I have 4 sites in my VCF, each with a different fitness effect: site 1 falls in (-inf, -100), site 2 in [-100, -10), site 3 in [-10, -1), and site 4 in [-1, 0). My accessions carry different numbers of mutant alleles at these sites: 40 heterozygous mutations at site 1, 30 at site 2, 20 at site 3, and 10 at site 4. If I only have these 4 sites, what will the result look like? Will the DFE be 0.25 for each of the 4 fitness effect classes (the DFE of mutation sites)? Or will it be 0.4, 0.3, 0.2, 0.1 for (-inf, -100), [-100, -10), [-10, -1), [-1, 0), respectively (the DFE of mutations)? Furthermore, how should I understand the description "p_b: Probability of a beneficial mutation"? Does the probability refer to mutation sites or to the mutations present in my accessions?

Sendrowski commented 6 days ago

Hi!

In addition to eps, the ancestral allele misidentification parameter, you can specify any parameter of the Parametrization in use to be fixed.

For example,

fd.SharedParams(params=['p_b', 'S_b'], types=['type1', 'type2'])

specifies the parameters p_b and S_b of GammaExpParametrization to be shared between type1 and type2.

Regarding the statement in GammaDiscreteParametrization that S_d is for positive selection coefficients: that's a typo. I'll fix it.

Sendrowski commented 6 days ago

Hi!

The DFE is the probability distribution of selection coefficients for all considered (non-synonymous) sites, collectively. If you wish to have different probability distributions for different types of sites, you would need to stratify your sites into different component SFS, and infer the DFE for each stratification.

I'm not entirely sure about your question, but the inferred DFE can be interpreted as the most likely probability distribution of (population-scaled) selection coefficients that gives rise to the observed allele frequencies of (non-synonymous) sites (as summarized by the selected SFS).

Similarly, the p_b parameter in GammaExpParametrization is the probability for a non-synonymous mutation to be beneficial (S > 0).
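To make the meaning of p_b concrete, here is a minimal sketch that samples selection coefficients from a gamma-exponential mixture using only the Python standard library. It is not fastDFE's code: it assumes that, with probability p_b, S is exponential with mean S_b, and that otherwise -S is gamma-distributed with shape b and mean |S_d|. The function name and the parameter values are made up for illustration.

```python
import random


def sample_gamma_exp(n, S_d=-400.0, b=0.4, p_b=0.05, S_b=4.0, seed=0):
    """Draw n population-scaled selection coefficients from a gamma-exponential mixture.

    Assumed model for this sketch (not taken from fastDFE's source):
    with probability p_b, S is exponential with mean S_b (beneficial);
    otherwise -S is gamma-distributed with shape b and mean |S_d| (deleterious).
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        if rng.random() < p_b:
            # expovariate takes the rate, i.e. 1 / mean
            draws.append(rng.expovariate(1.0 / S_b))
        else:
            # gammavariate takes (shape, scale); mean = shape * scale = |S_d|
            draws.append(-rng.gammavariate(b, abs(S_d) / b))
    return draws


draws = sample_gamma_exp(100_000)
frac_beneficial = sum(s > 0 for s in draws) / len(draws)
print(f"fraction with S > 0: {frac_beneficial:.3f}")
```

Under these assumptions, the fraction of draws with S > 0 should be close to the chosen p_b, which is the sense in which p_b is a probability per new non-synonymous mutation rather than per sampled allele copy.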

fatlonggg commented 6 days ago

> In addition to eps, the ancestral allele misidentification parameter, you can specify any parameter of the Parametrization in use to be fixed.

Thank you for your reply. But I am still confused about which other parameters can be fixed. When I ran joint inference, the output warned me: "A large number of parameters is optimized jointly (13)". What does the '13' mean? Does it say that 13 parameters were optimized jointly? What are these 13 parameters? Could you provide a list explaining all the available parameters, as you did in the API? For example, you wrote: S_d (float) – Mean selection coefficient for deleterious mutations, b (float) – Shape parameter for gamma distribution, p_b (float) – Probability of a beneficial mutation... Could you also provide explanations for all the other available parameters?

> The DFE is the probability distribution of selection coefficients for all considered (non-synonymous) sites, collectively.

Do you mean that fastDFE outputs the DFE for all the 0-fold degenerate sites, regardless of whether there are mutations at those sites?

> I'm not entirely sure about your question, but the inferred DFE can be interpreted as the most likely probability distribution of (population-scaled) selection coefficients that gives rise to the observed allele frequencies of (non-synonymous) sites (as summarized by the selected SFS).

I'm sorry for expressing myself poorly. Let me try to explain. As I understand it, each mutation has its own fitness effect, maybe 0 (neutral), -0.01 (deleterious), or >0 (beneficial), and the DFE shows the proportion of mutations with the corresponding fitness effect. For example, if a population has only one mutation site (meaning the genomes of all accessions are identical except at one site), then the VCF file will contain only one row. If this site is so conserved that any mutation at it is very deleterious, should the DFE of this population be 1 in (-inf, -100) and 0 in the other categories? If instead the mutation is slightly deleterious (nearly neutral), should the DFE be 1 in [-1, 0) and 0 in the other categories? Am I thinking right?

Now assume the population has 2 mutation sites, where site1 is very conserved ((-inf, -100)) and site2 is nearly neutral ([-1, 0)). Assume a VCF file in which only two accessions have genotype "0/1" at site1 and all others are "0/0", and in which only one accession is "0/1" at site2 and all others are "0/0". In this case the population has two mutation sites (site1: conserved, site2: nearly neutral) and 3 mutations (2 at site1, 1 at site2). By the DFE for mutation sites, I mean that here the DFE for sites should be {(-inf, -100): 0.5, [-100, -10): 0, [-10, -1): 0, [-1, 0): 0.5} (total = 2 sites; site1 conserved, site2 nearly neutral), and the DFE for mutations should be {(-inf, -100): 0.66666, [-100, -10): 0, [-10, -1): 0, [-1, 0): 0.33333} (total = 3 mutations; 2 at site1 extremely deleterious, 1 at site2 nearly neutral). This is my understanding of the DFE of sites and the DFE of mutations. Am I thinking right? So which does fastDFE provide: the DFE for sites or for mutations, as described above?

Or, as you said above, does the "total" refer to all the 0-fold degenerate sites, so that the proportion in (-inf, -100) equals "the number of 0-fold degenerate sites with a fitness effect in (-inf, -100)" / "the number of all 0-fold degenerate sites"?

Sendrowski commented 4 days ago

> Thank you for your reply. But I am still confused about which other parameters can be fixed. When I ran joint inference, the output warned me: "A large number of parameters is optimized jointly (13)". What does the '13' mean? Does it say that 13 parameters were optimized jointly? What are these 13 parameters? Could you provide a list explaining all the available parameters, as you did in the API? For example, you wrote: S_d (float) – Mean selection coefficient for deleterious mutations, b (float) – Shape parameter for gamma distribution, p_b (float) – Probability of a beneficial mutation... Could you also provide explanations for all the other available parameters?

If you don't specify parameters to be shared or fixed, then they will be estimated independently for each type, i.e. you have type1.S_d, type2.S_d, type1.p_b and so on. You should be able to see the estimated parameters indexed by name by accessing JointInference.params_mle.
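As a rough illustration of how the per-type naming and sharing affect the number of jointly optimized parameters, here is a small plain-Python sketch. It does not call fastDFE and does not try to reproduce the 13 from the warning, since the actual count depends on the parametrization, the types, and what is fixed. The function name and the "all." prefix for shared parameters are hypothetical labels chosen for this example.

```python
def free_parameter_names(types, params, shared=()):
    """List the free parameters of a joint fit over several types.

    `shared` holds parameter names estimated once for all types; every other
    parameter is estimated separately per type as "<type>.<param>".
    The "all." prefix is a made-up label for this sketch, not fastDFE output.
    """
    shared = set(shared)
    names = [f"all.{p}" for p in params if p in shared]
    names += [f"{t}.{p}" for t in types for p in params if p not in shared]
    return names


params = ["S_d", "b", "p_b", "S_b", "eps"]
types = ["type1", "type2"]

independent = free_parameter_names(types, params)
with_sharing = free_parameter_names(types, params, shared=["p_b", "S_b"])

print(len(independent), independent)
print(len(with_sharing), with_sharing)
```

In this toy setup, sharing p_b and S_b between the two types removes one copy of each from the optimization, so fewer parameters are estimated jointly than in the fully independent case.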

> Do you mean that fastDFE outputs the DFE for all the 0-fold degenerate sites, regardless of whether there are mutations at those sites?

Yes, all mutations are summarized by means of the SFS, so site-wise information gets lost. You can only increase precision by stratifying your SFS into different classes.
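The loss of site-wise information can be seen in a minimal sketch of how an unfolded SFS is tallied from per-site derived-allele counts. This is plain Python, not fastDFE's parser, and the counts are invented for illustration.

```python
from collections import Counter


def sfs_from_counts(derived_counts, n):
    """Build an unfolded SFS for sample size n from per-site derived-allele counts.

    Entry k of the result is the number of sites at which the derived allele
    is present in exactly k of the n sampled chromosomes.
    """
    tally = Counter(derived_counts)
    return [tally.get(k, 0) for k in range(n + 1)]


# Hypothetical derived-allele counts at six sites, n = 4 chromosomes
# (a count of 0 is a monomorphic site).
counts_a = [0, 0, 1, 1, 2, 3]
counts_b = [3, 1, 0, 2, 0, 1]  # same counts assigned to different sites

sfs_a = sfs_from_counts(counts_a, 4)
sfs_b = sfs_from_counts(counts_b, 4)
print(sfs_a)
print(sfs_a == sfs_b)
```

Both orderings give the same spectrum, so any inference based on the SFS alone cannot tell which site had which frequency. Stratifying sites into separate SFS is the only way to keep class-level distinctions.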

> I'm sorry for expressing myself poorly. Let me try to explain. As I understand it, each mutation has its own fitness effect, maybe 0 (neutral), -0.01 (deleterious), or >0 (beneficial), and the DFE shows the proportion of mutations with the corresponding fitness effect. For example, if a population has only one mutation site (meaning the genomes of all accessions are identical except at one site), then the VCF file will contain only one row. If this site is so conserved that any mutation at it is very deleterious, should the DFE of this population be 1 in (-inf, -100) and 0 in the other categories? If instead the mutation is slightly deleterious (nearly neutral), should the DFE be 1 in [-1, 0) and 0 in the other categories? Am I thinking right?
>
> Now assume the population has 2 mutation sites, where site1 is very conserved ((-inf, -100)) and site2 is nearly neutral ([-1, 0)). Assume a VCF file in which only two accessions have genotype "0/1" at site1 and all others are "0/0", and in which only one accession is "0/1" at site2 and all others are "0/0". In this case the population has two mutation sites (site1: conserved, site2: nearly neutral) and 3 mutations (2 at site1, 1 at site2). By the DFE for mutation sites, I mean that here the DFE for sites should be {(-inf, -100): 0.5, [-100, -10): 0, [-10, -1): 0, [-1, 0): 0.5} (total = 2 sites; site1 conserved, site2 nearly neutral), and the DFE for mutations should be {(-inf, -100): 0.66666, [-100, -10): 0, [-10, -1): 0, [-1, 0): 0.33333} (total = 3 mutations; 2 at site1 extremely deleterious, 1 at site2 nearly neutral). This is my understanding of the DFE of sites and the DFE of mutations. Am I thinking right? So which does fastDFE provide: the DFE for sites or for mutations, as described above?
>
> Or, as you said above, does the "total" refer to all the 0-fold degenerate sites, so that the proportion in (-inf, -100) equals "the number of 0-fold degenerate sites with a fitness effect in (-inf, -100)" / "the number of all 0-fold degenerate sites"?

Given that we know the selection coefficients of the sites precisely, what you call the DFE for sites matches the DFE most closely. The allele frequencies of the variant sites are what provide information on their selection coefficients (we usually assume there is only one mutation per site, whose frequency then increases or decreases over time).
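The two weightings from the question can be written out for the two-site toy example. This is only arithmetic on the hypothetical numbers from the thread, assuming the selection class of each site is known, which is not the case in real inference, where the DFE is estimated from the SFS. The function name is made up for this sketch.

```python
from collections import defaultdict


def bin_proportions(sites, weight_by_copies=False):
    """Proportion of weight per selection class.

    `sites` is a list of (class_label, mutant_copies) pairs. With
    weight_by_copies=False each site counts once ("DFE for sites");
    with True each site counts once per mutant copy ("DFE for mutations").
    """
    totals = defaultdict(float)
    for label, copies in sites:
        totals[label] += copies if weight_by_copies else 1
    grand_total = sum(totals.values())
    return {label: value / grand_total for label, value in totals.items()}


# Toy example from the question: site1 has 2 heterozygous carriers, site2 has 1.
sites = [("(-inf, -100)", 2), ("[-1, 0)", 1)]

per_site = bin_proportions(sites)
per_copy = bin_proportions(sites, weight_by_copies=True)
print(per_site)
print(per_copy)
```

Here per_site is the version that corresponds most closely to the DFE as described above. The number of copies at a site is not a weight on the DFE; it is the allele frequency, which is the data that carries information about that site's selection coefficient.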

I hope this answers your questions.