ANGSD / angsd

Program for analysing NGS data.

How to remove low-frequency sites when calculating Fst #103

Closed biozzq closed 5 years ago

biozzq commented 7 years ago

Dear @ANGSD

I have found that low-frequency sites can sometimes confound Fst analyses. Is there a parameter that handles this?

Best Zhuqing

mfumagalli commented 7 years ago

Dear Zhuqing,

You can do SNP calling to remove sites that are clearly not polymorphic in any of the populations; this removes the noise associated with monomorphic sites or rare variants.

Best

Matteo


biozzq commented 7 years ago

Dear @mfumagalli

Thank you. Do you mean I should do SNP calling (with hard-called genotypes, although this is not recommended when using ANGSD) before running realSFS, so that I select the set of variants used in the computation? I think I should also include the monomorphic sites to avoid biases when calculating summary statistics (e.g., Tajima's D, weighted Fst), and treat the rare variants as monomorphic sites. However, I do not know how to do this. Thanks.

Best Zhuqing

mfumagalli commented 7 years ago

Hi Zhuqing,

I was not suggesting SNP calling by hard genotype calling, but removing sites that are clearly monomorphic, to reduce the noise in the estimates of FST and Tajima's D. By the way, FST and Tajima's D are not affected by the number of monomorphic sites, but if you "mask" your rare variants as invariable, that will affect your estimates.
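To illustrate the point about monomorphic sites versus masked rare variants, here is a minimal sketch that computes Tajima's D from an unfolded site frequency spectrum of counts using the standard Tajima (1989) formula. It is not how ANGSD computes the statistic (ANGSD works from genotype likelihoods and posterior expectations); the function name and the toy spectrum are made up for the example, and the statement only holds for D computed from raw sums, not for per-site normalised quantities.

```python
import math


def tajimas_d(sfs):
    """Tajima's D from an unfolded SFS of counts.

    sfs[i] is the number of sites at which i of the n sampled chromosomes
    carry the derived allele, for i = 0..n. Bins 0 and n (monomorphic
    sites) are ignored by construction.
    """
    n = len(sfs) - 1
    segregating = sfs[1:n]  # bins 1..n-1
    S = sum(segregating)
    if S == 0:
        raise ValueError("no segregating sites")
    # Mean number of pairwise differences (sum over sites).
    pi = sum(x * 2.0 * i * (n - i) / (n * (n - 1))
             for i, x in enumerate(segregating, start=1))
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))


n = 10
# Toy spectrum proportional to the neutral expectation (1/i), plus
# 1000 monomorphic sites in bin 0.
neutral = [1000.0] + [100.0 / i for i in range(1, n)] + [0.0]
# Same variants, many more monomorphic sites.
more_monomorphic = [50000.0] + neutral[1:]
# Singletons "masked" as invariable: moved from bin 1 to bin 0.
masked = neutral[:]
masked[0] += masked[1]
masked[1] = 0.0

d_neutral = tajimas_d(neutral)
d_more_monomorphic = tajimas_d(more_monomorphic)
d_masked = tajimas_d(masked)
```

In this toy case `d_neutral` and `d_more_monomorphic` are identical and essentially zero, while `d_masked` is positive, because dropping singletons lowers the number of segregating sites much more than it lowers the pairwise diversity.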

Best

Matteo


biozzq commented 7 years ago

Dear @mfumagalli

Thank you. You mean I can use just the union of common variants (e.g., filtered by minor allele frequency) to prepare the saf files for each population (using the -sites parameter when running angsd). However, when I prepare the 2dsfs file, I think I should keep all sites, not just the common variants. Is this right?
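As a rough sketch of the selection step described here, the snippet below picks sites above a minor allele frequency threshold from a tab-separated allele-frequency table and formats them as chromosome/position pairs of the kind a `-sites` file contains. The column names `chromo`, `position` and `knownEM` are what I would expect in an ANGSD `.mafs` file produced with `-doMaf 1`, but treat them as assumptions and check the header of your own file; the function names, the threshold and the example rows are invented. As far as I know the resulting file also has to be indexed with `angsd sites index` before use.

```python
def common_sites(maf_lines, min_maf=0.05, freq_column="knownEM"):
    """Return (chromosome, position) for sites with MAF >= min_maf.

    maf_lines: iterable of tab-separated lines, header first.
    freq_column: name of the frequency column (assumed, check your file).
    """
    header = None
    sites = []
    for line in maf_lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if header is None:
            header = fields
            chrom_i = header.index("chromo")
            pos_i = header.index("position")
            freq_i = header.index(freq_column)
            continue
        freq = float(fields[freq_i])
        # Fold the frequency so the filter acts on the minor allele.
        if min(freq, 1.0 - freq) >= min_maf:
            sites.append((fields[chrom_i], int(fields[pos_i])))
    return sites


def format_sites(sites):
    """Format sites as tab-separated 'chromosome<TAB>position' lines."""
    return "".join(f"{chrom}\t{pos}\n" for chrom, pos in sites)


# Invented example rows, for illustration only.
example = [
    "chromo\tposition\tmajor\tminor\tknownEM\tnInd",
    "chr1\t100\tA\tG\t0.004\t40",
    "chr1\t120\tC\tT\t0.210\t38",
    "chr1\t300\tG\tA\t0.450\t40",
]
kept = common_sites(example)
sites_text = format_sites(kept)
```

Whether such a filtered list should also be used for the 2D SFS is exactly the open question in this comment; the sketch only covers the filtering itself.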

Best Zhuqing

biozzq commented 7 years ago

Dear @mfumagalli

In addition, can I prepare the 2dsfs using just the fourfold degenerate sites?

Best Zhuqing

ANGSD commented 7 years ago

Dear Zhuqing, making it possible to use the 4-fold degenerate sites is actually an excellent idea. I'll create a GitHub issue so I don't forget. Then it will be easier to look into syn and nonsyn spectra etc.

Regarding this issue, can you elaborate on the confounding effect?


biozzq commented 7 years ago

Dear @mfumagalli @ANGSD

Many thanks. I will try to use fourfold degenerate sites to produce the spectra that will be used for the summary statistics.

In my case, I calculated Fst with a sliding-window method and focused on the top three windows. When I looked at these windows, I found that the top two have significantly more rare variants (MAF < 0.01) than the third one (I mainly focus on the genes in this third window, so I would like it to be the highest-ranked window). When I removed these rare variants, the global pattern did not change, but the third window became the highest of all windows (which makes it easier for me to discuss these genes in our study). So I decided to remove these rare variants. I think this paper discusses the confounding effect (Bhatia G, Patterson N, Sankararaman S, et al. Estimating and interpreting FST: the impact of rare variants. Genome Research, 2013, 23(9): 1514-1521).
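A small numerical sketch of how rare variants can shift a windowed Fst value, using the Hudson estimator as written in Bhatia et al. (2013) and comparing the two usual ways of combining sites (ratio of averages versus average of ratios). This is not what realSFS computes internally; the allele frequencies, sample sizes and function names are invented, and the direction and size of the shift depend on the estimator and on the data.

```python
def hudson_terms(p1, p2, n1, n2):
    """Per-site numerator and denominator of the Hudson Fst estimator.

    p1, p2: sample allele frequencies in the two populations.
    n1, n2: sample sizes, taken here as the number of sampled chromosomes.
    """
    num = ((p1 - p2) ** 2
           - p1 * (1 - p1) / (n1 - 1)
           - p2 * (1 - p2) / (n2 - 1))
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num, den


def window_fst(sites, n1, n2):
    """Return (ratio_of_averages, average_of_ratios) for one window."""
    terms = [hudson_terms(p1, p2, n1, n2) for p1, p2 in sites]
    # Sites fixed for the same allele in both samples carry no information.
    terms = [(num, den) for num, den in terms if den > 0]
    ratio_of_averages = (sum(num for num, _ in terms)
                         / sum(den for _, den in terms))
    average_of_ratios = sum(num / den for num, den in terms) / len(terms)
    return ratio_of_averages, average_of_ratios


# Invented window: two strongly differentiated common SNPs and eight
# rare variants seen once in population 1 (frequency 1/100).
common = [(0.8, 0.2)] * 2
rare = [(0.01, 0.0)] * 8

common_roa, common_aor = window_fst(common, 100, 100)
all_roa, all_aor = window_fst(common + rare, 100, 100)
```

With these made-up numbers the rare variants barely move the ratio of averages but pull the average of ratios down sharply, which is one reason a window ranking can change when rare variants are added or removed.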

Also, when we calculate ThetaD (nucleotide diversity), I think we should not remove these rare variants. Is this right?

Best Zhuqing

biozzq commented 7 years ago

Dear @ANGSD @mfumagalli

The problem of rare variants has confused me for a long time. I want to make sure that I am doing the right thing when I compute summary statistics.

  1. generate the spectra using only fourfold degenerate sites, where mutations are putatively neutral
  2. estimate Fst using the common variants
  3. estimate the other summary statistics (ThetaD (e.g., nucleotide diversity), Tajima's D) using all sites (including the monomorphic sites as background, rare variants and common variants)
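For step 1, one possible way to list fourfold degenerate positions is sketched below. It assumes the standard genetic code, in which the third codon position is fourfold degenerate for the eight codon families starting with CT, GT, TC, CC, AC, GC, CG or GG, and it handles only the simplest case: a single in-frame coding sequence on the forward strand. The function name and the toy sequence are invented; real annotations with multiple exons or reverse-strand genes need more bookkeeping.

```python
# First two bases of the eight fourfold degenerate codon families
# in the standard genetic code (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly).
FOURFOLD_PREFIXES = frozenset(
    {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}
)


def fourfold_positions(cds, start=1):
    """Positions of fourfold degenerate third codon positions.

    cds: in-frame coding sequence on the forward strand.
    start: coordinate of the first base of cds (1-based by default).
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    positions = []
    for offset in range(0, usable, 3):
        codon = cds[offset:offset + 3]
        if codon[:2] in FOURFOLD_PREFIXES:
            positions.append(start + offset + 2)
    return positions


# Toy sequence: ATG GCT TGG CCC TAA
toy_positions = fourfold_positions("ATGGCTTGGCCCTAA", start=100)
```

Here only GCT (Ala) and CCC (Pro) belong to fourfold families, so the third bases of those two codons are returned.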

Thank you!

Best Zhuqing

mfumagalli commented 7 years ago

Dear Zhuqing,

It depends on how you classify a variant as rare or common. I suspect that your FST values will be biased if you remove low-frequency variants, but again, it depends on your question.

If I were you, and if your data set indeed consists of several samples with low depth per sample, I would filter out clearly monomorphic sites (e.g. -snp_pval 1e-2), just to remove a bit of noise and reduce your data, and then use all these sites with ANGSD to estimate both FST and the other summary stats. Just my advice.
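A sketch of what this advice could look like as a command sequence for two populations, written as a small Python helper that only builds the command strings and runs nothing. The flags are spelled as I recall them from the ANGSD documentation (`-SNP_pval`, `-doSaf`, `-sites`, `realSFS fst index`, `realSFS fst stats`); please verify them against your ANGSD version. All file names, the choice of `-GL 1`, and the helper itself are assumptions made for illustration.

```python
def fst_commands(all_bamlist, pop_bamlists, ancestral_fasta, snp_pval="1e-2"):
    """Build (but do not run) the commands for a two-population FST run.

    all_bamlist: file listing the BAMs of all samples combined.
    pop_bamlists: dict with exactly two entries, name -> BAM list file.
    ancestral_fasta: FASTA passed to -anc (hypothetical file name).
    """
    (pop1, bams1), (pop2, bams2) = pop_bamlists.items()
    sites = "polymorphic.sites.txt"  # hypothetical file name
    pair = f"{pop1}.{pop2}"
    return [
        # 1. Drop clearly monomorphic sites using all samples combined.
        f"angsd -b {all_bamlist} -GL 1 -doMajorMinor 1 -doMaf 1 "
        f"-SNP_pval {snp_pval} -out all",
        # 2. Keep chromosome and position of the retained sites.
        f"zcat all.mafs.gz | tail -n +2 | cut -f 1,2 > {sites}",
        f"angsd sites index {sites}",
        # 3. Per-population sample allele frequency likelihoods.
        f"angsd -b {bams1} -GL 1 -doSaf 1 -anc {ancestral_fasta} "
        f"-sites {sites} -out {pop1}",
        f"angsd -b {bams2} -GL 1 -doSaf 1 -anc {ancestral_fasta} "
        f"-sites {sites} -out {pop2}",
        # 4. 2D SFS as prior, then FST.
        f"realSFS {pop1}.saf.idx {pop2}.saf.idx > {pair}.ml",
        f"realSFS fst index {pop1}.saf.idx {pop2}.saf.idx "
        f"-sfs {pair}.ml -fstout {pair}",
        f"realSFS fst stats {pair}.fst.idx",
    ]


commands = fst_commands(
    "all.bamlist",
    {"popA": "popA.bamlist", "popB": "popB.bamlist"},
    "ancestral.fa",
)
```

Printing `commands` gives a script-like list that can be reviewed and adapted before anything is run.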

Best

Matteo


biozzq commented 7 years ago

Dear @mfumagalli

Thanks. I think I will treat singletons (the class of SNPs most influenced by recent population growth) as the rare variants. You mean that removing these variants will bias the results, and that whether to do so depends on the study. The confounding effect on estimating FST has been discussed in this paper (Bhatia G, Patterson N, Sankararaman S, et al. Estimating and interpreting FST: the impact of rare variants. Genome Research, 2013, 23(9): 1514-1521). So I think we can remove these singletons (frequency = 1/(2n)) before estimating FST, but not for other summary statistics (ThetaD (e.g., nucleotide diversity), Tajima's D).

Also, following your advice, if I have many subpopulations with unbalanced sample sizes, how should I filter out the clearly monomorphic sites: within each subpopulation, or with all populations combined?

Meanwhile, since we should generate the spectra before computing summary statistics, can I use just the fourfold degenerate sites for this?

Best Zhuqing

mfumagalli commented 7 years ago

I would filter out monomorphic sites based on all pops, i.e. the combined samples.

For the second point, it again depends on what you want to do with the SFS: use it only as prior information for other summary stats, or for direct demographic inference. If the latter, using the 4fold-deg sites is a good option.

M


biozzq commented 7 years ago

Dear @mfumagalli

I just want to use the SFS as prior information for the summary stats. Is it OK to use the 4fold-deg sites?

Best Zhuqing

ANGSD commented 5 years ago

Dear all, I think this issue has been resolved, so I'm closing it. Otherwise, feel free to reopen.

Best