ANGSD / angsd

Program for analysing NGS data.
229 stars 50 forks source link

HWE filtering #209

Open AlexLemo opened 5 years ago

AlexLemo commented 5 years ago

Dear Angsd group,

I am trying to filter loci outside of Hardy-weinberg equilibrium out of my dataset and I have some questions since I am unsure about this filtering.

1) I ran two different test runs with all the same parameters, and ony changing HWE related filters. The first one was run using -doHWE 1 -hwe_pval 0.05 and the second one using -HWE_pval_F 0.05. The results are extremely different as I obtain 15600 SNPs in the first run and 7500 in the second. I am not sure I am really understanding the differences between these filters and why there is so much differences between the output?

2) In order to get my pvalue threshold for HWE correction I would like to account for multiple testing. Is it ok to run the analysis without any HW filter, get the total number of loci and then run the analysis again and add a bonferonni corrected threshold of 0.05/(number of loci) ?

Thanks for your help,

Cheers, Alexandre

z0on commented 5 years ago

Hi Alexandre - a collaborator of mine, Nate Pope (cc'd), has recently come up with this neat procedure to identify sites that contain more than 50% of heterozygotes with a given probability (based on genotype likelihoods).

His script (HetMajorityProb.py) is attached.

It uses output by ANGSD run with options "-doPost 2 -doGeno 11”. Output (for each genotyped site): contig, position, number non-missing samples, number "hard-called" heterozygotes, expected num heterozygotes, probability of heterozygote majority. Requires numpy, scipy, and poibin (place attached poibin.py in your PYTHONPATH)

The idea is to first run angsd without HWE filter, then use this HetMajorityProb.py and awk to collect the sites in which probability of heterozygote majority is less than say 75%, index them, and then run subsequent angsd commands only on these sites (with -sites argument).

This works really well for me - the annoying “blip” in the AFS at p=0.5 is gone.

Example use:

FILTERS="-uniqueOnly 1 -remove_bads 1 -skipTriallelic 1 -minMapQ 20 -minQ 25 -doHWE 1 -sb_pval 1e-5 -hetbias_pval 1e-5 -minInd 20 " TODO="-doMajorMinor 1 -doMaf 1 -dosnpstat 1 -doPost 2 -doGeno 11" angsd -b bams -GL 1 -P 1 $FILTERS $TODO -out sfilt zcat sfilt.geno.gz | python ~/bin/HetMajorityProb.py | awk '\$6 < 0.75 {print \$1\"\t\"\$2}' > filteredSites angsd sites index filteredSites

Misha

On Mar 15, 2019, at 9:16 AM, AlexLemo notifications@github.com wrote:

Dear Angsd group,

I am trying to filter loci outside of Hardy-weinberg equilibrium out of my dataset and I have some questions since I am unsure about this filtering.

• I ran two different test runs with all the same parameters, and ony changing HWE related filters. The first one was run using -doHWE 1 -hwe_pval 0.05 and the second one using -HWE_pval_F 0.05. The results are extremely different as I obtain 15600 SNPs in the first run and 7500 in the second. I am not sure I am really understanding the differences between these filters and why there is so much differences between the output?

• In order to get my pvalue threshold for HWE correction I would like to account for multiple testing. Is it ok to run the analysis without any HW filter, get the total number of loci and then run the analysis again and add a bonferonni corrected threshold of 0.05/(number of loci) ?

Thanks for your help,

Cheers, Alexandre

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

AlexLemo commented 5 years ago

Hi,

Thank you Misha, this is something worth considering.

However, I still don't understand the difference between the HWE filters... I did some additional testing, and found some differences, but I still fail to really get their implication. There are differences in F values (scale departure from HWE), but also in p-values derived from them. Basically:

Case A--- HWE_PVAl_F forces the F value to be constrained between 0 (absence of deviation) and 1 (total deviation). As a consequence, a large portion of the dataset (when F should be < 1) gets 0 as a value for F. Thus, the related p-values = 1 (i.e absence of deviation from HWE).

Case B--- doHWE 1 -minpvalHWE [threshold] behaves differently. Here, F values are not constrained and can range between -1 and 1. Thus, there is no large peak at p-value = 1.

As a result, P-values when F > 0 are totally correlated between the two filters. However, p-values derived from F<=0 are different between case A (Including all the clipped values) and B ( Where values are not clipped and can be negative).

So, what does this exactly mean? Is it 100% certain that loci clipped at F = 0 (Case A) are within HWE, or is it so that we cannot be certain because of the F clipping (i.e for some reason the algorithm cannot calculate negative F values) ? And how to explain the differences between the two filters?

Moreover, which algorithm would be best used in the case where someone would simply want to trim loci that are not under HWE ?

Thanks for the help, Cheers, Alexandre

nspope commented 5 years ago

Alexandre,

I think it comes down to what exactly you are trying to exclude from your data. Inbreeding in the strict sense should always reduce heterozygosity**. This implies that 0 <= F <= 1. However, there are many processes that would cause departures from HWE by causing heterozygosity to increase: for example, balancing selection, or artefacts from collapsed paralogs. In which case -1 <= F <= 1. So, I find it most useful to think in terms of (a.) filtering only sites that have a significant deficit of heterozygotes, or (b.) filtering sites that have either a significant excess or significant deficit of heterozygotes. If I want sites that conform to HWE regardless of what process might be causing deviations, I'd choose the latter.

**You can see this from a derivation of the inbreeding coefficient. Let me be unnecessarily pedantic (sorry) and let G(i,j) be the genotype frequency in a population for a diallelic locus with alleles i,j. Let P be the frequency of allele i. Assume that a fraction E of individuals mate only with individuals of the same genotype and the rest are randomly mating. Then, the genotype frequency in the next generation,

G(i,j)' = 2 P (1-P) (1-E) + 0.5 G(i,j) E ...... (heterozygous, i != j) G(i,i)' = P^2 (1-E) + (G(i,i) + 1/4 G(i,j)) E ...... (homozygous for i)

At equilibrium G(i,j)' = G(i,j) = G(i,j)* so,

G(i,j) = 2P(1-P)(1 - E/(2-E)) ..... (heterozygous, i != j) G(i,i) = P^2 + P(1-P) E/(2-E) ..... (homozygous for i)

The inbreeding coefficient, F = E/(2-E). Because E is necessarily in [0,1], then F is also in [0,1] and any F > 0 causes a deficit in heterozygotes. However, if you define F not in terms of E (inbreeding), but as a "general" departure from HWE, then there is no reason why F shouldn't be bounded [-1,1].

nspope commented 5 years ago

Also, responding to (2.) in your original question, accounting for multiple testing in this context means being more permissive in terms of what you let in. A family-wise type I error correction such as Bonferonni will be extraordinarily permissive with thousands SNPs, and so would essentially negate the filter. I'd either use uncorrected p-values as angsd's filter already does (the conservative choice) or if you must, correct for false discovery rate (not family-wise rate)

aalbrechtsen commented 4 years ago

Dear Alexandre You and nspope have already answered the question your self to a large extent but I will add my own take on it aswell.

-HWE_pval_F is like other maximum likelihood methods restricted to F being [0,1] (because F is this model is a probabilty of being inbreed). Therefore, it does not allow F to be negativ will means that it cannot detect diviations from HWE caused by excess heterozygoes genotypes.

-HWE_pval is more like the commenly use momen estimators (like in plink) where we can compare the expected vs. the observed genotypes. This allows for negative F ( a weird concept) which refleks an excess of heteroxygoes.

I would use HWE_pval unless you want to estimate the amount of inbreeding F where F interpreted as the probability of an individual being inbreed.

Regarding 2) the multiple testing then there is no answer that most people will agree on and the choice depends on what you want to achieve with a HWE filter. Often people just use threshold of p-value 1e-6 which will only remove a low fraction of sites under the nulll. However, you can be more strict a use a higher cutoff or you can use the estimate of F as cutoff such -0.2>F >0.2.

-Anders