ANGSD / angsd

Program for analysing NGS data.
230 stars 50 forks source link

Best way to calculate per site heterozygosity Hobs #248

Open clairemerot opened 5 years ago

clairemerot commented 5 years ago

Hello, We would like to compare within a portion of the genome the heterozygosity between different groups. For instance, for each snp estimate what is the proportion of heterozygotes in group 1 and what is the proportion in group 2. This will be for markers that we expect to be skewed such as putative sex chromosome/rearrangement. And I was wondering whether this would be possible in angsd taking into account the uncertainty of low coverage data.

Is it already available in one of the statistics? Or should we better be computing it manually from the beagle genotype likelihood output? Or from the output given by genotype calling -dogeno or -dopost and is it likely to be baised?

Thanks a lot for any help or advice! (and thanks again for such as useful tool as angsd! I'm sorry to be posting questions again... ) Claire

ANGSD commented 5 years ago

How many groups do you have?

clairemerot commented 5 years ago

Thanks for following-up. We will start by looking at each group alone and then comparing 2 groups or 3 groups... but does it make a difference? The idea is simply to have for each snp & each group, the Hobs (observed heterozygosity - % of heterozygotes?) Cheers, Claire

ANGSD commented 5 years ago

Have you considered doing some like pairwise fst or pbs?

Den lør. 24. aug. 2019 kl. 15.26 skrev Claire Mérot < notifications@github.com>:

Thanks for following-up. We will start by looking at each group alone and then comparing 2 groups or 3 groups... but does it make a difference? The idea is simply to have for each snp, the Hobs (observed heterozygosity - % of heterozygotes?) Cheers, Claire

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/248?email_source=notifications&email_token=ABQOR3U5CSWXADXT4DKY5KLQGEZGXA5CNFSM4IPC6YLKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5B77JQ#issuecomment-524550054, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQOR3RJKHI2SHJNUK6UYBTQGEZGXANCNFSM4IPC6YLA .

nspope commented 5 years ago

Hi Claire, see Heng Li's 2011 paper "A statistical framework ..."

In section 2.3.2 he gives a EM algorithm for estimating genotype frequencies from genotype likelihoods, which will give an estimate of the %heterozygotes per site. This is straightforward to implement yourself after dumping genotype likelihoods as a beagle-style text or flat binary file (-doGlf 2/3 with -doMajorMinor).

Alternatively you could calculate genotype probabilities, treat individuals as exchangeable, and calculate the pmf for the number of heterozygotes similarly to https://github.com/ANGSD/angsd/issues/156#issuecomment-447607684

Or ... would simply calculating inbreeding coefficients/HWE stats per site with -doHWE answer your question? This will get at excess/dearth of heterozygotes relative to estimated allele frequencies.

Nate

clairemerot commented 5 years ago

Thanks a lot from your answers... It seems indeed that the simplest way to get a % of observed heterozygotes will be to take advantage of the doHWE, thanks for pointing into that direction!

So assuming that the 6th column "F" is 1- Hobs/Hexp, (the scale departure from HWE (inbreeding coefficient - see model) and using the third column "hweFreq" (the allele frequency assuming HWE (same as -doMaf 1))

I could reconstruct Hobs= Hexp - F Hexp with Hexp = 2(hweFreq)(1-hweFreq)

We will look into that more deeply with the data and I'll try to post feedback. Thanks a lot!

(We also calculated FST previously, but here we are really interested to have a real count of heterozygotes). I'll look further in the paper/or the python script if we also need the frequencies of the homozygotes.

Thank you very much for the help! Claire

cteenis commented 4 years ago

Thanks a lot from your answers... It seems indeed that the simplest way to get a % of observed heterozygotes will be to take advantage of the doHWE, thanks for pointing into that direction!

So assuming that the 6th column "F" is 1- Hobs/Hexp, (the scale departure from HWE (inbreeding coefficient - see model) and using the third column "hweFreq" (the allele frequency assuming HWE (same as -doMaf 1))

I could reconstruct Hobs= Hexp - F Hexp with Hexp = 2(hweFreq)(1-hweFreq)

We will look into that more deeply with the data and I'll try to post feedback. Thanks a lot!

(We also calculated FST previously, but here we are really interested to have a real count of heterozygotes). I'll look further in the paper/or the python script if we also need the frequencies of the homozygotes.

Thank you very much for the help! Claire

@clairemerot Hi Claire, we also want to caculate heterozigosities and I apprepriate your feeback on how this works out for you? Thank you!

nspope commented 4 years ago

@cteenis there's now a built-in way to do this via the -maxHetFreq and -minHetFreq arguments (part of -doHWE). If you set one of these arguments to be totally permissive (e.g. -minHetFreq -1.0) then ANGSD won't filter sites on basis of proportion heterozygotes, but will add a column with the estimated heterozygote frequency to prefix.hwe.gz ... see pull request #294

clairemerot commented 4 years ago

Hi all, the builtin option is probably better by now. that's great to have it within Angsd, thanks! I'll check the update. If needed, I uploaded my script to get observed heterozygosity on github here https://github.com/clairemerot/angsd_pipeline/blob/master/01_scripts/Rscripts/Hobs_sliding.r This script takes the output of the previous -dohwe option in angsd and output Hobs by position and averaged across sliding-windows. Good luck Claire

cteenis commented 4 years ago

@nspope @clairemerot Thank you very much for the answer! Will try the -maxHetFreq and -minHetFreq arguments.

madzafv commented 4 years ago

Once we have the per site Hets, which way you suggest to go about calculating in windows? Windowscanr as in claire's script? -thanks