ANGSD / angsd

Program for analysing NGS data.
228 stars 50 forks source link

Estimation of inbreeding coefficient by population #475

Open jbruxaux opened 2 years ago

jbruxaux commented 2 years ago

Hi!

I am working on a dataset where I would like to get the averaged inbreeding coefficient by population, to compare with previous published estimates. I used a first method where I ran ANGSD on each population with the options -doHWE 1 and I got this value by site in the column 7 of the .hwe.gz file, and I averaged over the length of my dataset. I tried another method, where I ran an ANGSD analysis on the same dataset but with all the samples, and then used ngsF to get the individual FIS, and averaged by population. Both results are correlated, but there are substantial differences anyway (R2 "only" 0.762). A few examples:

ngsF doHWE
0.150 0.132
0.135 0.100
0.139 0.098
0.106 0.084
0.195 0.203
0.119 0.095
0.115 0.115

What do you think would be the best method in this case? The "classical" method using vcftools implies averaging individual values, so the ngsF method would be similar in that sense... But it does not mean it is more reliable...

Thanks for your advice!

jbruxaux commented 2 years ago

Quick update...

I thought it could be linked to the sites used in the different analyses, and realized that usually we consider only variable sites to calculate the inbreeding coefficient (correct me there if I'm wrong). So I re-ran the ANGSD analysis with the -doHWE and the -SNP_pval parameter to consider only sites that are actually variable in each population. I get values that are much more similar to what was published before (good point), so I will keep that for the moment.

ngsF doHWE doHWE SNP_pval
0.150 0.132 0.020
0.135 0.100 0.029
0.139 0.098 0.028
0.106 0.084 0.018
0.195 0.203 0.026
0.119 0.095 0.028
0.115 0.115 0.002

However it doesn't solve completely the difference between the software. My new guess is that, since I used different dataset (by population or for the species), the HWE is not considered the same. I will try to use different populations dataset with ngsF and see if I recover similar values.

jbruxaux commented 2 years ago

Last update:

I now used the exact same command to produce the glf file and the hwe file (using the -SNP_pval command). So I should have the exact same data used by ANGSD and ngsF. But the results are still very different:

ngsF doHWE doHWE SNP_pval ngsF subpop SNP_pval
0.150 0.132 0.020 0.047
0.135 0.100 0.029 0.050
0.139 0.098 0.028 0.053
0.106 0.084 0.018 0.054
0.195 0.203 0.026 0.079
0.119 0.095 0.028 0.067
0.115 0.115 0.002 0.044

No guess anymore to why it is (so) different... Any idea?

EwersAquaGenomics commented 1 year ago

Dear jbruxeaux, I have no answers but a question: could you share your code for getting the inbreeding coefficient / HWE values. I am interested in exactly your question but my code is not working :-/

Thanks in advance, Christine

jbruxaux commented 1 year ago

Hi Christine!

Sorry for the delayed answer, our cluster was off for a week...

I called angsd for each population using the command: angsd -doCounts 1 -setMinDepthInd 5 -GL 2 -out ${path}/ANGSD_${pop}_SNP -ref $ref -nThreads 18 -bam ${bam_file} -minMapQ 40 -minQ 20 -sites ${path}/final_SNP_set_maf0.05.txt -doMajorMinor 1 -doHWE 1 -doMaf 1 -SNP_pval 1e-6 -doGlf 3 Then extracted the inbreeding coefficient by site: zcat ${path}/ANGSD_${pop}_SNP.hwe.gz | awk '{if (NR>1) {print $7}}' > ${path}/ANGSD_${pop}_SNP.FIS_by_site

And if you want the averaged value:

echo $pop > ${path}/ANGSD_${pop}_SNP.FIS_average
awk '{sum+=$1} END {print "FIS: " sum/ NR}' ${path}/ANGSD_${pop}_SNP.FIS_by_site >> ${path}/ANGSD_${pop}_SNP.FIS_average

Hope it helps! =)