Open jbruxaux opened 2 years ago
Quick update...
I thought it could be linked to the sites used in the different analyses, and realized that usually we consider only variable sites to calculate the inbreeding coefficient (correct me there if I'm wrong). So I re-ran the ANGSD analysis with the -doHWE and the -SNP_pval parameter to consider only sites that are actually variable in each population. I get values that are much more similar to what was published before (good point), so I will keep that for the moment.
ngsF | doHWE | doHWE SNP_pval |
---|---|---|
0.150 | 0.132 | 0.020 |
0.135 | 0.100 | 0.029 |
0.139 | 0.098 | 0.028 |
0.106 | 0.084 | 0.018 |
0.195 | 0.203 | 0.026 |
0.119 | 0.095 | 0.028 |
0.115 | 0.115 | 0.002 |
However it doesn't solve completely the difference between the software. My new guess is that, since I used different dataset (by population or for the species), the HWE is not considered the same. I will try to use different populations dataset with ngsF and see if I recover similar values.
Last update:
I now used the exact same command to produce the glf file and the hwe file (using the -SNP_pval command). So I should have the exact same data used by ANGSD and ngsF. But the results are still very different:
ngsF | doHWE | doHWE SNP_pval | ngsF subpop SNP_pval |
---|---|---|---|
0.150 | 0.132 | 0.020 | 0.047 |
0.135 | 0.100 | 0.029 | 0.050 |
0.139 | 0.098 | 0.028 | 0.053 |
0.106 | 0.084 | 0.018 | 0.054 |
0.195 | 0.203 | 0.026 | 0.079 |
0.119 | 0.095 | 0.028 | 0.067 |
0.115 | 0.115 | 0.002 | 0.044 |
No guess anymore to why it is (so) different... Any idea?
Dear jbruxeaux, I have no answers but a question: could you share your code for getting the inbreeding coefficient / HWE values. I am interested in exactly your question but my code is not working :-/
Thanks in advance, Christine
Hi Christine!
Sorry for the delayed answer, our cluster was off for a week...
I called angsd for each population using the command:
angsd -doCounts 1 -setMinDepthInd 5 -GL 2 -out ${path}/ANGSD_${pop}_SNP -ref $ref -nThreads 18 -bam ${bam_file} -minMapQ 40 -minQ 20 -sites ${path}/final_SNP_set_maf0.05.txt -doMajorMinor 1 -doHWE 1 -doMaf 1 -SNP_pval 1e-6 -doGlf 3
Then extracted the inbreeding coefficient by site:
zcat ${path}/ANGSD_${pop}_SNP.hwe.gz | awk '{if (NR>1) {print $7}}' > ${path}/ANGSD_${pop}_SNP.FIS_by_site
And if you want the averaged value:
echo $pop > ${path}/ANGSD_${pop}_SNP.FIS_average
awk '{sum+=$1} END {print "FIS: " sum/ NR}' ${path}/ANGSD_${pop}_SNP.FIS_by_site >> ${path}/ANGSD_${pop}_SNP.FIS_average
Hope it helps! =)
Hi!
I am working on a dataset where I would like to get the averaged inbreeding coefficient by population, to compare with previous published estimates. I used a first method where I ran ANGSD on each population with the options -doHWE 1 and I got this value by site in the column 7 of the .hwe.gz file, and I averaged over the length of my dataset. I tried another method, where I ran an ANGSD analysis on the same dataset but with all the samples, and then used ngsF to get the individual FIS, and averaged by population. Both results are correlated, but there are substantial differences anyway (R2 "only" 0.762). A few examples:
What do you think would be the best method in this case? The "classical" method using vcftools implies averaging individual values, so the ngsF method would be similar in that sense... But it does not mean it is more reliable...
Thanks for your advice!