Rosemeis / pcangsd

Framework for analyzing low depth NGS data in heterogeneous populations using PCA.
GNU General Public License v3.0

[Support] Using PCAngsdv2 to determine kinship #58

Open fidibidi opened 2 years ago

Hi All!

We have been trying to set up PCAngsd/ANGSD to test for relatedness in our clinical datasets. We've been following the methods used in the paper "Whole genome analysis sheds light on the genetic origin of Huns, Avars and conquering Hungarians".

In particular, this section:

"Presence of close relatives in the dataset interferes with unsupervised ADMIXTURE and population genetic analysis, therefore we identified close kins and just one of them was left in the dataset (Supplementary Table 9). We performed kinship analysis using the 1240K data set and the PCAangsd software (version 0.931)(Meisner and Albrechtsen 2018) from the ANGSD package with the “-inbreed 1 -kinship” options. We used the R (version 4.1.2); the RcppCNPy R package (version 0.2.10) to import the Numpy output files of PCAangsd."

Based on this, I took a trio and ran it through ANGSD to generate a BEAGLE file.

angsd -GL 1 -out CA0346.angsd -nThreads 4 -doGlf 2 -doMajorMinor 1 -doMaf 2 -SNP_pval 1e-6 -bam CA0346.bamslist

The BEAGLE file looks like this:

marker allele1 allele2 Ind0 Ind0 Ind0 Ind1 Ind1 Ind1 Ind2 Ind2 Ind2
chr1_14907 2 0 0.000793 0.999207 0.000000 0.003548 0.996452 0.000000 0.000000 0.999997 0.000003
chr1_14930 0 2 0.000000 0.999953 0.000047 0.000000 0.998436 0.001564 0.105530 0.894445 0.000025
chr1_14976 2 0 0.000000 1.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000
chr1_15118 0 2 0.000005 0.999995 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000
chr1_15211 2 3 0.000000 1.000000 0.000000 0.000011 0.999989 0.000000 0.000000 1.000000 0.000000
chr1_15274 3 2 0.000000 1.000000 0.000000 0.000790 0.999210 0.000000 0.000001 0.999999 0.000000
chr1_49272 2 0 0.051665 0.948335 0.000000 0.001532 0.998468 0.000000 0.001330 0.998670 0.000000
chr1_49298 1 3 0.441472 0.558528 0.000000 0.000054 0.999946 0.000000 0.003694 0.996306 0.000000
chr1_51803 3 1 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000
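
For anyone reading along who is unfamiliar with the layout: in a BEAGLE genotype-likelihood file written by ANGSD with -doGlf 2, the first three columns are the marker, the major allele and the minor allele (coded 0=A, 1=C, 2=G, 3=T), followed by three likelihoods per individual (major/major, major/minor, minor/minor). Below is a minimal Python sketch that splits one row into per-individual triplets, using the first data row above. The function name `parse_beagle_line` is made up for illustration and is not part of ANGSD or PCAngsd.

```python
def parse_beagle_line(line):
    """Split one BEAGLE row into (marker, major, minor, per-individual GL triplets)."""
    fields = line.split()
    marker, major, minor = fields[0], int(fields[1]), int(fields[2])
    gls = [float(x) for x in fields[3:]]
    if len(gls) % 3 != 0:
        raise ValueError("expected three likelihoods per individual")
    # Group the flat list into (major/major, major/minor, minor/minor) triplets.
    triplets = [tuple(gls[i:i + 3]) for i in range(0, len(gls), 3)]
    return marker, major, minor, triplets


# First data row of the BEAGLE file shown above.
row = ("chr1_14907 2 0 0.000793 0.999207 0.000000 "
       "0.003548 0.996452 0.000000 0.000000 0.999997 0.000003")

marker, major, minor, triplets = parse_beagle_line(row)
for i, gl in enumerate(triplets):
    # Index of the largest likelihood = most likely number of minor-allele copies.
    best = max(range(3), key=lambda g: gl[g])
    print(f"{marker} Ind{i}: GL={gl}, most likely minor-allele count={best}")
```

Running this on a handful of rows is a quick way to check that the number of individuals and the likelihood columns match what was expected from the BAM list.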

I then ran PCAngsd on this BEAGLE file:

python pcangsd.py -beagle ~/data/CA0346.angsd.beagle.gz -o ~/data/Test-inbreed-3 -inbreed 1 -kinship -threads 4

This produced the following output files (printed from a Jupyter notebook for clarity):

inbreed:
[-0.40427923 -0.84753537 -0.85559165]

kinship:
[[ 0.15704742 -0.0919658  -0.08373455]
 [-0.0919658   0.05300118  0.05168671]
 [-0.08373455  0.05168671  0.04196488]]

covariance:
[[ 0.33447766 -0.15413524 -0.18028733]
 [-0.15413524  0.75423872 -0.59264392]
 [-0.18028733 -0.59264392  0.77522612]]
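
To make the pairwise comparison explicit, here is a small sketch that lists the off-diagonal kinship entries. Several things are assumptions: the values are copied from the output above rather than loaded from disk (with the real files, `numpy.load` on the `.npy` outputs would replace the literal), the labels Ind0, Ind1, Ind2 are assumed to follow the order of the BAM list, and the reference value 0.25 is the theoretical kinship coefficient of a parent-child pair (0 for unrelated individuals). Whether PCAngsd's estimates are directly comparable to that value with only three samples is exactly what we are unsure about.

```python
from itertools import combinations

# Kinship matrix copied from the output above (not loaded from the .npy file).
kinship = [
    [0.15704742, -0.0919658, -0.08373455],
    [-0.0919658, 0.05300118, 0.05168671],
    [-0.08373455, 0.05168671, 0.04196488],
]
labels = ["Ind0", "Ind1", "Ind2"]  # assumed to match the BAM list order

# Theoretical kinship coefficient for a parent-child pair (reference only).
EXPECTED_PARENT_CHILD = 0.25

# Collect each unordered pair once from the upper triangle.
pairs = {}
for i, j in combinations(range(len(labels)), 2):
    # A kinship matrix should be symmetric; check before reading one triangle.
    assert abs(kinship[i][j] - kinship[j][i]) < 1e-12
    pairs[(labels[i], labels[j])] = kinship[i][j]

for (a, b), value in pairs.items():
    diff = value - EXPECTED_PARENT_CHILD
    print(f"{a}-{b}: kinship={value:+.4f} (difference from 0.25: {diff:+.4f})")
```

With the values above, none of the three pairs comes close to 0.25 and two are negative, which is why we would like help interpreting the output.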

Unfortunately, the PCAngsd documentation says little about how to interpret these outputs.

If anyone could help us interpret this output and verify that our commands and process are correct, that would be a tremendous help!

We were expecting two first-degree relationships (parent-child) and one unrelated pair (the two parents).

Thank you for this cool software, and have a good one! Fidi