DivyaratanPopli / Kinship_Inference

This is a tool to estimate pairwise relatedness from ancient DNA, taking into account contamination, ROH, and ascertainment bias.
GNU General Public License v3.0

analysis of recently admixed populations? #26

Open zmaroti opened 2 weeks ago

zmaroti commented 2 weeks ago

Hi,

We wanted to test the method on individuals with close relatives. We had used ancIBD, READ, and correctKin to confirm close relationships and to identify likely unrelated individuals (no shared IBD length, and the other methods calling them unrelated; we are aware that READ can only detect relationships up to the 2nd degree, so exclusion was mainly based on ancIBD and correctKin).

Our individuals lie on a genetic cline. It is known from historical events that migrants overlaid the local population in the given period. This can also be seen in the distribution of the samples on the PCA, and the individuals can be modelled with qpAdm using the unadmixed, homogeneous migrant and local sources.

We also wanted to apply KIN to confirm our findings from the other methods, as it can indicate more distant relationships than READ.

We have minimal contamination (confirmed by ANGSD X contamination, and MT/Schmutzi as well). We used the default options with KINgaroo (with -c 0) and also with KIN.

We analyzed 18 samples from the same cemetery and added 48 more population-matched individuals (based on qpAdm and PCA analysis) to have a good enough model for the p_0 calculations.

Our BAM files are sorted and deduplicated, and overlapping PE reads are merged by ATLAS (the BQ of the nucleotides in the overlapping part is set to 0). This allows ANGSD and other tools to exclude low-BQ bases from the analysis (effectively, low PE data can be used without counting one read twice for the GTs at any position).

For the markers we used the sites of the 1240K data set, as they are supposed to contain highly population-specific AIMs. We are not aware of any KIN guide on which marker set should be used for the analysis (this could be our ignorance, but we could not find a recommendation for real-world analysis in the article, the example targets, the GitHub repo, etc.). However, the 1240K set has been tested with a plethora of aDNA tools and found suitable for population genetic analyses; furthermore, ancIBD, READ, lcMLkin, and correctKin can also work with AADR marker set data, so in theory it should be sufficient for IBD/kinship analysis as well.

Surprisingly, KIN indicated lots of 'identical' pairs with very high log likelihoods (I've anonymized the sample IDs). These calls are certainly invalid, as the pairs share negligible IBD according to ancIBD and are detected as unrelated by READ and correctKin:

102 SampleA_merged_dedupmergedReads._SampleB_merged_dedup_mergedReads Identical Siblings 14.756 0.0 0.0 1.0 0 1
103 SampleA_merged_dedupmergedReads._SampleC_merged_dedup_mergedReads Identical Siblings 25.516 0.0 0.0 1.0 0 1

READ indicated no relations for the same sample pairs:

PairIndividuals Relationship Z_upper Z_lower
SampleASampleB Unrelated NA -50.95743382509111
SampleCSampleA Unrelated NA -45.712853493865126

(Only the order of the samples within a pair differs.)
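Because the sample order within a pair differs between the two outputs, the comparison has to be order-independent. Here is a minimal sketch of how that can be done in Python. It assumes the calls of each tool have already been parsed into dictionaries keyed by the two sample IDs. The function names and the dictionary layout are hypothetical and are not part of KIN or READ.

```python
def pair_key(a, b):
    """Order-independent key for a pair of sample IDs."""
    return tuple(sorted((a, b)))


def discordant_calls(calls_x, calls_y):
    """Return {pair: (label_x, label_y)} for pairs that both tools
    report but label differently, ignoring sample order."""
    norm_x = {pair_key(*pair): label for pair, label in calls_x.items()}
    norm_y = {pair_key(*pair): label for pair, label in calls_y.items()}
    shared = norm_x.keys() & norm_y.keys()
    return {p: (norm_x[p], norm_y[p])
            for p in sorted(shared) if norm_x[p] != norm_y[p]}


# Labels taken from the rows quoted above; the dictionary layout is assumed.
kin_calls = {("SampleA", "SampleB"): "Identical",
             ("SampleA", "SampleC"): "Identical"}
read_calls = {("SampleA", "SampleB"): "Unrelated",
              ("SampleC", "SampleA"): "Unrelated"}

for pair, (kin_label, read_label) in discordant_calls(kin_calls, read_calls).items():
    print(pair, kin_label, read_label)
```

Sorting the two IDs before building the key makes `("SampleC", "SampleA")` and `("SampleA", "SampleC")` resolve to the same pair.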

We have 47 identical sample pairs indicated by KIN. I've double-checked that no pair is counted twice (with a different sample order), as the number of relations (2145, plus 1 line for the header) equals the expected N*(N-1)/2 combinations between the 66 samples.
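For reference, the pair-count check can be sketched in a few lines of Python. This is only an illustration: the function names and the placeholder sample IDs are hypothetical, and it assumes the pair labels have already been split into two sample IDs.

```python
from itertools import combinations


def expected_pairs(n):
    """Number of unordered pairs among n samples: n*(n-1)/2."""
    return n * (n - 1) // 2


def duplicated_pairs(pairs):
    """Return the pairs that occur more than once, ignoring sample order."""
    seen = set()
    dupes = set()
    for a, b in pairs:
        key = frozenset((a, b))
        if key in seen:
            dupes.add(key)
        seen.add(key)
    return dupes


# Placeholder IDs standing in for the 66 analyzed samples.
samples = [f"S{i}" for i in range(66)]
pairs = list(combinations(samples, 2))

print(len(pairs), expected_pairs(len(samples)))
print(len(duplicated_pairs(pairs)))
```

A matching row count alone does not rule out one pair being listed twice while another is missing, which is why the duplicate check is kept as a separate step.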

I also checked that I did not mix up the BAM files. In the splitbams directory, I confirmed that the files of the indicated samples for a given chromosome (the actual split alignment data used for the analysis) had different sizes and contained different reads.

I am not sure whether the 1240K markers should be thinned by LD; however, the ROH identification/masking suggests that KIN shouldn't be sensitive to this (ancIBD is happy to work with the full data set).

We have to note that our samples do not have very low mean genome coverage (1.5x or better). We had seen that KIN, surprisingly, produces more FPs at the 2nd degree or higher for higher-coverage data (Fig. 5 of the manuscript); however, for identical samples there are virtually no FPs in your analysis, neither for KIN nor for READ.

We can likely rule out that this gross error is due to the minimal contamination and the applied '-c 0' option. So I tentatively concluded that, unlike READ, KIN may be very sensitive to recently admixed individuals that are not fully stratified.

Could you please tell us what signs we should see if our analysis lacks power? In the section on interpreting the results you mention that

hmm_parameters/p_0.txt : It has one float value representing average pairwise difference for unrelated individuals. While comparing to other methods like READ, one can compare p_0 to corresponding measure for background diversity.

In our case p_0 is 0.002434530570323989

I've checked p_all.csv and identical_p_all.csv and those numbers were also in this range for all samples.

Is there any recommendation you could suggest for our use case? When we analyzed only the 18 individuals from the same cemetery, KIN still called 9 identical pairs (2 with a likelihood of 20+), though again the situation is the same (the individuals are admixed and not fully stratified; however, this is very common for ancient data/cemeteries).

So we are stuck and unsure whether we have some technical issue that we are not aware of and, if so, how to re-analyze our data, or whether differences in population structure have a much greater effect on KIN than on the other methods tested (ancIBD, READ, and correctKin), whose results contradict those of KIN.