analysis of recently admixed populations?

zmaroti commented 5 months ago

Hi,

We wanted to test the method on people with close relatives. We had used ancIBD, READ, and correctKin to confirm close relations and also indicate likely unrelated people (no IBD length shared and other methods indicating them as unrelated though we are avare that READ can only indicate up to 2nd relations so exclusion was mainly based on ancIBD and correctKin).

Our individuals are in a genetic cline. It is known from historical events that migrants overlayed the locals at the given period and this could be also seen on the distribution of the samples on PCA furthermore they could be modelled by qpAdm using the unadmixed homogenous sources of migrants/locals.

We also wanted to apply KIN to confirm, our findings with other methods as it offers to indicate more distant relations than READ.

We have minimal contamination (confirmed by ANGSD X contamination, and MT/Schmutzi as well). We used the default options with KINgaroo (with -c 0) and also with KIN.

We analyzed 18 samples from the same cemetery, and added 48 more population matched people (based on qpADM and PCA analysis) to have a good enough model for p_0 calculations.

Our BAM files are sorted, dedupped, and overlapping PE reads are merged by ATLAS (the BQ of the nucleotides of the overlapping part is set to 0). This allows AGSD and other tools to exclude low BQ bases from the analysis (effectively low PE data can be used without counting one read twice for the GTs at any position).

For the markers we used the sites of 1240K data set as they are supposed to contain highly POP specific AIMs. We are not aware that KIN has a GUIDE on what markers et should be used for the analysis (could be our ignorance but we couldn't find it in the article, not in the example targets, github repo, etc what is recommended for real world analysis). However it was tested on a plethora of aDNA tools that it is suitable for population genetic analyses, furthermore ancIBD, READ, lcMLkin, correctKin can also work with AADR marker set data so in theory it should be sufficient for IBD/kinship analysis as well.

Surprisingly KIN indicated lots of 'identical' members with very high log likelihoods (I've anonymized the sample IDs), that are for sure invalid as they share negligible IBD by ancIBD, and detected as unrelated by READ and correctKIN:

102 SampleA_merged_dedupmergedReads._SampleB_merged_dedup_mergedReads Identical Siblings 14.756 0.0 0.0 1.0 0 1 103 SampleA_merged_dedupmergedReads._SampleC_merged_dedup_mergedReads Identical Siblings 25.516 0.0 0.0 1.0 0 1

READ indicated no relations for the same sample pairs:

PairIndividuals Relationship Z_upper Z_lower SampleASampleB Unrelated NA -50.95743382509111 SampleCSampleA Unrelated NA -45.712853493865126 (order of samples were only different)

We have 47 identical sample pairs indicated by KIN. I've double checked no pairs are counted twice (with different order) as the number of relations (2145 +1 line for header) is equal with the expected N*(N-1)/2 combinations between the 66 samples.

I also checked that I did not mess up the BAM files. And in the splitbams directory I've confirmed that the indicated samples for the given chromosome (the actual split alignment data that was used for the analysis) had different sizes, and different reads.

I am not sure whether the 1240K markers should be thinned by LD, however ROH identification/masking suggests that KIN shouldn't be sensitive for this (ancIBD is happy to work with the full data set).

We have to note that our samples are not very low mean genome coverage (1.5x or better). We had seen that KIN surprisingly results in more FP at 2nd degree or higher for higher coverage data (FIG5 of the manuscript) however for identical samples there are virtually no FPs in your analysis not for KIN neither for READ.

We can likely reject that this gross error is due to the minimal contamination and the applied '-c 0' option. So I kind of concluded that unlike READ, KIN may be very sensitive to non fully stratified recently admixed individuals.

Could you please provide, what signs we should see that our analysis lacks power? At the interpretation of results you mention, that

hmm_parameters/p_0.txt : It has one float value representing average pairwise difference for unrelated individuals. While comparing to other methods like READ, one can compare p_0 to corresponding measure for background diversity.

In our case p_0 is 0.002434530570323989

I've checked p_all.csv and identical_p_all.csv and those numbers were also in this range for all samples.

Is there any recommendation you could suggest for our use case? If we analyzed only the 18 individuals of the same cemetery we still got 9 identical (2 with 20+ likelihood) samples by KIN, though again the situation is the same (individuals are admixed, non fully stratified individuals, however this is very common for ancient data/cemetery).

So we are stuck and unsure whether we have some technical issue that we are not aware of. And if so, how to re-analyze our data. Or whether the differences in the population structure has much greater effect on KIN than for the other tested methods contradicting these results (ancIBD, READ, or correctKin).

DivyaratanPopli commented 3 months ago

Hi, sorry about a late reply, I was busy with my PhD thesis defense. Since identical individuals are very easy to identify, I think the conflict between KIN and other methods could be because of some issue with the implementation. It seems that the p_0 value from KIN is quite low. It could be because the program is looking at off-target sites. In the bed file,can you check that the 2nd and 3rd columns show position and position+1?

For example, I have a 1240K bedfile which begins like this:

1 752565 752566 G A 1 776545 776546 A G 1 832917 832918 T C 1 842012 842013 T G 1 846863 846864 G C

zmaroti commented 3 months ago

Hi,

(sorry for the late reply from me now, we were on holidays and just returned home).

I've checked, we have proper coordinates in our bed file:

$ head 1240K.bed
1       752566  752567  G       A
1       776546  776547  A       G
1       832918  832919  T       C
1       842013  842014  T       G
1       846864  846865  G       C
1       869303  869304  C       T
1       891021  891022  G       A
1       893462  893463  C       T
1       896271  896272  C       T
1       903426  903427  C       T

Since then we also tested the same data with lcMLKin and it was not indicating the given samples as identical either. So we are kind of stuck on what could be the reason here, whether the different genetic structure of the individuals in this cohort of saples or some unknowm technical issue. The BAM files are standard ancient pipe data with seemingly no issues (PCA, qpAdm, all the regular pop genetic tools gave us plausible results, and we don'T have significant exogenous or endogenous contamination /no heteroplasmy on mitochondria, and X contaminatipon of males seems OK, Y typing is ok etc/).

ps: I am not sure what output files could give us hints on whats going on. Seemingly the target samples are all fine (goodlibraries.csv is identical with out target_samples.txt). The marker overlap is kind of typical for the ancient samples considering the 1240k marker set (in the range of 115k - 455k markers), these marker counts are sufficient for the other methods that also uses the 1240K marker set).

///////////////////////////////////////// May be it is nothworthy to mention that we have PE reads in our BAM files, and we "merge" the overlapping portions of the reads by ATLAS (using the updateQuality mergingMethod=keepRandomRead options; thus the BQ of the overlapping bases of PE reads of random pair are set to 0, and the BQ is updated /ie lowered in case the nucleotid mismatches on the paired end reads between/) so a typical short ancient read looks like this in the BAM file:

A01493:81:H7LTTDSX5:1:1306:10872:31219 99 1 12977 0 52M = 12977 52 CAGAGCCCAGGCCAGGGGCCCCCAAGAAAGGCTCTGGTGGAGAACCTGTGCA FFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF MC:Z:52M MD:Z:52 PG:Z:MarkDuplicates RG:Z:MDH-656_S33_L001 NM:i:0 AS:i:52 XS:i:52 A01493:81:H7LTTDSX5:1:1306:10872:31219 147 1 12977 0 52M = 12977 -52 CAGAGCCCAGGCCAGGGGCCCCCAAGAAAGGCTCTGGTGGAGAACCTGTGCA !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! MC:Z:52M MD:Z:52 PG:Z:MarkDuplicates RG:Z:MDH-656_S33_L001 NM:i:0 AS:i:52 XS:i:52

So in one random read the overlapping bases with the other pair the BQ of bases are set to 0. In this case the whole 52 length of DNA fragment is set to BQ 0 on one read as we have 150BP PE reads that fully includes the sequence from both direction. This way angsd and any variant calling method that also onsiders the BQ of the sequence will not use this information (will make randome allele calling unbiassed, when overlapping PE (2x reads) and non overlapping part of PE reads are crossing a marker), furthermore we can avoid most of sequencing errors on majority of our target /since the reads mainly overlaps at 150PE reads, and bases that mismatch between the overlapping portion are downgraded on BQ quality and would fall below the BQ threshold of angsd for example).

We tested that these BAM files are OK for all other tools we use (except schmutzi that expects single reads so we collapse PE reads there). However it could be that it confuses KIN's genotyping pipe.

Sincerely, Zoltan

DivyaratanPopli / Kinship_Inference

analysis of recently admixed populations? #26