Overestimation of assignment to one population in panel?

hyperoplus commented 3 years ago

Hi everyone,

I recently tried to estimate admixture proportions for a large number of individuals (from several populations) with fastNGSadmix, using a reference panel with two populations (the source populations for the colonization of the others). Briefly, the patterns in the output suggest that the program is overestimating membership to one of the populations in the reference panel. Please see this example: https://drive.google.com/file/d/1HjUdBIUv1SgAklmpkwoO2iRwo7ZTwVYg/view?usp=sharing

The first row shows the K=2 results from NGSadmix. When I run fastNGSadmix with the full panel (all SNPs), all individuals are assigned 100% to reference population 1. I tried restricting the analysis to roughly diagnostic SNPs (allele frequency difference between the two reference populations >0.8 or >0.9, see the two bottom rows). Although this somewhat improves the estimates, the pattern is still some way off what I was expecting (Ref pop 1 still predominates). Natural history information indicates that most of these populations should be derived from source population 2 rather than population 1, so the high assignment to population 1 by fastNGSadmix is puzzling. Is there some reason why this could be happening?
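For reference, the diagnostic-SNP filter described above could be sketched roughly as follows. This is only an illustration and not part of fastNGSadmix: the function name `filter_diagnostic` is hypothetical, the rows are toy values rather than real data, and the code assumes a whitespace-delimited panel whose last two columns hold the frequencies in the two reference populations.

```python
def filter_diagnostic(panel_lines, min_diff=0.8):
    """Keep the header plus sites whose two population frequencies
    differ by more than min_diff.

    Assumes whitespace-delimited rows whose last two fields are the
    frequencies in reference population 1 and 2 (assumed layout).
    """
    header, *rows = [ln for ln in panel_lines if ln.strip()]
    kept = [header]
    for row in rows:
        fields = row.split()
        f1, f2 = float(fields[-2]), float(fields[-1])
        if abs(f1 - f2) > min_diff:
            kept.append(row)
    return kept


# Toy rows for illustration only (not taken from the real panel).
example = [
    "id chr pos name A0_freq A1 K1 K2",
    "site_a chr1 100 site_a T C 0.95 0.05",  # large difference, kept
    "site_b chr1 200 site_b G A 0.60 0.40",  # small difference, dropped
]
for line in filter_diagnostic(example):
    print(line)
```

Raising `min_diff` to 0.9 would give the stricter of the two thresholds mentioned above.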

Best regards, Pedro Andrade

e-jorsboe commented 3 years ago

Hi Pedro,

Thanks for using fastNGSadmix.

I am a bit confused about what your reference panel for fastNGSadmix is. Could you briefly explain how you ran NGSadmix and how you ran fastNGSadmix?

I ask because "Reference population 1" and "Reference population 2" look different between the NGSadmix and the fastNGSadmix results, which might be the cause of the issue.

hyperoplus commented 3 years ago

Hi Emil,

For NGSadmix I didn't use a reference panel. I started by running ANGSD to generate a Beagle file with the full dataset, and used this as input to NGSadmix (testing a different value of K in each run). So this is a reference panel-free approach, and individuals are assigned to hypothetical clusters. I chose to show K=2 in the figure I attached because it's the closest to the situation I want to test with fastNGSadmix.

With fastNGSadmix, I'm hoping to explicitly calculate the admixture proportions of all individuals with respect to each of two reference populations (the putative source groups for the multiple other populations). For this I'm generating, for each individual, a Beagle file with genotype likelihoods, and running fastNGSadmix with a custom reference panel I made from the allele frequencies of the two populations. This is a sample of the panel:

id  chr pos name    A0_freq A1  K1  K2
chr1_128182 chr1    128182  chr1_128182 T   C   0.999992    0.074422
chr1_1557805    chr1    1557805 chr1_1557805    C   T   0.999996    0.000004
chr1_1943356    chr1    1943356 chr1_1943356    T   A   0.999992    0.000002
chr1_2257176    chr1    2257176 chr1_2257176    T   A   0.000003    0.918119
chr1_2257227    chr1    2257227 chr1_2257227    T   A   0.000002    0.944008
chr1_4794816    chr1    4794816 chr1_4794816    A   T   0.999996    0.042304
chr1_5108435    chr1    5108435 chr1_5108435    C   T   0.999997    0.086662
chr1_5109540    chr1    5109540 chr1_5109540    T   A   0.956321    0.035957
chr1_6183386    chr1    6183386 chr1_6183386    G   A   0.999995    0.000004

My expectation was that, in fastNGSadmix, the samples from the two populations in the panel would be overwhelmingly assigned to their population of origin. In the figure you can see that this is true for Ref pop 1, but not for Ref pop 2.

Overall, there seems to be a tendency for samples in all populations to be overassigned to Ref pop 1, which seems a bit odd. As I mentioned in my first post, the results in the figure were calculated from SNPs that should have high diagnostic value (allele frequency difference > 0.9), but the assignments were still strange. When I used a panel with all SNPs, every sample was assigned 100% to Ref pop 1.

Best, Pedro

e-jorsboe commented 3 years ago

Hi Pedro,

Just so that I can understand your setup and help you in the best way, I have to ask these questions:

With NGSadmix, do you run all the samples in your plot in a single run?

Then for fastNGSadmix, do you have a reference panel made from some other samples? How did you generate the reference panel (from a plink file, for example), and what data is it based on? And do you then run fastNGSadmix on each of the samples from the plot (so you analyse the same samples as with NGSadmix), but with this reference panel?

hyperoplus commented 3 years ago

The samples are the same in all runs, whether NGSadmix or fastNGSadmix. For NGSadmix I analysed all samples at the same time, without prior information on population of origin (as many Structure-like analyses tend to be run). For fastNGSadmix I ran each sample individually, with the objective of calculating membership proportions for each of the two main populations.

The reference panel for fastNGSadmix was made using allele frequencies calculated from the data. I used ANGSD's doMaf option (reference allele fixed as major) to calculate the frequencies of the major allele for Ref pop 1 and Ref pop 2, and used those values to manually construct a panel as shown above. Then, for each sample, I: 1) calculated genotype likelihoods in Beagle format using ANGSD; 2) used those genotype likelihoods in fastNGSadmix to calculate the proportion of membership to each of the two reference populations.
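The manual panel construction described above could be sketched as below. This is an assumption-laden illustration, not the script actually used: `build_panel` and its input layout are hypothetical, the output header simply mirrors the panel sample shown earlier, and the clamping with `eps` is an added design choice. It also assumes both input frequencies refer to the same allele (A0); if a tool reports the frequency of the other allele, it would need converting to `1 - f` first, so it is worth double-checking which allele the frequency column refers to.

```python
def build_panel(pop1, pop2, eps=1e-6):
    """Merge two per-population frequency tables into panel rows.

    pop1, pop2: dicts mapping (chrom, pos) -> (a0, a1, freq_of_a0)
                (hypothetical input layout).
    Only sites present in both tables with identical alleles are kept.
    Frequencies are clamped to [eps, 1 - eps] (an assumption of this
    sketch) so that no value is exactly 0 or 1.
    """
    rows = ["id chr pos name A0_freq A1 K1 K2"]
    for key in sorted(pop1.keys() & pop2.keys()):
        a0, a1, f1 = pop1[key]
        b0, b1, f2 = pop2[key]
        if (a0, a1) != (b0, b1):
            continue  # alleles disagree between populations; skip the site
        chrom, pos = key
        name = f"{chrom}_{pos}"
        f1 = min(max(f1, eps), 1 - eps)
        f2 = min(max(f2, eps), 1 - eps)
        rows.append(f"{name} {chrom} {pos} {name} {a0} {a1} {f1:.6f} {f2:.6f}")
    return rows
```

Skipping sites where the two tables disagree on the alleles is deliberately conservative: a frequency attached to the wrong allele at even a fraction of sites would bias every downstream estimate.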

So, since the fastNGSadmix runs include the same individuals that were used to calculate the allele frequencies of the reference panel, I expected those results to be somewhat redundant, with individuals placed 100% (or close to it) in their respective cluster. In any case, this shouldn't affect the membership proportions of samples from other populations.
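To make that expectation concrete, here is a minimal sketch of an EM estimate of admixture proportions for one sample when the population frequencies are held fixed. It is a simplified, NGSadmix-style update written for illustration, not fastNGSadmix's actual implementation, and it leaves out anything else the real program does. The function name `estimate_q`, the input layout, and the toy values are all assumptions; frequencies must lie strictly between 0 and 1.

```python
def estimate_q(gls, freqs, n_iter=200):
    """EM for admixture proportions of one sample, frequencies fixed.

    gls:   list of (L0, L1, L2), the likelihoods of carrying 0, 1 or 2
           copies of the allele whose frequency is given in freqs.
    freqs: list of per-site tuples, one frequency per reference
           population, each strictly between 0 and 1.
    """
    k = len(freqs[0])
    q = [1.0 / k] * k  # start from equal proportions
    for _ in range(n_iter):
        new_q = [0.0] * k
        for gl, f in zip(gls, freqs):
            # Allele frequency expected for this individual at this site.
            h = sum(qk * fk for qk, fk in zip(q, f))
            # Hardy-Weinberg genotype prior given h.
            prior = ((1 - h) ** 2, 2 * h * (1 - h), h ** 2)
            post = [l * p for l, p in zip(gl, prior)]
            exp_g = (post[1] + 2 * post[2]) / sum(post)  # expected copies
            for i in range(k):
                new_q[i] += exp_g * q[i] * f[i] / h
                new_q[i] += (2 - exp_g) * q[i] * (1 - f[i]) / (1 - h)
        q = [v / (2 * len(gls)) for v in new_q]
    return q


# Toy check: a sample that carries zero copies of an allele that is
# common in population 1 and rare in population 2, at every site.
toy_freqs = [(0.99, 0.01)] * 20
toy_gls = [(1.0, 0.0, 0.0)] * 20
print([round(v, 3) for v in estimate_q(toy_gls, toy_freqs)])
```

In this toy case the estimate should sit almost entirely on the second population. If a comparable check on real data lands on the wrong population, one thing worth ruling out is a mismatch between the allele the genotype likelihoods count and the allele the panel frequencies describe.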

e-jorsboe commented 3 years ago

Sorry about the late answer.

So you estimate the frequencies for all of the samples in ANGSD, and then use those as a reference panel? And then you run each sample against that reference panel, knowing that it is already included in the reference panel?

How can you get frequencies for population 1 and 2 when using ANGSD on data with admixed individuals?

e-jorsboe commented 3 years ago

Hi, sorry for being so late in getting back to you. Is this still an issue?
