brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License

Inconsistent ancestry predictions between Somalier and Peddy #103

Open mpj5142 opened 2 years ago

mpj5142 commented 2 years ago

Hello,

I am currently using Somalier to check for relatedness and to compute ancestry PCs uniformly across several cohorts for a meta-analysis. Our cohorts are mostly individuals with known European ancestry; however, Somalier's ancestry function calls most samples as the AMR super-group based on the 1000 Genomes dataset.

[Image: AMPAD_affy_preimpute somalier_ancestry]

Other members of my lab have previously used Peddy for the same calculations, so I went back and checked the results with that software using the same underlying dataset (the only change was to remove the "chr" prefix from the chromosome names in the VCF file). Here, the results look as expected, with most samples labeled as the EUR super-group.

[Image: AMPAD_affy_preimpute pca_check]
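For readers who want to reproduce the "chr" removal step, here is a minimal sketch of one way to do it on uncompressed VCF text. It is an illustration only and not necessarily how the VCF in this thread was prepared; the function name `strip_chr_prefix` is hypothetical, and the sketch does not handle mitochondrial naming differences (for example `chrM` versus `MT`).

```python
def strip_chr_prefix(line: str) -> str:
    """Remove a leading "chr" from the contig name in one VCF line.

    Assumption: `line` is a single line of uncompressed VCF text.
    """
    contig_tag = "##contig=<ID=chr"
    if line.startswith(contig_tag):
        # Header contig declarations must be renamed to match the records.
        return line.replace(contig_tag, "##contig=<ID=", 1)
    if line.startswith("#"):
        # Other header lines (including the #CHROM column line) are unchanged.
        return line
    # Data record: the CHROM field is the first tab-separated column.
    return line[3:] if line.startswith("chr") else line


def strip_chr_from_vcf(lines):
    """Apply strip_chr_prefix to every line of an iterable of VCF lines."""
    return [strip_chr_prefix(line) for line in lines]
```

The header contig lines are rewritten together with the records so that the two stay consistent; a tool that validates contig names against the header would otherwise reject the file.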

For reference, here is the code I used for each software:

./somalier extract -d AMPAD_affy_preimpute/ --sites sites.hg38.vcf.gz -f Homo_sapiens_assembly38.fasta ROSMAP_affy_preimpute_hg38.vcf.gz

./somalier ancestry --labels ancestry-labels-1kg.tsv --n-pcs=10 -o AMPAD_affy_preimpute 1kg-somalier/*.somalier ++ AMPAD_affy_preimpute/*.somalier

python -m peddy --sites hg38 --plot --prefix AMPAD_affy_preimpute ROSMAP_affy_preimpute_hg38_nochr.vcf.gz ROSMAP_affy_genotypes_hg38_final.fam

I was wondering if you have come across this issue before, or would have any insights into the different results? (I can send over an example VCF if you would like to trouble-shoot; it will be a different cohort than the plots above, as those are restricted data.) Thanks!

brentp commented 2 years ago

Hi, I have known for a while that there are some issues with the somalier ancestry setup. You can trust the peddy results (as you note) much more. For somalier, I would use --n-pcs 4 or fewer. It treats each PC equally, even though the first few explain much more of the variance. That should work for easy cases (which yours appears to be). The problem is that somalier will confidently predict an ancestry even if the true ancestry is one it was never trained on.
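To illustrate the point about PCs explaining unequal amounts of variance, here is a small sketch on synthetic data. It is not somalier's implementation: the data, the group structure, and the `pca` helper are all assumptions made up for the example. Two groups differ along a single direction, so almost all of the structure lands in PC1 and the later PCs are mostly noise.

```python
import numpy as np


def pca(X, n_pcs):
    """Project the rows of X onto the top n_pcs principal components.

    Returns the projected samples and the fraction of total variance
    explained by each of the returned components.
    """
    Xc = X - X.mean(axis=0)  # center each column
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    return Xc @ Vt[:n_pcs].T, explained[:n_pcs]


rng = np.random.default_rng(0)

# Synthetic stand-in for a genotype matrix: 200 samples x 50 sites.
# Assumption: two groups of 100 samples separated along one direction.
group = np.repeat([0, 1], 100)
signal = np.outer(group, rng.normal(size=50)) * 3.0
X = signal + rng.normal(size=(200, 50))

pcs, explained = pca(X, n_pcs=10)

# Fraction of variance explained by each of the first 10 PCs.
print(np.round(explained, 3))
```

In a setup like this, a classifier that gives every PC the same weight feeds in several dimensions that carry little group information, which is one reason a smaller `--n-pcs` can be the safer choice.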

mpj5142 commented 2 years ago

Thanks Brent! Unfortunately, reducing the number of PCs to 4 still resulted in most samples being called as AMR.

I will note that I did not input any of the known ancestries for our samples when running Somalier. I can try this to see whether the samples with missing ancestry are predicted better, although Peddy is already giving results more in line with our expectations, so I may just stick with those results.

Thanks again for your help!

[Image: AMPAD_affy_preimpute somalier_ancestry]

brentp commented 2 years ago

Is your data from sequencing? Exome? WGS? Or from a chip?

mpj5142 commented 2 years ago

This is array data. I applied some basic QC filters (e.g. genotype call rate) before running Somalier. However, I obtained similar results when running the software on SNP array data imputed with TOPMed, as well as on WES- and WGS-based datasets.

cgroza commented 4 months ago

Hi,

I am experiencing the same issues.

I am using the 1KGP dataset as the reference, and somalier labels most of my samples as AMR. I have been using 5 PCs, but even when plotting just PC1 and PC2, the samples appear to be clustered in the wrong location (they should be clustering with AFR).