broadinstitute / seqr-loading-pipelines

hail-based pipelines for annotating variant callsets and exporting them to elasticsearch
MIT License

Audit of Sample QC Genetic Ancestry Imputation on gnomAD v2+CMG vs gnomAD v4 Model #853

Open matren395 opened 1 month ago

matren395 commented 1 month ago

Investigate differences in genetic ancestry imputation in seqr data when upgrading the imputed genetic ancestry model from gnomAD v2+CMG to v4. In the GATK WES v23 callset, we've seen modest differences, with classifications shifting back and forth between 'Middle Eastern' and 'Other' across the two models. I posted a fairly large Slack message about this in #cmg-analysis this week, which I'll post below:

Hi analysts and analysis team! Apologies for the spam, but some modestly analysis-relevant seqr news below.

The seqr team has been making some improvements to our sample QC code (which produces per-sample flags as well as imputed genetic ancestry), both to automate it with future loading and to get it to run on :dragon: DRAGEN samples. For the imputed genetic ancestry, we are updating the model we use from the previous Random Forest model & loadings from gnomAD v2 (+ CMG samples) to gnomAD v4. The prior model is very old and was written with some (now deprecated and) extremely outdated versions of software packages that make it unreadable in recent versions of Python and Hail. Single runs are still possible, but this poses a “nigh-insurmountable technical barrier” to automation. Plus, v4 should just be better.

Working from the most recent Whole Exome Sequencing (WES) callset with ~19,000 samples, we’ve compared the imputed genetic ancestry from the prior model (v2+CMG) to the recent one (v4). Note that there are CMG samples in this exome callset, which may bias the results. I’d be happy to exclude them and look again, if people would like.

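The overall breakdown is as follows (the Middle Eastern label is `mid` in v4 and `mde` in v2+CMG):

| Label | v4 | v2+CMG |
| --- | ---: | ---: |
| nfe | 9480 | 11801 |
| oth | 5578 | 2928 |
| asj | 306 | 294 |
| amr | 601 | 621 |
| afr | 1149 | 1070 |
| eas | 361 | 419 |
| sas | 973 | 964 |
| mid / mde | 655 | 1012 |
| fin | 28 | 22 |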
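As a quick sanity check on the breakdown above, here is a minimal sketch that totals each model's counts and lists the net change per label. The counts are copied from the message. Treating the v2+CMG `mde` label as the same category as v4 `mid` is an assumption made for the comparison, and `net_change` is a hypothetical helper, not part of the seqr pipeline.

```python
# Per-label counts copied from the breakdown above.
# Assumption: the v2+CMG label "mde" is keyed as "mid" so the two models line up.
v4 = {"nfe": 9480, "oth": 5578, "asj": 306, "amr": 601, "afr": 1149,
      "eas": 361, "sas": 973, "mid": 655, "fin": 28}
v2_cmg = {"nfe": 11801, "oth": 2928, "asj": 294, "amr": 621, "afr": 1070,
          "eas": 419, "sas": 964, "mid": 1012, "fin": 22}


def net_change(old, new):
    """Return new count minus old count for every label in `old`."""
    return {label: new[label] - old[label] for label in old}


total_v4 = sum(v4.values())
total_v2_cmg = sum(v2_cmg.values())
delta = net_change(v2_cmg, v4)

print(f"total samples: v2+CMG={total_v2_cmg}, v4={total_v4}")
# Print labels from largest loss to largest gain.
for label, change in sorted(delta.items(), key=lambda kv: kv[1]):
    print(f"{label}: {v2_cmg[label]:>6} -> {v4[label]:>6} ({change:+d})")
```

Both columns sum to the same number of samples, so the net changes cancel out. Net changes alone hide the per-sample movement between labels, which is what the row-by-column matrix captures.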

The matrix below compares how individuals change from the prior model (rows) to the recent model (columns). Of note is the shuffling of mid<->oth imputation: 899 previously mid samples are now oth, and conversely 513 previously oth samples are now mid. More broadly, the v4 model is more willing to label samples as oth that were not labelled that way previously.

In general, gnomAD v4 is a much more diverse dataset than gnomAD v2+CMG. v4 has ~3000 people of mid genetic ancestry, while gnomAD v2 has none (labelled) and only ~350 reported from the CMG Gleeson cohort. Further, CMG Gleeson was pretty opaque (!?) about how these IDs were recorded, and the samples might’ve been collected in close proximity to each other or potentially in the same place. v4 may be classifying these samples differently simply because it doesn’t have any training samples from that exact setting. Overall, to quote Mike Wilson, v4 should be better moving forward at classifying data.

Screenshot 2024-07-23 at 3 48 23 PM
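A prior-model-by-recent-model matrix of this kind can be sketched with the standard library alone. This is only an illustration: `transition_matrix` is a hypothetical helper, the sample IDs and labels below are made-up toy data (not taken from the callset), and the real comparison may well have been produced differently.

```python
from collections import Counter


def transition_matrix(old_labels, new_labels):
    """Count samples per (old label, new label) pair.

    Both arguments map sample ID -> imputed ancestry label; only samples
    present in both mappings are counted.
    """
    shared = old_labels.keys() & new_labels.keys()
    return Counter((old_labels[s], new_labels[s]) for s in shared)


# Toy, made-up assignments for illustration only.
old = {"s1": "mde", "s2": "mde", "s3": "oth", "s4": "nfe", "s5": "nfe"}
new = {"s1": "oth", "s2": "mid", "s3": "mid", "s4": "nfe", "s5": "oth"}

matrix = transition_matrix(old, new)
rows = sorted({o for o, _ in matrix})
cols = sorted({n for _, n in matrix})

# Rows are prior-model labels, columns are recent-model labels.
print("v2+CMG \\ v4", *cols, sep="\t")
for r in rows:
    print(r, *(matrix[(r, c)] for c in cols), sep="\t")
```

Off-diagonal cells are the reclassified samples; pairs that never occur read as 0 because `Counter` returns 0 for missing keys.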