AI-sandbox / gnomix

A fast, scalable, and accurate local ancestry method.

Confused by implausible results on array data #28

Closed by buyske 1 year ago

buyske commented 2 years ago

I'm very confused by the implausible results I'm getting on two African American cohorts, one genotyped on an Axiom array and one on an Omni array. For each, I phased them using the Michigan server and ran gnomix against the pretrained model with the default settings, other than setting "inference: best" in config.yaml.

However, I am getting means of .26 for the proportion of EUR ancestry, .25 for NAT, and .44 for AHG in one cohort, and .15, .18, and .65 respectively in the other cohort. There's not a lot of individual variability; for example, in the first cohort, the min and max for AHG are .41 and .47.
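For readers who want to reproduce this kind of summary, here is a minimal sketch of how per-individual ancestry proportions can be computed from window-level calls. The input layout (a list of window lengths plus one list of population labels per haplotype) and the function name `ancestry_proportions` are assumptions made for illustration, not the gnomix output format or API.

```python
def ancestry_proportions(window_lengths, hap_labels, populations):
    """Length-weighted fraction of each population across an individual's haplotypes.

    window_lengths: length of each genomic window (any consistent unit).
    hap_labels: one list of population labels per haplotype, one label per window.
    populations: every label that may occur (hypothetical layout, see lead-in).
    """
    total = float(sum(window_lengths)) * len(hap_labels)
    proportions = {pop: 0.0 for pop in populations}
    for labels in hap_labels:
        for length, label in zip(window_lengths, labels):
            proportions[label] += length / total
    return proportions


# Toy example: two windows, two haplotypes for one individual.
example = ancestry_proportions(
    window_lengths=[10, 30],
    hap_labels=[["EUR", "AHG"], ["AHG", "AHG"]],
    populations=["EUR", "NAT", "AHG"],
)
```

Weighting by window length matters when windows differ in size; an unweighted count of windows would give a slightly different number.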

I get very similar results regardless of whether I have gnomix correct the phasing or whether I restrict to SNPs that are in the pretrained model. I'm sure I'm doing something wrong, but I'm baffled. I could use some guidance on how to debug this.

AlexIoannidis commented 2 years ago

These are phased and imputed using the Michigan server?

The erratic ancestry proportions are a sign that the SNPs are misbehaving. In particular, the high AHG ancestry is a sign that the model is attempting to fit segments using the most distant branch of the population tree, which means the model is seeing variants that it thinks it has never seen before. This means that something is going wrong with the merge between the SNPs in your data and the ones in the model. Either the builds don't match or, although the builds match, some of your SNP variants are defined on the opposite strand from those in the model.
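One way to check for the strand problem described above is to compare alleles site by site. The following is a minimal sketch, not part of gnomix: it assumes you have already loaded both variant sets into dictionaries keyed by (chromosome, position) with an allele pair as the value, and the names `classify_alleles` and `summarize` are hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_alleles(data_alleles, model_alleles):
    """Classify how a biallelic SNP in the data relates to the same site in the model."""
    d = tuple(a.upper() for a in data_alleles)
    m = tuple(a.upper() for a in model_alleles)
    if any(a not in COMPLEMENT for a in d + m):
        return "unsupported"  # indels, missing alleles, etc.
    flipped = tuple(COMPLEMENT[a] for a in d)
    if set(d) == set(flipped):
        return "palindromic"  # A/T or C/G: strand cannot be told from alleles alone
    if d == m:
        return "match"
    if d == m[::-1]:
        return "swapped"  # same strand, allele order reversed
    if flipped == m:
        return "strand_flip"
    if flipped == m[::-1]:
        return "strand_flip_swapped"
    return "mismatch"  # different alleles, possibly a build mismatch


def summarize(data_sites, model_sites):
    """Count categories over all model sites; both inputs map (chrom, pos) -> (a1, a2)."""
    counts = Counter()
    for key, model_alleles in model_sites.items():
        if key not in data_sites:
            counts["missing_from_data"] += 1
        else:
            counts[classify_alleles(data_sites[key], model_alleles)] += 1
    return counts


model = {("1", 100): ("A", "G"), ("1", 200): ("A", "G"), ("1", 300): ("A", "T"), ("1", 400): ("C", "T")}
data = {("1", 100): ("A", "G"), ("1", 200): ("T", "C"), ("1", 300): ("A", "T")}
report = summarize(data, model)
```

A large share of `strand_flip` or `mismatch` sites would point at a strand or build problem, while a large `missing_from_data` count would suggest the data do not cover the model's sites.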

buyske commented 2 years ago

Whoops, overlooked your reply.

I used the Michigan server for phasing, but I requested just phasing and not imputation (am I misunderstanding the process and I should be using imputed data for array-based genotyping?).

Based on looking at SNP locations, it looks like the builds match up. Does the Michigan server pipeline put the variants on the same strand as the model? Maybe I should try again with just the non-palindromic SNPs. I confess that the QC on the original genotyping was not done by me and I don't have ready access to the details of their QC pipeline (I'm seeing if I can track it down), so I don't know what strand alignment was done. I may have been allowing for too much magical thinking regarding the Michigan pre-phasing pipeline.

buyske commented 2 years ago

Update: I've run the data through the "HRC or 1000G Imputation preparation and checking" pipeline using CAAPA to straighten out any strand issues, and then used the Michigan imputation server for QC and phasing again with CAAPA as a reference. Did that with and without excluding palindromic SNPs, and then ran gnomix.

Unfortunately, I'm still getting the unlikely results.

AlexIoannidis commented 2 years ago

Yes, if you are using the models that were trained on whole genomes, then you do want to impute any genotype data to those same sites used in the model (given in the bim files included with the models). I'm editing the readme, because you are right that we do not make this clear anywhere.
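A quick way to see how far array data fall short of the model's sites is to measure what fraction of the model's positions are present in the data. This is a minimal sketch under assumptions: it relies only on the standard PLINK .bim column order (chromosome, variant ID, genetic distance, base-pair position, allele 1, allele 2), and the helper names `read_bim_sites` and `model_site_coverage` are hypothetical, not gnomix functions.

```python
def read_bim_sites(lines):
    """Collect (chromosome, position) pairs from PLINK .bim rows."""
    sites = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed rows
        chrom, pos = fields[0], int(fields[3])
        if chrom.startswith("chr"):
            chrom = chrom[3:]  # treat "chr1" and "1" as the same chromosome
        sites.add((chrom, pos))
    return sites


def model_site_coverage(model_sites, data_sites):
    """Fraction of model sites that are also present in the data."""
    if not model_sites:
        return 0.0
    return len(model_sites & data_sites) / len(model_sites)


# Toy rows standing in for the model's .bim file and for the user's variants.
model_rows = [
    "1 rs1 0 100 A G",
    "1 rs2 0 200 C T",
    "1 rs3 0 300 A C",
    "1 rs4 0 400 G T",
]
data_rows = [
    "chr1 rs1 0 100 A G",
    "chr1 rs3 0 300 A C",
    "chr1 rs9 0 999 A G",
]
coverage = model_site_coverage(read_bim_sites(model_rows), read_bim_sites(data_rows))
```

With real files, pass an open file handle instead of the toy lists. Low coverage on unimputed array data would be consistent with the advice above to impute to the model's sites first.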

buyske commented 1 year ago

I ended up training my own models and got entirely plausible results.