Compare ancestry inference performance of v4 RF models

gtiao commented 3 years ago

This ticket is aimed at doing the analysis for v4 population assignments (as opposed to writing the PR #494)

gtiao commented 3 years ago

Someone (Alicia? Konrad?) reported getting excellent results training an ancestry RF on 1KG+HGDP and applying it to the UKBB pan-ancestry project. Let's compare that to the gnomAD RF performance and see if we can get better results.

gtiao commented 2 years ago

This really only applies to WGS data (v3), not exome (v4)

ch-kr commented 2 years ago

We don't have labels for v4 that we can use (we have labels for v3). Mike was planning on looking at whether we should use only known labels (HGDP/TGP) or v3 imputed labels

mike-w-wilson commented 1 year ago

I moved the wrong ticket here. This will happen later in analysis, not this sprint.

klaricch commented 1 year ago

Decided on:

20 PCs
Training samples:
hgdp/tgp samples
v2 samples with known pops
v3 samples with known pop, for certain cohorts only (which v3 samples/cohorts to use as training samples was decided from an analysis that computed, per sample, the mean Euclidean distance to all samples in a given population, and the mean Euclidean distance to only the HGDP/1KG samples in each population)
Spike in v4 samples with race/ethnicity of "Arab" or "Persian"
min_prob parameter decided per pop using min_recall and min_precision of 0.99 (minimum recall (min_recall) is used to choose the minimum RF probability for each ancestry group. The min_recall cutoff is applied first; if the resulting minimum RF probability cutoff gives a precision lower than min_precision, the minimum RF probability cutoff with the highest recall that still meets min_precision is used instead.)
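
For reference, the per-sample mean Euclidean distance analysis mentioned in the training-sample item above could be sketched roughly as follows. This is a minimal illustration in plain Python, not the code that was actually run: the function name, the input layout (a list of PC-score tuples plus parallel label lists), and the choice to skip a sample's distance to itself are all assumptions.

```python
import math


def mean_distance_to_population(pcs, labels, pop, reference=None):
    """Per-sample mean Euclidean distance in PC space to samples labeled `pop`.

    pcs: list of PC-score tuples, one per sample (hypothetical layout).
    labels: list of population labels, parallel to `pcs`.
    reference: optional list of booleans, parallel to `pcs`; when given, only
        flagged samples (e.g. HGDP/1KG) in `pop` are used as targets.
    """
    target_idx = [
        j
        for j, label in enumerate(labels)
        if label == pop and (reference is None or reference[j])
    ]
    means = []
    for i, sample in enumerate(pcs):
        # Assumption: skip the sample itself so its zero self-distance
        # does not shrink the mean.
        others = [j for j in target_idx if j != i]
        if not others:
            means.append(float("nan"))
            continue
        total = sum(math.dist(sample, pcs[j]) for j in others)
        means.append(total / len(others))
    return means
```

Calling it once without `reference` and once with a mask marking the HGDP/1KG samples gives the two distances per sample that the item describes. Samples or cohorts sitting far from their labeled population in either measure would be candidates to leave out of training.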

min_prob_cutoffs={'afr': 0.93, 'ami': 0.96, 'amr': 0.86, 'asj': 0.88, 'eas': 0.96, 'fin': 0.91, 'mid': 0.56, 'nfe': 0.78, 'sas': 0.96}
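
A rough sketch of how a per-pop cutoff could be picked under the min_recall / min_precision rule described above. This is one reading of that rule, not the gnomad_qc implementation. Assumptions: evaluation samples are `(true_pop, predicted_pop, predicted_prob)` tuples, "min_recall applied first" means taking the strictest candidate cutoff that still reaches min_recall, ties in the fallback go to the lower cutoff, the fallback also applies when no cutoff reaches min_recall, and precision counts as 1.0 when nothing is assigned. All function names are hypothetical.

```python
def precision_recall(samples, pop, cutoff):
    """Precision and recall for `pop` when assignment requires prob >= cutoff.

    samples: iterable of (true_pop, predicted_pop, predicted_prob) tuples.
    """
    tp = fp = n_true = 0
    for true_pop, pred_pop, prob in samples:
        if true_pop == pop:
            n_true += 1
        if pred_pop == pop and prob >= cutoff:
            if true_pop == pop:
                tp += 1
            else:
                fp += 1
    # Assumption: with no assigned samples, treat precision as 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / n_true if n_true else 0.0
    return precision, recall


def choose_min_prob(samples, pop, cutoffs, min_recall=0.99, min_precision=0.99):
    """Pick a min_prob cutoff for `pop`, or None if no cutoff meets min_precision."""
    samples = list(samples)
    stats = {c: precision_recall(samples, pop, c) for c in cutoffs}

    # Step 1: strictest cutoff that still reaches min_recall.
    meets_recall = [c for c, (_, recall) in stats.items() if recall >= min_recall]
    if meets_recall:
        chosen = max(meets_recall)
        if stats[chosen][0] >= min_precision:
            return chosen

    # Step 2: precision too low (or recall unreachable), so fall back to the
    # cutoff with the highest recall among those meeting min_precision.
    meets_precision = [c for c, (prec, _) in stats.items() if prec >= min_precision]
    if not meets_precision:
        return None
    return max(meets_precision, key=lambda c: (stats[c][1], -c))
```

Running `{pop: choose_min_prob(samples, pop, cutoffs) for pop in pops}` over a grid of candidate cutoffs would produce a dict shaped like `min_prob_cutoffs` above.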

broadinstitute / gnomad_qc

Compare ancestry inference performance of v4 RF models #161