Closed by gtiao 1 year ago
Someone (Alicia? Konrad?) reported getting excellent results training an ancestry RF on 1KG+HGDP and applying it to the UKBB pan-ancestry project. Let's compare that to the gnomAD RF performance and see if we can get better results.
This really only applies to WGS data (v3), not exomes (v4).
We don't have labels for v4 that we can use (we do have labels for v3). Mike was planning on looking at whether we should use only known labels (HGDP/TGP) or v3 imputed labels.
I moved the wrong ticket here. This will happen later in analysis, not this sprint.
Decided on:
(A minimum recall (`min_recall`) is used to choose per-ancestry-group minimum RF probabilities. This `min_recall` cutoff is applied first, and if the chosen minimum RF probability cutoff results in a precision lower than `min_precision`, the minimum RF probability with the highest recall that meets `min_precision` is used.)
`min_prob_cutoffs={'afr': 0.93, 'ami': 0.96, 'amr': 0.86, 'asj': 0.88, 'eas': 0.96, 'fin': 0.91, 'mid': 0.56, 'nfe': 0.78, 'sas': 0.96}`
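For anyone reading along, here is a rough sketch of how I read the cutoff selection described above, for a single ancestry group. It is not the gnomAD implementation: the function names (`precision_recall`, `choose_min_prob`), the candidate grid of 0.01 to 0.99, and the reading of "applied first" as "take the strictest cutoff that still meets `min_recall`" are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def precision_recall(
    probs: Sequence[float], is_member: Sequence[bool], cutoff: float
) -> Tuple[float, float]:
    """Precision and recall when samples with prob >= cutoff are assigned to the group."""
    assigned = [m for p, m in zip(probs, is_member) if p >= cutoff]
    true_pos = sum(assigned)
    total_members = sum(is_member)
    # Assumption: with nothing assigned, precision is treated as 0.0.
    precision = true_pos / len(assigned) if assigned else 0.0
    recall = true_pos / total_members if total_members else 0.0
    return precision, recall


def choose_min_prob(
    probs: Sequence[float],
    is_member: Sequence[bool],
    min_recall: float,
    min_precision: float,
    candidates: Optional[List[float]] = None,
) -> Optional[float]:
    """Pick a minimum RF probability for one group (hypothetical helper)."""
    if candidates is None:
        # Assumed candidate grid: 0.01, 0.02, ..., 0.99.
        candidates = [i / 100 for i in range(1, 100)]
    # Each entry is (cutoff, precision, recall).
    stats = [(c, *precision_recall(probs, is_member, c)) for c in candidates]

    # Step 1: the strictest cutoff that still meets min_recall.
    meets_recall = [s for s in stats if s[2] >= min_recall]
    if meets_recall:
        cutoff, precision, _ = max(meets_recall, key=lambda s: s[0])
        if precision >= min_precision:
            return cutoff

    # Step 2: fall back to the cutoff with the highest recall among those
    # that meet min_precision (ties broken by the higher cutoff).
    meets_precision = [s for s in stats if s[1] >= min_precision]
    if not meets_precision:
        return None
    return max(meets_precision, key=lambda s: (s[2], s[0]))[0]


# Toy data: RF probabilities for one group and whether each sample truly belongs to it.
toy_probs = [0.99, 0.95, 0.90, 0.80, 0.60, 0.55]
toy_truth = [True, True, True, False, True, False]
recall_driven = choose_min_prob(toy_probs, toy_truth, min_recall=0.5, min_precision=0.9)
precision_fallback = choose_min_prob(toy_probs, toy_truth, min_recall=1.0, min_precision=0.9)
```

In the toy data, asking for full recall forces the cutoff down to 0.60, where precision drops to 0.8, so the second call falls back to the precision-constrained choice. Running this once per ancestry group on labeled samples would yield a dictionary shaped like `min_prob_cutoffs`.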
This ticket is aimed at doing the analysis for v4 population assignments (as opposed to writing the PR #494)