broadinstitute / lrma-aou1-panel-creation

Pipelines and evaluations covering integration, phasing, and imputation of short and structural variants for the AoU Phase 1 long-reads callset.

Perform leave-out evaluation using panel manually created with FixVariantCollisions. #28

Closed samuelklee closed 3 weeks ago

samuelklee commented 1 month ago

This lets us proceed without waiting on #27. The evaluation will use a panel created manually by taking the Shapeit4 output and running it through FixVariantCollisions, with all calls given unit weight, before continuing to the PanGenie panel-creation script as usual.

samuelklee commented 4 weeks ago

Note that Vcfdist metrics are calculated over non-TR/homopolymer regions, while overlap metrics are calculated over all regions but only multiallelics are included in the count.

Shapeit4 -> PanGenie script -> final panel (baseline, from #21; this goes through the default PanGenie panel-creation script, which just drops an entire site if there are any inconsistent haplotypes in any of the alleles):

PhasedPanelEvaluation (final panel + all stages): https://app.terra.bio/#workspaces/broad-firecloud-dsde/lrma-aou1-panel-creation-hprc-only/job_history/ba390f1d-82f2-403b-a00b-4c3fe795a5e3
LeaveOutEvaluation: https://app.terra.bio/#workspaces/broad-firecloud-dsde/lrma-aou1-panel-creation-hprc-only/job_history/27e954fd-d261-4bb8-a730-e0336268f376

Shapeit4 HG002 Vcfdist:

VAR_TYPE  THRESHOLD  MIN_QUAL  TRUTH_TP  QUERY_TP  TRUTH_FN  QUERY_FP  PREC      RECALL    F1_SCORE  F1_QSCORE
SNP       NONE       0         16441     16424     2304      322       0.980772  0.877087  0.926036  11.309807
SNP       BEST       0         16441     16424     2304      322       0.980772  0.877087  0.926036  11.309807
INDEL     NONE       0         994       1067      132       238       0.817625  0.882771  0.848950  8.208787
INDEL     BEST       0         994       1067      132       238       0.817625  0.882771  0.848950  8.208787
SV        NONE       0         23        21        5         3         0.875000  0.821429  0.847368  8.163556
SV        BEST       0         23        21        5         3         0.875000  0.821429  0.847368  8.163556
ALL       NONE       0         17458     17512     2441      563       0.968852  0.877331  0.920823  11.013992
ALL       BEST       0         17458     17512     2441      563       0.968852  0.877331  0.920823  11.013992
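
For anyone reading these tables, here is a minimal sketch of how the summary columns appear to relate to the count columns. The formulas are inferred from the numbers in this thread (they reproduce the rows above to rounding) rather than taken from the Vcfdist source, and the function name `vcfdist_summary` is hypothetical.

```python
import math

def vcfdist_summary(truth_tp, query_tp, truth_fn, query_fp):
    """Recompute PREC, RECALL, F1_SCORE, F1_QSCORE from the count columns.

    Assumed relationships (inferred from the table values, not from Vcfdist code):
      PREC      = QUERY_TP / (QUERY_TP + QUERY_FP)
      RECALL    = TRUTH_TP / (TRUTH_TP + TRUTH_FN)
      F1_SCORE  = harmonic mean of PREC and RECALL
      F1_QSCORE = -10 * log10(1 - F1_SCORE)
    """
    prec = query_tp / (query_tp + query_fp)
    recall = truth_tp / (truth_tp + truth_fn)
    f1 = 2 * prec * recall / (prec + recall)
    f1_qscore = -10 * math.log10(1 - f1)
    return prec, recall, f1, f1_qscore

# SNP row of the Shapeit4 HG002 table above.
prec, recall, f1, f1_qscore = vcfdist_summary(16441, 16424, 2304, 322)
print(f"{prec:.6f} {recall:.6f} {f1:.6f} {f1_qscore:.6f}")
```

Note that precision is computed on the query side and recall on the truth side, which is why TRUTH_TP and QUERY_TP can differ within a row.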

Shapeit4 overlap metrics:

NUM_INCONSISTENT_ALLELES  NUM_CONSISTENT_ALLELES  NUM_INCONSISTENT_SITES  NUM_CONSISTENT_SITES
683                       10976                   308                     3557

Final panel HG002 Vcfdist:

VAR_TYPE  THRESHOLD  MIN_QUAL  TRUTH_TP  QUERY_TP  TRUTH_FN  QUERY_FP  PREC      RECALL    F1_SCORE  F1_QSCORE
SNP       NONE       0         16397     16382     2348      317       0.981017  0.874740  0.924835  11.239855
SNP       BEST       0         16397     16382     2348      317       0.981017  0.874740  0.924835  11.239855
INDEL     NONE       0         994       1063      132       238       0.817064  0.882771  0.848647  8.200102
INDEL     BEST       0         994       1063      132       238       0.817064  0.882771  0.848647  8.200102
SV        NONE       0         19        17        9         3         0.850000  0.678571  0.754673  6.102544
SV        BEST       0         19        17        9         3         0.850000  0.678571  0.754673  6.102544
ALL       NONE       0         17410     17462     2489      558       0.969034  0.874918  0.919575  10.946066
ALL       BEST       0         17410     17462     2489      558       0.969034  0.874918  0.919575  10.946066

Final panel overlap metrics:

NUM_INCONSISTENT_ALLELES  NUM_CONSISTENT_ALLELES  NUM_INCONSISTENT_SITES  NUM_CONSISTENT_SITES
0                         12832                   0                       4297

HG00733 leave-out:

[image]

Shapeit4 -> FixVariantCollisions -> final panel:

PanGeniePanelCreation: https://app.terra.bio/#workspaces/broad-firecloud-dsde/lrma-aou1-panel-creation-hprc-only/job_history/e68f9cdc-6b10-4604-b579-762d654cd1c5
VcfdistAndOverlapMetricsEvaluation (PanGenie panel only): https://app.terra.bio/#workspaces/broad-firecloud-dsde/lrma-aou1-panel-creation-hprc-only/job_history/cffb6772-082c-409e-b7d3-ca8f07b3c1f4
LeaveOutEvaluation: https://app.terra.bio/#workspaces/broad-firecloud-dsde/lrma-aou1-panel-creation-hprc-only/job_history/d94c4c29-0efe-46fb-996f-73b388576e5d

Overlap metrics after FixVariantCollisions (run manually and separately from the submissions just above, because we haven't yet WDLized the tool and inserted it into PhasedPanelEvaluation):

NUM_INCONSISTENT_ALLELES  NUM_CONSISTENT_ALLELES  NUM_INCONSISTENT_SITES  NUM_CONSISTENT_SITES
0                         11659                   0                       3865
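
A quick arithmetic check on these counts against the Shapeit4 overlap metrics earlier in this comment, sketched in Python. It assumes the consistent and inconsistent columns partition the counted multiallelic alleles and sites, which this thread does not state explicitly; the dictionary keys and the `totals` helper are hypothetical names.

```python
# Overlap metrics copied from the tables in this comment.
shapeit4 = {"inconsistent_alleles": 683, "consistent_alleles": 10976,
            "inconsistent_sites": 308, "consistent_sites": 3557}
after_fix = {"inconsistent_alleles": 0, "consistent_alleles": 11659,
             "inconsistent_sites": 0, "consistent_sites": 3865}

def totals(metrics):
    """Return (total alleles, total sites), assuming the columns partition the counts."""
    alleles = metrics["inconsistent_alleles"] + metrics["consistent_alleles"]
    sites = metrics["inconsistent_sites"] + metrics["consistent_sites"]
    return alleles, sites

print("Shapeit4:", totals(shapeit4))
print("After FixVariantCollisions:", totals(after_fix))
```

Under that assumption the totals match before and after (11659 alleles, 3865 sites), which would be consistent with FixVariantCollisions resolving the inconsistencies without removing any counted alleles or sites.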

Final panel HG002 Vcfdist:

VAR_TYPE  THRESHOLD  MIN_QUAL  TRUTH_TP  QUERY_TP  TRUTH_FN  QUERY_FP  PREC      RECALL    F1_SCORE  F1_QSCORE
SNP       NONE       0         16389     16363     2356      353       0.978882  0.874313  0.923648  11.171772
SNP       BEST       0         16389     16363     2356      353       0.978882  0.874313  0.923648  11.171772
INDEL     NONE       0         978       1047      148       237       0.815421  0.868561  0.841152  7.990194
INDEL     BEST       0         978       1047      148       237       0.815421  0.868561  0.841152  7.990194
SV        NONE       0         15        13        13        3         0.812500  0.535714  0.645695  4.506231
SV        BEST       0         15        13        13        3         0.812500  0.535714  0.645695  4.506231
ALL       NONE       0         17382     17423     2517      593       0.967085  0.873511  0.917919  10.857597
ALL       BEST       0         17382     17423     2517      593       0.967085  0.873511  0.917919  10.857597

Final panel overlap metrics:

NUM_INCONSISTENT_ALLELES  NUM_CONSISTENT_ALLELES  NUM_INCONSISTENT_SITES  NUM_CONSISTENT_SITES
0                         12762                   0                       4302

HG00733 leave-out:

[image]

So things actually look a bit worse with FixVariantCollisions in this initial run, especially the SV recall, which has dropped from 0.679 in the baseline final panel to 0.536. Note also that Vcfdist covers only non-TR regions, so it will be good to stratify and see what is happening there. I would hope that tweaking weights, etc. will resolve things. I also hope that the improvement from FixVariantCollisions will become more apparent for HPRC+AoU1, since the dropping of alleles by the PanGenie script will be more prevalent there.
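
To put the comparison in one place, here is a small sketch that recomputes HG002 recall per variant type for the two final panels from the TRUTH_TP and TRUTH_FN counts in the tables above. The dictionary layout and names are hypothetical; recall is taken as TRUTH_TP / (TRUTH_TP + TRUTH_FN), which matches the RECALL column to rounding.

```python
# (TRUTH_TP, TRUTH_FN) per variant type, copied from the final-panel HG002 tables.
baseline = {"SNP": (16397, 2348), "INDEL": (994, 132), "SV": (19, 9)}
with_fix = {"SNP": (16389, 2356), "INDEL": (978, 148), "SV": (15, 13)}

def recall(truth_tp, truth_fn):
    """Fraction of truth variants recovered."""
    return truth_tp / (truth_tp + truth_fn)

recall_delta = {}
for var_type in baseline:
    before = recall(*baseline[var_type])
    after = recall(*with_fix[var_type])
    recall_delta[var_type] = after - before
    print(f"{var_type:5s} {before:.6f} -> {after:.6f} ({recall_delta[var_type]:+.6f})")
```

The SV drop corresponds to 4 of the 28 truth SVs in the evaluated regions (19 vs. 15 recovered), so it rests on small counts.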

samuelklee commented 4 weeks ago

Also, note that we are still looking good w.r.t. PanGenie; e.g., HG00733 leave-out for the Shapeit4 -> FixVariantCollisions -> final panel:

[image]