samuelklee closed 3 weeks ago
Note that Vcfdist metrics are calculated over non-TR/homopolymer regions, while overlap metrics are calculated over all regions but count only multiallelics.
Shapeit4 -> PanGenie script -> final panel (baseline, from #21; this goes through the default PanGenie panel-creation script, which just drops an entire site if there are any inconsistent haplotypes in any of the alleles):
PhasedPanelEvaluation (final panel + all stages): https://app.terra.bio/#workspaces/broad-firecloud-dsde/lrma-aou1-panel-creation-hprc-only/job_history/ba390f1d-82f2-403b-a00b-4c3fe795a5e3
LeaveOutEvaluation: https://app.terra.bio/#workspaces/broad-firecloud-dsde/lrma-aou1-panel-creation-hprc-only/job_history/27e954fd-d261-4bb8-a730-e0336268f376
Shapeit4 HG002 Vcfdist:
VAR_TYPE THRESHOLD MIN_QUAL TRUTH_TP QUERY_TP TRUTH_FN QUERY_FP PREC RECALL F1_SCORE F1_QSCORE
SNP NONE 0 16441 16424 2304 322 0.980772 0.877087 0.926036 11.309807
SNP BEST 0 16441 16424 2304 322 0.980772 0.877087 0.926036 11.309807
INDEL NONE 0 994 1067 132 238 0.817625 0.882771 0.848950 8.208787
INDEL BEST 0 994 1067 132 238 0.817625 0.882771 0.848950 8.208787
SV NONE 0 23 21 5 3 0.875000 0.821429 0.847368 8.163556
SV BEST 0 23 21 5 3 0.875000 0.821429 0.847368 8.163556
ALL NONE 0 17458 17512 2441 563 0.968852 0.877331 0.920823 11.013992
ALL BEST 0 17458 17512 2441 563 0.968852 0.877331 0.920823 11.013992
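As a sanity check on how to read these columns, the derived metrics can be recomputed from the raw counts. The sketch below assumes PREC = QUERY_TP / (QUERY_TP + QUERY_FP), RECALL = TRUTH_TP / (TRUTH_TP + TRUTH_FN), F1 as their harmonic mean, and F1_QSCORE = -10 * log10(1 - F1). These formulas are inferred from the numbers in the table rather than taken from the Vcfdist source, and the function name is hypothetical.

```python
import math

def vcfdist_summary(truth_tp, query_tp, truth_fn, query_fp):
    """Recompute PREC, RECALL, F1_SCORE and F1_QSCORE from raw counts.

    The formulas are assumptions inferred from the table values.
    """
    prec = query_tp / (query_tp + query_fp)
    recall = truth_tp / (truth_tp + truth_fn)
    f1 = 2 * prec * recall / (prec + recall)
    f1_qscore = -10 * math.log10(1 - f1)  # Phred-style scaling of 1 - F1
    return prec, recall, f1, f1_qscore

# SNP row of the Shapeit4 HG002 table: TRUTH_TP, QUERY_TP, TRUTH_FN, QUERY_FP
prec, recall, f1, f1_qscore = vcfdist_summary(16441, 16424, 2304, 322)
print(f"{prec:.6f} {recall:.6f} {f1:.6f} {f1_qscore:.6f}")
```

The printed values should agree with the PREC, RECALL, F1_SCORE and F1_QSCORE columns of the SNP row up to rounding.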
Shapeit4 overlap metrics:
NUM_INCONSISTENT_ALLELES NUM_CONSISTENT_ALLELES NUM_INCONSISTENT_SITES NUM_CONSISTENT_SITES
683 10976 308 3557
Final panel HG002 Vcfdist:
VAR_TYPE THRESHOLD MIN_QUAL TRUTH_TP QUERY_TP TRUTH_FN QUERY_FP PREC RECALL F1_SCORE F1_QSCORE
SNP NONE 0 16397 16382 2348 317 0.981017 0.874740 0.924835 11.239855
SNP BEST 0 16397 16382 2348 317 0.981017 0.874740 0.924835 11.239855
INDEL NONE 0 994 1063 132 238 0.817064 0.882771 0.848647 8.200102
INDEL BEST 0 994 1063 132 238 0.817064 0.882771 0.848647 8.200102
SV NONE 0 19 17 9 3 0.850000 0.678571 0.754673 6.102544
SV BEST 0 19 17 9 3 0.850000 0.678571 0.754673 6.102544
ALL NONE 0 17410 17462 2489 558 0.969034 0.874918 0.919575 10.946066
ALL BEST 0 17410 17462 2489 558 0.969034 0.874918 0.919575 10.946066
Final panel overlap metrics:
NUM_INCONSISTENT_ALLELES NUM_CONSISTENT_ALLELES NUM_INCONSISTENT_SITES NUM_CONSISTENT_SITES
0 12832 0 4297
HG00733 leave-out:
Shapeit4 -> FixVariantCollisions -> final panel:
PanGeniePanelCreation: https://app.terra.bio/#workspaces/broad-firecloud-dsde/lrma-aou1-panel-creation-hprc-only/job_history/e68f9cdc-6b10-4604-b579-762d654cd1c5
VcfdistAndOverlapMetricsEvaluation (PanGenie panel only): https://app.terra.bio/#workspaces/broad-firecloud-dsde/lrma-aou1-panel-creation-hprc-only/job_history/cffb6772-082c-409e-b7d3-ca8f07b3c1f4
LeaveOutEvaluation: https://app.terra.bio/#workspaces/broad-firecloud-dsde/lrma-aou1-panel-creation-hprc-only/job_history/d94c4c29-0efe-46fb-996f-73b388576e5d
After FixVariantCollisions overlap metrics (this was run manually and separately from the submissions just above, because we haven't yet WDLized the tool and inserted it into PhasedPanelEvaluation):
NUM_INCONSISTENT_ALLELES NUM_CONSISTENT_ALLELES NUM_INCONSISTENT_SITES NUM_CONSISTENT_SITES
0 11659 0 3865
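A quick arithmetic check on these counts against the Shapeit4 overlap metrics further up: the allele and site totals before and after FixVariantCollisions can be compared directly. The sketch below only uses numbers copied from the two tables; the dictionary keys and helper name are illustrative. If the totals match, that is consistent with (though not proof of) collisions being resolved without dropping multiallelic alleles or sites.

```python
# Overlap metrics copied from the tables in this comment.
shapeit4 = {
    "inconsistent_alleles": 683, "consistent_alleles": 10976,
    "inconsistent_sites": 308, "consistent_sites": 3557,
}
after_fix = {
    "inconsistent_alleles": 0, "consistent_alleles": 11659,
    "inconsistent_sites": 0, "consistent_sites": 3865,
}

def totals(metrics):
    """Return (total alleles, total sites) counted by the overlap metrics."""
    alleles = metrics["inconsistent_alleles"] + metrics["consistent_alleles"]
    sites = metrics["inconsistent_sites"] + metrics["consistent_sites"]
    return alleles, sites

alleles_before, sites_before = totals(shapeit4)
alleles_after, sites_after = totals(after_fix)

# Fraction of multiallelic sites that were inconsistent straight out of Shapeit4.
frac_inconsistent_sites = shapeit4["inconsistent_sites"] / sites_before

print(alleles_before, alleles_after)   # totals before vs. after
print(sites_before, sites_after)
print(f"{frac_inconsistent_sites:.4f}")
```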
Final panel HG002 Vcfdist:
VAR_TYPE THRESHOLD MIN_QUAL TRUTH_TP QUERY_TP TRUTH_FN QUERY_FP PREC RECALL F1_SCORE F1_QSCORE
SNP NONE 0 16389 16363 2356 353 0.978882 0.874313 0.923648 11.171772
SNP BEST 0 16389 16363 2356 353 0.978882 0.874313 0.923648 11.171772
INDEL NONE 0 978 1047 148 237 0.815421 0.868561 0.841152 7.990194
INDEL BEST 0 978 1047 148 237 0.815421 0.868561 0.841152 7.990194
SV NONE 0 15 13 13 3 0.812500 0.535714 0.645695 4.506231
SV BEST 0 15 13 13 3 0.812500 0.535714 0.645695 4.506231
ALL NONE 0 17382 17423 2517 593 0.967085 0.873511 0.917919 10.857597
ALL BEST 0 17382 17423 2517 593 0.967085 0.873511 0.917919 10.857597
Final panel overlap metrics:
NUM_INCONSISTENT_ALLELES NUM_CONSISTENT_ALLELES NUM_INCONSISTENT_SITES NUM_CONSISTENT_SITES
0 12762 0 4302
HG00733 leave-out:
So things actually look a bit worse with FixVariantCollisions in this initial run: for the final panel on HG002, overall F1 goes from 0.919575 to 0.917919. SV recall in particular has dropped a lot, from 0.678571 (19 of 28 truth SVs) to 0.535714 (15 of 28). Note also that Vcfdist is only evaluated over non-TRs, so it will be good to stratify and see what is happening there. But I would hope that tweaking weights, etc. will resolve things. And I hope that the improvement from FixVariantCollisions will become more apparent for HPRC+AoU1, since the dropping of alleles by the PanGenie script will be more prevalent there.
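To make the comparison above concrete, here is a small sketch that tabulates the per-type change in PREC, RECALL and F1_SCORE between the two final panels on HG002. The values are copied from the tables in this comment; the variable names are illustrative only.

```python
# (PREC, RECALL, F1_SCORE) for the final panel on HG002, copied from the tables.
baseline = {  # Shapeit4 -> PanGenie script -> final panel
    "SNP": (0.981017, 0.874740, 0.924835),
    "INDEL": (0.817064, 0.882771, 0.848647),
    "SV": (0.850000, 0.678571, 0.754673),
    "ALL": (0.969034, 0.874918, 0.919575),
}
with_fix = {  # Shapeit4 -> FixVariantCollisions -> final panel
    "SNP": (0.978882, 0.874313, 0.923648),
    "INDEL": (0.815421, 0.868561, 0.841152),
    "SV": (0.812500, 0.535714, 0.645695),
    "ALL": (0.967085, 0.873511, 0.917919),
}

# Per-type difference (with_fix minus baseline), rounded to the table precision.
deltas = {
    var_type: tuple(round(f - b, 6) for b, f in zip(baseline[var_type], with_fix[var_type]))
    for var_type in baseline
}

for var_type, (d_prec, d_recall, d_f1) in deltas.items():
    print(f"{var_type:<6}{d_prec:+.6f} {d_recall:+.6f} {d_f1:+.6f}")

# Variant type with the largest drop in recall.
worst_recall = min(deltas, key=lambda var_type: deltas[var_type][1])
print(worst_recall)
```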
Also, note that we are still looking good w.r.t. PanGenie; e.g., HG00733 leave-out for the Shapeit4 -> FixVariantCollisions -> final panel:
So we don’t have to wait on #27. This will be using a panel manually created by taking Shapeit4 output and running it through FixVariantCollisions with all calls given unit weight before continuing to the PanGenie panel-creation script as usual.