broadinstitute / lrma-aou1-panel-creation

Pipelines and evaluations covering integration, phasing, and imputation of short and structural variants for the AoU Phase 1 long-reads callset.

Consider making phase blocks more stringent, and add a method to test switch rate within phase blocks #13

Closed rlorigro closed 2 months ago

rlorigro commented 3 months ago

Using the parameter here: https://github.com/PacificBiosciences/HiPhase/blob/9b01d7eda42a56dc9d20d39aa8b6fe23b60f5ab1/src/cli.rs#L210

we can increase the threshold and get smaller, more confident phase blocks that we can send to Shapeit4

We would need to test how this affects the intra-phaseblock switch rate
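
On the second half of this issue (a method to test the switch rate within phase blocks), here is a minimal sketch of what such a check could look like, assuming we already have, for each phase block (e.g. keyed by the PS tag), the called and truth haplotype assignments at het sites shared with the truth set. All function names here are hypothetical and this is not an existing script in the repo; the data-loading step from the phased VCFs is left out.

```python
from typing import Dict, List, Tuple


def switch_errors_in_block(called: List[int], truth: List[int]) -> int:
    """Count switch errors within a single phase block.

    `called` and `truth` hold the haplotype (0 or 1) carrying the ALT allele
    at each shared het site, in coordinate order. A switch is counted wherever
    the agreement between callset and truth flips relative to the previous
    site (so an isolated flip error contributes two switches here).
    """
    switches = 0
    prev_agree = None
    for c, t in zip(called, truth):
        agree = c == t
        if prev_agree is not None and agree != prev_agree:
            switches += 1
        prev_agree = agree
    return switches


def intra_block_switch_rate(blocks: Dict[str, Tuple[List[int], List[int]]]) -> float:
    """Switch rate aggregated over all phase blocks.

    `blocks` maps a phase block ID to (called, truth) haplotype vectors.
    The denominator is the number of adjacent het-site pairs, i.e. the
    number of opportunities for a switch.
    """
    total_switches = 0
    total_pairs = 0
    for called, truth in blocks.values():
        if len(called) < 2:
            continue  # a block with a single het site cannot show a switch
        total_switches += switch_errors_in_block(called, truth)
        total_pairs += len(called) - 1
    return total_switches / total_pairs if total_pairs else 0.0
```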

hangsuUNC commented 2 months ago

Hi Ryan,

I'm not sure I understand the issue. Do you intend to increase the parameter "global-failure-count" to a larger number (>50) to see how it affects the physical phasing accuracy?

Best,

Hang

rlorigro commented 2 months ago

Sorry, somehow the line number was incorrect. This is the one I intended: https://github.com/PacificBiosciences/HiPhase/blob/9b01d7eda42a56dc9d20d39aa8b6fe23b60f5ab1/src/cli.rs#L150

rlorigro commented 2 months ago

1 read is a very loose criterion for joining variants into a phase block. I think we want something more conservative, so that there aren't as many switches upstream of Shapeit. Asking Shapeit to undo the errors is probably more complicated than fixing them upstream. Increasing the threshold will come at the cost of shorter phase blocks, but I think it should be OK for Shapeit to handle many smaller ones. Given our low coverage, 2 might be a good place to start.
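
As a rough illustration of the coverage tradeoff, here is a back-of-envelope sketch under a crude Poisson assumption, with made-up coverage values rather than anything measured from our data: requiring 2 spanning reads instead of 1 mainly costs us the ability to join het pairs where only a single read happens to span them, which is exactly where low coverage bites.

```python
from math import exp, factorial


def p_at_least_k_spanning(coverage: float, k: int) -> float:
    """P(at least k reads span a pair of nearby het sites), modelling the
    number of spanning reads as Poisson with mean equal to the local
    coverage. This ignores read length vs. site spacing, so it is an
    optimistic upper bound, meant only to show the shape of the tradeoff."""
    p_fewer = sum(exp(-coverage) * coverage**i / factorial(i) for i in range(k))
    return 1.0 - p_fewer


# Illustrative coverages only:
for cov in (4.0, 6.0, 8.0):
    print(
        f"{cov:.0f}x: P(>=1 spanning read) = {p_at_least_k_spanning(cov, 1):.3f}, "
        f"P(>=2) = {p_at_least_k_spanning(cov, 2):.3f}"
    )
```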

hangsuUNC commented 2 months ago

Sounds good to me! Will test it in WDL and get back to you later!

hangsuUNC commented 2 months ago

A brief update on the results. I added "--min-spanning-reads 2" as an extra argument to HiPhase and compared the result with the baseline on the main branch. As expected, the switch error rate decreased, but the number of phase blocks increased. Precision, recall, and F1 remain basically the same. Not sure why, but the F1 score for the main branch is ~0.89, which is lower than in the previous experiment. Here are the results:

baseline (filtered small variants and SVs):
PHASE_BLOCKS    SWITCH_ERRORS    FLIP_ERRORS    NG_50    SWITCH_NGC50    SWITCHFLIP_NGC50
41              20               1              0        0               0

Experiment with --min-spanning-reads 2:
PHASE_BLOCKS    SWITCH_ERRORS    FLIP_ERRORS    NG_50    SWITCH_NGC50    SWITCHFLIP_NGC50
59              12               0              0        0               0

The runs are here:
baseline: https://app.terra.bio/#workspaces/broad-firecloud-dsde/lrma-aou1-panel-creation-hprc-only/job_history/d5db385e-17e5-4010-992e-a4a5a5bfe5f0
experiment: https://app.terra.bio/#workspaces/broad-firecloud-dsde/lrma-aou1-panel-creation-hprc-only/job_history/999c7009-0bac-4803-9bc3-bbdb1d712523
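
For a quick per-block comparison of the switch numbers reported above (nothing beyond the values already in the tables):

```python
# Values copied from the tables above.
runs = {
    "baseline": {"phase_blocks": 41, "switch_errors": 20, "flip_errors": 1},
    "--min-spanning-reads 2": {"phase_blocks": 59, "switch_errors": 12, "flip_errors": 0},
}

for name, m in runs.items():
    per_block = m["switch_errors"] / m["phase_blocks"]
    print(f"{name}: {m['switch_errors']} switches / {m['phase_blocks']} blocks "
          f"= {per_block:.2f} switch errors per block")
# baseline:                20 / 41 ~ 0.49 switch errors per block
# --min-spanning-reads 2:  12 / 59 ~ 0.20 switch errors per block
```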

rlorigro commented 2 months ago

Can you remind me how big the test region is? Whole genome? I took a look at the Terra runs but I didn't see anything

Also is there any effect on the incompatible alleles?

hangsuUNC commented 2 months ago

> Can you remind me how big the test region is? Whole genome? I took a look at the Terra runs but I didn't see anything
>
> Also is there any effect on the incompatible alleles?

The region is chr1:100Mb-110Mb. Here are the consistency metrics:

baseline:
NUM_INCONSISTENT_ALLELES    NUM_CONSISTENT_ALLELES    NUM_INCONSISTENT_SITES    NUM_CONSISTENT_SITES
33                          11626                     14                        3851

experiment:
NUM_INCONSISTENT_ALLELES    NUM_CONSISTENT_ALLELES    NUM_INCONSISTENT_SITES    NUM_CONSISTENT_SITES
34                          11625                     14                        3851

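And the same numbers expressed as rates (just re-using the values above):

```python
# Values copied from the consistency tables above.
for name, inconsistent, consistent in (("baseline", 33, 11626), ("experiment", 34, 11625)):
    print(f"{name}: {inconsistent / (inconsistent + consistent):.3%} inconsistent alleles")
# baseline:   0.283% inconsistent alleles
# experiment: 0.292% inconsistent alleles
```
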
rlorigro commented 2 months ago

OK, thanks. Interesting that it didn't change much.