broadinstitute / lrma-aou1-panel-creation

Pipelines and evaluations covering integration, phasing, and imputation of short and structural variants for the AoU Phase 1 long-reads callset.

Check remaining site drops in PanGenie panel-creation script. #42

Open samuelklee opened 2 months ago

samuelklee commented 2 months ago

See logs from:

- chr1:100-110Mbp run: gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/aca454c4-941b-4cfa-9dae-152b3e5d5829/PhasedPanelEvaluation/20fb92f7-27af-4ec8-b944-f1a285f2da1e/call-PanGeniePanelCreation/PanGeniePanelCreation/4fd17c7d-2dff-41e7-b6ae-b411c11077c1/call-PanGeniePanelCreation/cacheCopy/glob-c94d492e4d5a9e6759399733eb456839/merge-haplotypes.log
- chr6 run: gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/intermediates/dea0b0fd-fa39-400e-8a10-655be8c99eb2/PhasedPanelEvaluation/ac2cde59-d8ed-4b3e-9874-84292d37c01d/call-PanGeniePanelCreation/PanGeniePanelCreation/f0230889-f88e-4814-b45f-c20da82a23f7/call-PanGeniePanelCreation/glob-c94d492e4d5a9e6759399733eb456839/merge-haplotypes.log

The number of sites being dropped is now relatively small, so this is lower priority for now, but it would be good to understand why they are dropped. As discussed with @fabio-cunial earlier today, after he spot-checked some examples, these sites might not actually yield inconsistent haplotypes but might be dropped by the PanGenie script anyway. We can either modify the script or perhaps skip it altogether; the end goal is just to make sure KAGE panel creation succeeds.

samuelklee commented 2 months ago

Some comments from Fabio:

In file (chr6):

gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/intermediates/dea0b0fd-fa39-400e-8a10-655be8c99eb2/PhasedPanelEvaluation/ac2cde59-d8ed-4b3e-9874-84292d37c01d/call-PanGeniePanelCreation/PanGeniePanelCreation/f0230889-f88e-4814-b45f-c20da82a23f7/call-PanGeniePanelCreation/glob-c94d492e4d5a9e6759399733eb456839/merge-haplotypes.log

I see the following cases:
- There is just one call at the POS reported in the log file (?!).
- SNP and DEL with the same POS: I don't consider this to be a collision, since the VCF convention for DELs is to set POS to the position that immediately precedes the DEL.
- INS and DEL with the same POS: this is not a collision. I think it is actually a way to encode a replacement (POS for INS is the value such that the INS occurs between POS and POS+1).
- Several overlapping replacements, but checking the genotypes shows no collision in any sample.
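
The cases above can be turned into a rough check. The following is a minimal sketch, not the logic of the PanGenie script: it assumes biallelic records, phased genotypes given as one `(hap0, hap1)` pair per sample with `-1` for missing alleles, and it defines a "collision" as two ALT alleles on the same haplotype whose affected reference intervals overlap. The `Record` type and the function names are hypothetical.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Record(NamedTuple):
    """Hypothetical minimal view of a biallelic VCF record."""
    pos: int                              # 1-based VCF POS
    ref: str
    alt: str
    genotypes: Sequence[Tuple[int, int]]  # phased (hap0, hap1) per sample, -1 = missing


def affected_interval(rec: Record) -> Tuple[int, int]:
    """Half-open 1-based reference interval [start, end) changed by the ALT allele."""
    start, ref, alt = rec.pos, rec.ref, rec.alt
    # Strip the shared leading anchor base(s): for a DEL or INS, POS points at
    # the base immediately before the event, which is itself unchanged.
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        start += 1
    # An insertion leaves an empty interval located between two reference bases.
    return start, start + len(ref)


def intervals_overlap(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def colliding_samples(r1: Record, r2: Record) -> List[int]:
    """Indices of samples carrying both ALT alleles on the same haplotype."""
    if not intervals_overlap(affected_interval(r1), affected_interval(r2)):
        return []
    return [
        i
        for i, (g1, g2) in enumerate(zip(r1.genotypes, r2.genotypes))
        if any(a > 0 and b > 0 for a, b in zip(g1, g2))
    ]
```

Under these assumptions, a SNP and a DEL with the same POS, or an INS and a DEL with the same POS, never collide, and overlapping replacements collide only in samples where both ALT alleles sit on the same haplotype.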

In file (chr1):

gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/aca454c4-941b-4cfa-9dae-152b3e5d5829/PhasedPanelEvaluation/20fb92f7-27af-4ec8-b944-f1a285f2da1e/call-PanGeniePanelCreation/PanGeniePanelCreation/4fd17c7d-2dff-41e7-b6ae-b411c11077c1/call-PanGeniePanelCreation/cacheCopy/glob-c94d492e4d5a9e6759399733eb456839/merge-haplotypes.log

- I see that there are 7 distinct error events, not one (?!).
- The only case I see is the last one above: checking the genotypes shows no conflicting genotypes in any sample.

Note that in the second-to-last bullet, Fabio is referring to the fact that my naive CalculateOverlapMetrics script reported only one event. This is because the script only looks at sites considered multiallelic by bcftools norm, and hence gives only a lower bound.
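
To illustrate why a same-site count is only a lower bound, here is a small sketch. It is not the CalculateOverlapMetrics script; it assumes that "multiallelic" means records sharing the same POS, and it approximates each record's footprint by its REF span `[POS, POS + len(REF))`. Both function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

Site = Tuple[int, str]  # (1-based POS, REF allele)


def count_same_pos_sites(records: Iterable[Site]) -> int:
    """Number of positions shared by two or more records."""
    counts = Counter(pos for pos, _ref in records)
    return sum(1 for n in counts.values() if n > 1)


def count_overlap_clusters(records: Iterable[Site]) -> int:
    """Number of clusters of two or more records whose REF spans overlap."""
    clusters = 0
    size = 0
    cluster_end = None
    for pos, ref in sorted(records):
        end = pos + len(ref)
        if cluster_end is not None and pos < cluster_end:
            # Record starts inside the current cluster: extend it.
            size += 1
            cluster_end = max(cluster_end, end)
        else:
            # Record starts a new cluster: close the previous one.
            if size > 1:
                clusters += 1
            size = 1
            cluster_end = end
    if size > 1:
        clusters += 1
    return clusters


records = [(100, "ACGT"), (102, "G"), (200, "A"), (200, "A"), (300, "C")]
print(count_same_pos_sites(records), count_overlap_clusters(records))
```

Records that share a POS always overlap, so every same-POS site falls inside some overlap cluster, but records such as the pair at 100 and 102 overlap without sharing a POS and are missed by the same-site count.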