broadinstitute / lrma-aou1-panel-creation

Pipelines and evaluations covering integration, phasing, and imputation of short and structural variants for the AoU Phase 1 long-reads callset.

TODO for first draft of Phase 1 paper. #52

Open samuelklee opened 2 days ago

samuelklee commented 2 days ago

@hangsuUNC Run HiPhase on whole genome using current parameters with no filtering for all samples:

@samuelklee Experiment with reducing short-variant count for Shapeit4 while maintaining SV imputation performance (Iris is focusing on AF > 0.05 for eQTL analyses):

@hangsuUNC Statistical phasing:

Goal is for phasing to be done within the first week or so. Hopefully, HiPhase can be done in the next day or two, at the very least; Shapeit4 could stand some tinkering with cutting variant count, but at some point we should just bite the bullet, perhaps after running the sharded workflow on a single chromosome and confirming the cost. What do you think, @hangsuUNC?

@samuelklee Main figure (w/ chr6 first, WG when ready):

@samuelklee Generate inputs for eQTL:

@samuelklee Supplementary figures:

@kvg anything missing?

EDIT: Rather than leave-out-all-HPRC, let's do leave-out-40-HPRC---these are the 40 in TGP with readily available SR.

hangsuUNC commented 17 hours ago


  1. Single-sample whole-genome HiPhase test 1: cost: $2.27. Cost distribution: single_sample_hiphase cost
  2. Single-sample whole-genome HiPhase test 2, with VM size reduced to 2 CPUs and 8 GB: cost: $1.04. Cost distribution: single_sample_hiphase cost
samuelklee commented 16 hours ago

Thanks for the update! Two things:

1) Note that one flaw of the cost-estimation script is that it doesn’t actually give the real cost. It doesn’t account for preemption and cost rates for various resources are provided as fixed parameters. But hopefully we can roughly trust the distribution it reports.

2) If so, this suggests that with the per-sample bcftools view, at best we can hope for about a 50-50 split between subsetting and HiPhase. This means we will end up paying at least $500 just to subset the joint VCF! Surely a bcftools split strategy can beat that?
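
The arithmetic behind that estimate can be sketched as follows. Only the $1.04 per-sample figure and the roughly 50-50 split come from this thread; the sample count of 1000 is a hypothetical round number chosen for illustration, and the caveats about the cost-estimation script still apply.

```python
# Back-of-the-envelope estimate of what per-sample subsetting would cost.
# Only per_sample_cost and subset_fraction come from the thread;
# n_samples is a HYPOTHETICAL round number, not the real cohort size.

def subsetting_cost(per_sample_cost, subset_fraction, n_samples):
    """Total spend attributable to subsetting the joint VCF, in USD."""
    return per_sample_cost * subset_fraction * n_samples

per_sample_cost = 1.04   # USD, HiPhase test 2 (2 CPUs, 8 GB)
subset_fraction = 0.5    # "about a 50-50 split" between subsetting and HiPhase
n_samples = 1000         # hypothetical

total = subsetting_cost(per_sample_cost, subset_fraction, n_samples)
print(f"Estimated subsetting cost: ${total:.2f}")
```

With these inputs the subsetting share comes out to about $520, which is in line with the "at least $500" figure.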

samuelklee commented 15 hours ago

Will add notes on submissions for chr6 evaluations against dipcall truth here:

40 HPRC GATK-SV, w/o realign flags, 5 samples failing:

40 HPRC GATK-SV, w/ realign flags, 1 sample failing:

40 HPRC SR vs. full HPRC+AoU1, w/o realign flags:

40 HPRC SR vs. full HPRC+AoU1, w/ realign flags:

InputPhasedPanelEvaluation construction/evaluation of HPRC+AoU1-leave-out-40-HPRC:

40 HPRC LR HPRC+AoU1 filter+concat, w/ realign flags:

40 HPRC LR HPRC+AoU1 panel, w/ realign flags:

Note that the realign flags seem to make a minimal impact, at least from spot checking; I'll double-check with plots across all samples.

samuelklee commented 10 hours ago

Given maintained accuracy seen in, we can proceed with a sharded run with AF>=0.01 + 10kb SV windowing over all of chr6.

To determine more appropriate shards, we can use a run linked in that comment, for which we only ran 2 manually specified 10Mb Shapeit4 shards:

This run contains a filtered+windowed+FilterAndConcat (where FilterAndConcat refers to singleton filtering and short+SV concatenation) VCF over all of chr6, not just the 2 10Mb shards, which is the input to Shapeit4: gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/6e2c5ed2-a102-466e-9dbc-8eae7bae1021/PhasedPanelEvaluation/bf009a41-e06b-446e-a62b-03373f65dc8c/call-FilterAndConcatVcfs/HPRCAOU.chr6.filter_and_concat.vcf.gz

We can use the GLIMPSE1 chunk tool on this VCF to generate better shards than our naive 10Mb shards:

./GLIMPSE_chunk_static -I HPRCAOU.chr6.filter_and_concat.vcf.gz --region chr6 --window-size 10000000 --buffer-size 500000 -O chr6.chunks.tsv

[GLIMPSE] Split chromosomes into chunks
  * Author        : Simone RUBINACCI & Olivier DELANEAU, University of Lausanne
  * Contact       : &
  * Version       : 1.1.1
  * Run date      : 03/10/2024 - 20:59:14

  * Input VCF      : [HPRCAOU.chr6.filter_and_concat.vcf.gz]
  * Chromosome     : [chr6]
  * Output file    : [chr6.chunks.tsv]

  * Seed             : 15052011
  * #Threads   : 1
  * Min. Window size : 10000000bp / 1000 variants
  * Min. Buffer size : 500000bp / 100 variants

Reading input files
  * Main      : [HPRCAOU.chr6.filter_and_concat.vcf.gz]
  * #variants = 1009520 (385.52s)

Splitting data into chunks and writting to [chr6.chunks.tsv]
  * Internal window [chr6:72241-170744989] / L=170672749bp / C=1009520
  * Internal window [chr6:72241-78630557] / L=78558317bp / C=504761
  * Internal window [chr6:72241-32641877] / L=32569637bp / C=252381
  * Terminal window [0] -buffer:[chr6:72241-19099133] / +buffer:[chr6:72241-19599242] / L=19026893bp / C=126191
  * Terminal window [1] -buffer:[chr6:19099170-32641877] / +buffer:[chr6:18577569-33141881] / L=13542708bp / C=126190
  * Internal window [chr6:32641878-78630557] / L=45988680bp / C=252380
  * Internal window [chr6:32641878-54941581] / L=22299704bp / C=126191
  * Terminal window [2] -buffer:[chr6:32641878-42755864] / +buffer:[chr6:32070923-43280008] / L=10113987bp / C=63096
  * Terminal window [3] -buffer:[chr6:42755911-54941581] / +buffer:[chr6:42255897-55441622] / L=12185671bp / C=63095
  * Internal window [chr6:54941691-78630557] / L=23688867bp / C=126189
  * Terminal window [4] -buffer:[chr6:54941691-67331114] / +buffer:[chr6:54441197-67831170] / L=12389424bp / C=63095
  * Terminal window [5] -buffer:[chr6:67331412-78630557] / +buffer:[chr6:66796360-79162564] / L=11299146bp / C=63094
  * Internal window [chr6:78630635-170744989] / L=92114355bp / C=504759
  * Internal window [chr6:78630635-132352709] / L=53722075bp / C=252380
  * Internal window [chr6:78630635-104071527] / L=25440893bp / C=126191
  * Terminal window [6] -buffer:[chr6:78630635-91152068] / +buffer:[chr6:78130532-91652103] / L=12521434bp / C=63096
  * Terminal window [7] -buffer:[chr6:91152292-104071527] / +buffer:[chr6:90652274-104577123] / L=12919236bp / C=63095
  * Internal window [chr6:104071528-132352709] / L=28281182bp / C=126189
  * Terminal window [8] -buffer:[chr6:104071528-118372982] / +buffer:[chr6:103566858-118885012] / L=14301455bp / C=63095
  * Terminal window [9] -buffer:[chr6:118373301-132352709] / +buffer:[chr6:117873097-132988895] / L=13979409bp / C=63094
  * Internal window [chr6:132352735-170744989] / L=38392255bp / C=252379
  * Internal window [chr6:132352735-157241583] / L=24888849bp / C=126190
  * Terminal window [10] -buffer:[chr6:132352735-146564675] / +buffer:[chr6:131821158-147064995] / L=14211941bp / C=63098
  * Terminal window [11] -buffer:[chr6:146564679-157241583] / +buffer:[chr6:146055131-157741653] / L=10676905bp / C=63092
  * Terminal window [12] -buffer:[chr6:157241819-170744989] / +buffer:[chr6:156680356-170744989] / L=13503171bp / C=126189
  * #chunks = 13

Total running time = 385 seconds
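
For comparing shards, the "Terminal window" lines of the log above can be tabulated with a small parser. This is a minimal sketch: the regular expression is written against the log format as pasted here (GLIMPSE 1.1.1) and is an assumption about that format rather than a documented interface; the function and field names are made up for illustration.

```python
import re

# Pattern for lines such as:
#   * Terminal window [1] -buffer:[chr6:19099170-32641877] /
#     +buffer:[chr6:18577569-33141881] / L=13542708bp / C=126190
# Written against the log text pasted in this thread, not a documented format.
TERMINAL = re.compile(
    r"Terminal window \[(\d+)\] -buffer:\[(\S+?):(\d+)-(\d+)\] / "
    r"\+buffer:\[(\S+?):(\d+)-(\d+)\] / L=(\d+)bp / C=(\d+)"
)

def parse_terminal_windows(log_text):
    """Return one dict per terminal window (i.e. per output chunk)."""
    chunks = []
    for m in TERMINAL.finditer(log_text):
        chunks.append({
            "index": int(m.group(1)),
            "region": f"{m.group(2)}:{m.group(3)}-{m.group(4)}",
            "buffered_region": f"{m.group(5)}:{m.group(6)}-{m.group(7)}",
            "length_bp": int(m.group(8)),
            "n_variants": int(m.group(9)),
        })
    return chunks

# First three terminal windows, copied from the 10Mb run above.
sample_log = """
  * Terminal window [0] -buffer:[chr6:72241-19099133] / +buffer:[chr6:72241-19599242] / L=19026893bp / C=126191
  * Terminal window [1] -buffer:[chr6:19099170-32641877] / +buffer:[chr6:18577569-33141881] / L=13542708bp / C=126190
  * Terminal window [2] -buffer:[chr6:32641878-42755864] / +buffer:[chr6:32070923-43280008] / L=10113987bp / C=63096
"""

chunks = parse_terminal_windows(sample_log)
for c in chunks:
    print(c["index"], c["buffered_region"], c["length_bp"], c["n_variants"])
```

In the full log, the C values of the 13 terminal windows sum to 1009520, matching the reported #variants, so C appears to count variants in the unbuffered region only; a buffered shard will contain somewhat more.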

Cutting the regions from the resulting file yields:

cut -f 3 chr6.chunks.tsv


Copied to gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/scratch/slee/chr6.chunks.region.tsv.
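
For putting this step into the workflow, the same extraction could be done in Python. This is a sketch only: it assumes the GLIMPSE1 chunk file is tab-separated with the columns chunk index, chromosome, buffered region, unbuffered region, and the two sample rows are reconstructed from the log above for illustration rather than copied from the actual chr6.chunks.tsv.

```python
import csv
import io

def buffered_regions(tsv_text, column=2):
    """Return one column of a tab-separated chunk file.

    column=2 is the zero-based equivalent of `cut -f 3`.
    The column layout is an ASSUMPTION about the GLIMPSE1 chunk format.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [row[column] for row in reader if row]

# Illustrative rows rebuilt from the log (terminal windows 0 and 1),
# not the real file contents.
sample = (
    "0\tchr6\tchr6:72241-19599242\tchr6:72241-19099133\n"
    "1\tchr6\tchr6:18577569-33141881\tchr6:19099170-32641877\n"
)

for region in buffered_regions(sample):
    print(region)
```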

Did this on a VM manually; we should put it into the workflow before kicking off WG. It probably doesn't make much of a difference, since our panel is likely plenty dense, but it's the sort of thing you're supposed to do, so we might as well.

Kicked off the PhasedPanelEvaluationFromHiPhase workflow leaving out 40 HPRC samples here:

If the numbers look good, we should run InputPhasedPanelEvaluation using the Shapeit4 result (which is on the full panel) here to generate the full KAGE+GLIMPSE panel, rather than the leave-out. Then we can reimpute MAGE and get feedback on whether this reduced chr6 panel with fewer short variants is acceptable for eQTLs. If so, then we can proceed to WG. Alternatively, if the costs start looking more reasonable without HiPhase, then we can go ahead with the unfiltered/windowed panel.

UPDATE: The second shard consistently fails with this scheme, even going up to a very underutilized 96GB (see and). This shard has the most variants at ~140k, although a couple of others have ~120-130k and succeed. Others have ~60k.

Bumping down the min shard size to 5Mb yields:

./GLIMPSE_chunk_static -I HPRCAOU.chr6.filter_and_concat.vcf.gz --region chr6 --window-size 5000000 --buffer-size 500000 -O chr6.chunks.tsv --thread 4

[GLIMPSE] Split chromosomes into chunks
  * Author        : Simone RUBINACCI & Olivier DELANEAU, University of Lausanne
  * Contact       : &
  * Version       : 1.1.1
  * Run date      : 04/10/2024 - 02:32:49

  * Input VCF      : [HPRCAOU.chr6.filter_and_concat.vcf.gz]
  * Chromosome     : [chr6]
  * Output file    : [chr6.chunks.tsv]

  * Seed             : 15052011
  * #Threads   : 4
  * Min. Window size : 5000000bp / 1000 variants
  * Min. Buffer size : 500000bp / 100 variants

Reading input files
  * Main      : [HPRCAOU.chr6.filter_and_concat.vcf.gz]
  * #variants = 1009520 (323.05s)

Splitting data into chunks and writting to [chr6.chunks.tsv]
  * Internal window [chr6:72241-170744989] / L=170672749bp / C=1009520
  * Internal window [chr6:72241-78630557] / L=78558317bp / C=504761
  * Internal window [chr6:72241-32641877] / L=32569637bp / C=252381
  * Internal window [chr6:72241-19099133] / L=19026893bp / C=126191
  * Terminal window [0] -buffer:[chr6:72241-8411080] / +buffer:[chr6:72241-8911239] / L=8338840bp / C=63096
  * Internal window [chr6:8411530-19099133] / L=10687604bp / C=63095
  * Terminal window [1] -buffer:[chr6:8411530-13529316] / +buffer:[chr6:7911474-14029318] / L=5117787bp / C=31548
  * Terminal window [2] -buffer:[chr6:13529476-19099133] / +buffer:[chr6:13026927-19599242] / L=5569658bp / C=31547
  * Terminal window [3] -buffer:[chr6:19099170-32641877] / +buffer:[chr6:18577569-33141881] / L=13542708bp / C=126190
  * Internal window [chr6:32641878-78630557] / L=45988680bp / C=252380
  * Internal window [chr6:32641878-54941581] / L=22299704bp / C=126191
  * Terminal window [4] -buffer:[chr6:32641878-42755864] / +buffer:[chr6:32070923-43280008] / L=10113987bp / C=63096
  * Internal window [chr6:42755911-54941581] / L=12185671bp / C=63095
  * Terminal window [5] -buffer:[chr6:42755911-48786404] / +buffer:[chr6:42255897-49306975] / L=6030494bp / C=31548
  * Terminal window [6] -buffer:[chr6:48786493-54941581] / +buffer:[chr6:48259536-55441622] / L=6155089bp / C=31547
  * Internal window [chr6:54941691-78630557] / L=23688867bp / C=126189
  * Internal window [chr6:54941691-67331114] / L=12389424bp / C=63095
  * Terminal window [7] -buffer:[chr6:54941691-61954841] / +buffer:[chr6:54441197-62483380] / L=7013151bp / C=31551
  * Terminal window [8] -buffer:[chr6:61954846-67331114] / +buffer:[chr6:61454804-67831170] / L=5376269bp / C=31544
  * Internal window [chr6:67331412-78630557] / L=11299146bp / C=63094
  * Terminal window [9] -buffer:[chr6:67331412-73228807] / +buffer:[chr6:66796360-73751975] / L=5897396bp / C=31548
  * Terminal window [10] -buffer:[chr6:73229192-78630557] / +buffer:[chr6:72728849-79162564] / L=5401366bp / C=31546
  * Internal window [chr6:78630635-170744989] / L=92114355bp / C=504759
  * Internal window [chr6:78630635-132352709] / L=53722075bp / C=252380
  * Internal window [chr6:78630635-104071527] / L=25440893bp / C=126191
  * Internal window [chr6:78630635-91152068] / L=12521434bp / C=63096
  * Terminal window [11] -buffer:[chr6:78630635-84669766] / +buffer:[chr6:78130532-85169973] / L=6039132bp / C=31550
  * Terminal window [12] -buffer:[chr6:84669877-91152068] / +buffer:[chr6:84168043-91652103] / L=6482192bp / C=31546
  * Internal window [chr6:91152292-104071527] / L=12919236bp / C=63095
  * Terminal window [13] -buffer:[chr6:91152292-97518137] / +buffer:[chr6:90652274-98065574] / L=6365846bp / C=31548
  * Terminal window [14] -buffer:[chr6:97518139-104071527] / +buffer:[chr6:97018004-104577123] / L=6553389bp / C=31547
  * Internal window [chr6:104071528-132352709] / L=28281182bp / C=126189
  * Internal window [chr6:104071528-118372982] / L=14301455bp / C=63095
  * Terminal window [15] -buffer:[chr6:104071528-110454782] / +buffer:[chr6:103566858-110954891] / L=6383255bp / C=31548
  * Terminal window [16] -buffer:[chr6:110454783-118372982] / +buffer:[chr6:109925816-118885012] / L=7918200bp / C=31547
  * Internal window [chr6:118373301-132352709] / L=13979409bp / C=63094
  * Terminal window [17] -buffer:[chr6:118373301-125099407] / +buffer:[chr6:117873097-125599569] / L=6726107bp / C=31548
  * Terminal window [18] -buffer:[chr6:125099514-132352709] / +buffer:[chr6:124599490-132988895] / L=7253196bp / C=31546
  * Internal window [chr6:132352735-170744989] / L=38392255bp / C=252379
  * Internal window [chr6:132352735-157241583] / L=24888849bp / C=126190
  * Internal window [chr6:132352735-146564675] / L=14211941bp / C=63098
  * Terminal window [19] -buffer:[chr6:132352735-138960888] / +buffer:[chr6:131821158-139482591] / L=6608154bp / C=31550
  * Terminal window [20] -buffer:[chr6:138960971-146564675] / +buffer:[chr6:138460884-147064995] / L=7603705bp / C=31548
  * Internal window [chr6:146564679-157241583] / L=10676905bp / C=63092
  * Terminal window [21] -buffer:[chr6:146564679-151991440] / +buffer:[chr6:146055131-152491564] / L=5426762bp / C=31547
  * Terminal window [22] -buffer:[chr6:151991503-157241583] / +buffer:[chr6:151491219-157741653] / L=5250081bp / C=31545
  * Terminal window [23] -buffer:[chr6:157241819-170744989] / +buffer:[chr6:156680356-170744989] / L=13503171bp / C=126189
  * #chunks = 24

Total running time = 323 seconds


UPDATE: Well, that didn't help, since variant count didn't drop in that failing shard---most likely this is HLA. Just went back to the original 13 shards and cranked up to 128GB. It might be a good idea to have a strategy for tuning runtimes to a particular sharding scheme at some point, if OOM retry continues to be so unreliable.
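
One possible starting point for such a strategy is to size memory per shard from its variant count instead of using one value for all shards. The sketch below is hypothetical throughout: the function, the linear scaling, and the 64 GB per 100k variants rate are placeholders, not measurements. As the failing shard shows, variant count alone does not predict memory use (96GB was not enough at ~140k variants while ~120-130k shards succeeded), so regions like HLA would still need a manual override.

```python
import math

def memory_gb_for_shard(n_variants, gb_per_100k_variants, floor_gb=16, step_gb=16):
    """Hypothetical heuristic: scale memory linearly with variant count,
    round up to the next multiple of step_gb, and never go below floor_gb."""
    raw = n_variants / 100_000 * gb_per_100k_variants
    return max(floor_gb, math.ceil(raw / step_gb) * step_gb)

# Placeholder rate; it would have to be calibrated from completed shards.
RATE = 64

for n in (60_000, 126_000, 140_000):
    print(n, memory_gb_for_shard(n, RATE), "GB")
```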

samuelklee commented 9 hours ago

Once these runs complete, I think we have the following cohorts for comparison on 40 HPRC samples, which should provide a basis for plots of summary statistics and accuracy vs. dipcall: