broadinstitute / lrma-aou1-panel-creation

Pipelines and evaluations covering integration, phasing, and imputation of short and structural variants for the AoU Phase 1 long-reads callset.
1 stars 0 forks source link

TODO for first draft of Phase 1 paper. #52

Open samuelklee opened 2 days ago

samuelklee commented 2 days ago

@hangsuUNC Run HiPhase on whole genome using current parameters with no filtering for all samples:

@samuelklee Experiment with reducing short-variant count for Shapeit4 while maintaining SV imputation performance (Iris is focusing on AF > 0.05 for eQTL analyses):

@hangsuUNC Statistical phasing:

Goal is for phasing to be complete done within the first week or so. Hopefully, HiPhase can be done in the next day or two, at the very least; Shapeit4 could stand some tinkering with cutting variant count, but at some point we should just bite the bullet. Perhaps after running the sharded workflow on a single chromosome and confirming the cost. What do you think, @hangsuUNC?

@samuelklee Main figure (w/ chr6 first, WG when ready):

@samuelklee Generate inputs for eQTL:

@samuelklee Supplementary figures:

@kvg anything missing?

EDIT: Rather than leave-out-all-HPRC, let's do leave-out-40-HPRC---these are the 40 in TGP with readily available SR.

hangsuUNC commented 17 hours ago

Updates:

  1. Single-sample whole-genome Hiphase test 1:https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/3c4ce29e-e758-4d87-a4d6-e13e66d762ac cost: $2.27. cost distribution: single_sample_hiphase cost
  2. Single-sample whole-genome Hiphase test2, reduce VM size to 2cpu, 8 GB: https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/0e6d1552-3d39-4c3b-b638-49c33ef735b2 cost: $1.04. Cost distribution: single_sample_hiphase cost
samuelklee commented 16 hours ago

Thanks for the update! Two things:

1) Note that one flaw of the cost-estimation script is that it doesn’t actually give the real cost. It doesn’t account for preemption and cost rates for various resources are provided as fixed parameters. But hopefully we can roughly trust the distribution it reports.

2) If so, this suggests that with the per-sample bcftools view, at best we can hope for about a 50-50 split between subsetting and HiPhase. This means we will end up paying at least $500 just to subset the joint VCF! Surely a bcftools split strategy can beat that?

samuelklee commented 15 hours ago

Will add notes on submissions for chr6 evaluations against dipcall truth here:

40 HPRC GATK-SV, w/o realign flags, 5 samples failing: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/16dd0d83-52da-4e48-b81e-fe51a87a90f7 40 HPRC GATK-SV, w/ realign flags, 1 sample failing: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/5aa1a8f0-8235-4e39-b11f-85266bc02bde

40 HPRC SR vs. full HPRC+AoU1, w/o realign flags: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/afd5183c-cc35-4949-a5bc-6f4f61b9e585 40 HPRC SR vs. full HPRC+AoU1, w/ realign flags: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/c642a3ca-afeb-40b8-a030-f46464867a3d

InputPhasedPanelEvaluation construction/evaluation of HPRC+AoU1-leave-out-40-HPRC: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/eb8b7b5c-4962-4b4f-a7d4-42ec72dd0cbe

40 HPRC LR HPRC+AoU1 filter+concat, w/ realign flags: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/128c0302-7110-43f7-93e5-95633875ea0d 40 HPRC LR HPRC+AoU1 panel, w/ realign flags: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/ef9cf97c-6342-42e6-a6e7-4772b9882e39

Note that the realign flags seem to make a minimal impact, at least from spot checking; I'll double-check with plots across all samples.

samuelklee commented 10 hours ago

Given maintained accuracy seen in https://github.com/broadinstitute/lrma-aou1-panel-creation/issues/38#issuecomment-239236454, we can proceed with a sharded run with AF>=0.01 + 10kb SV windowing over all of chr6.

To determine more appropriate shards, we can use a run linked in that comment, for which we only ran 2 manually specified 10Mb Shapeit4 shards: https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/6e2c5ed2-a102-466e-9dbc-8eae7bae1021

This run contains an filtered+windowed+FilterAndConcat (where FilterAndConcat refers to singleton filtering and short+SV concatenation) VCF over all of chr6, not just the 2 10Mb shards, which is the input to Shapeit4: gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/6e2c5ed2-a102-466e-9dbc-8eae7bae1021/PhasedPanelEvaluation/bf009a41-e06b-446e-a62b-03373f65dc8c/call-FilterAndConcatVcfs/HPRCAOU.chr6.filter_and_concat.vcf.gz

We can use the GLIMPSE1 chunk tool on this VCF to generate better shards than our naive 10Mb shards:

./GLIMPSE_chunk_static -I HPRCAOU.chr6.filter_and_concat.vcf.gz --region chr6 --window-size 10000000 --buffer-size 500000 -O chr6.chunks.tsv

[GLIMPSE] Split chromosomes into chunks
  * Author        : Simone RUBINACCI & Olivier DELANEAU, University of Lausanne
  * Contact       : simone.rubinacci@unil.ch & olivier.delaneau@unil.ch
  * Version       : 1.1.1
  * Run date      : 03/10/2024 - 20:59:14

Files:
  * Input VCF      : [HPRCAOU.chr6.filter_and_concat.vcf.gz]
  * Chromosome     : [chr6]
  * Output file    : [chr6.chunks.tsv]

Parameters:
  * Seed             : 15052011
  * #Threads   : 1
  * Min. Window size : 10000000bp / 1000 variants
  * Min. Buffer size : 500000bp / 100 variants

Reading input files
  * Main      : [HPRCAOU.chr6.filter_and_concat.vcf.gz]
  * #variants = 1009520 (385.52s)

Splitting data into chunks and writting to [chr6.chunks.tsv]
  * Internal window [chr6:72241-170744989] / L=170672749bp / C=1009520
  * Internal window [chr6:72241-78630557] / L=78558317bp / C=504761
  * Internal window [chr6:72241-32641877] / L=32569637bp / C=252381
  * Terminal window [0] -buffer:[chr6:72241-19099133] / +buffer:[chr6:72241-19599242] / L=19026893bp / C=126191
  * Terminal window [1] -buffer:[chr6:19099170-32641877] / +buffer:[chr6:18577569-33141881] / L=13542708bp / C=126190
  * Internal window [chr6:32641878-78630557] / L=45988680bp / C=252380
  * Internal window [chr6:32641878-54941581] / L=22299704bp / C=126191
  * Terminal window [2] -buffer:[chr6:32641878-42755864] / +buffer:[chr6:32070923-43280008] / L=10113987bp / C=63096
  * Terminal window [3] -buffer:[chr6:42755911-54941581] / +buffer:[chr6:42255897-55441622] / L=12185671bp / C=63095
  * Internal window [chr6:54941691-78630557] / L=23688867bp / C=126189
  * Terminal window [4] -buffer:[chr6:54941691-67331114] / +buffer:[chr6:54441197-67831170] / L=12389424bp / C=63095
  * Terminal window [5] -buffer:[chr6:67331412-78630557] / +buffer:[chr6:66796360-79162564] / L=11299146bp / C=63094
  * Internal window [chr6:78630635-170744989] / L=92114355bp / C=504759
  * Internal window [chr6:78630635-132352709] / L=53722075bp / C=252380
  * Internal window [chr6:78630635-104071527] / L=25440893bp / C=126191
  * Terminal window [6] -buffer:[chr6:78630635-91152068] / +buffer:[chr6:78130532-91652103] / L=12521434bp / C=63096
  * Terminal window [7] -buffer:[chr6:91152292-104071527] / +buffer:[chr6:90652274-104577123] / L=12919236bp / C=63095
  * Internal window [chr6:104071528-132352709] / L=28281182bp / C=126189
  * Terminal window [8] -buffer:[chr6:104071528-118372982] / +buffer:[chr6:103566858-118885012] / L=14301455bp / C=63095
  * Terminal window [9] -buffer:[chr6:118373301-132352709] / +buffer:[chr6:117873097-132988895] / L=13979409bp / C=63094
  * Internal window [chr6:132352735-170744989] / L=38392255bp / C=252379
  * Internal window [chr6:132352735-157241583] / L=24888849bp / C=126190
  * Terminal window [10] -buffer:[chr6:132352735-146564675] / +buffer:[chr6:131821158-147064995] / L=14211941bp / C=63098
  * Terminal window [11] -buffer:[chr6:146564679-157241583] / +buffer:[chr6:146055131-157741653] / L=10676905bp / C=63092
  * Terminal window [12] -buffer:[chr6:157241819-170744989] / +buffer:[chr6:156680356-170744989] / L=13503171bp / C=126189
  * #chunks = 13

Total running time = 385 seconds

Cutting the regions from the resulting file yields:

cut -f 3 chr6.chunks.tsv

chr6:72241-19599242
chr6:18577569-33141881
chr6:32070923-43280008
chr6:42255897-55441622
chr6:54441197-67831170
chr6:66796360-79162564
chr6:78130532-91652103
chr6:90652274-104577123
chr6:103566858-118885012
chr6:117873097-132988895
chr6:131821158-147064995
chr6:146055131-157741653
chr6:156680356-170744989

Copied to gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/scratch/slee/chr6.chunks.region.tsv.

Did this on a VM manually, we should put it into the workflow before kicking off WG. Probably doesn't make much of a difference since our panel is probably plenty dense, but it's the sort of thing you're supposed to do, so we might as well.

Kicked off the PhasedPanelEvaluationFromHiPhase workflow leaving out 40 HPRC samples here: https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/913b51d5-df0b-4919-a6cb-6db485d6c1b9

If the numbers look good, we should run InputPhasedPanelEvaluation using the Shapeit4 result (which is on the full panel) here to generate the full KAGE+GLIMPSE panel, rather than the leave-out. Then we can reimpute MAGE and get feedback on whether this reduced chr6 panel with fewer short variants is acceptable for eQTLs. If so, then we can proceed to WG. Alternatively, if the costs start looking more reasonable without HiPhase, then we can go ahead with the unfiltered/windowed panel.

UPDATE: The second shard consistently fails with this scheme, even going up to a very underutilized 96GB (see https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/080bbe5c-ef87-4fd7-955c-62bd1ec3ba21 and https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/bd097085-f637-4579-a787-5720d27ae440). This shard has the most variants at ~140k, although a couple of others have ~120-130k and succeed. Others have ~60k.

Bumping down the min shard size to 5Mb yields:

./GLIMPSE_chunk_static -I HPRCAOU.chr6.filter_and_concat.vcf.gz --region chr6 --window-size 5000000 --buffer-size 500000 -O chr6.chunks.tsv --thread 4

[GLIMPSE] Split chromosomes into chunks
  * Author        : Simone RUBINACCI & Olivier DELANEAU, University of Lausanne
  * Contact       : simone.rubinacci@unil.ch & olivier.delaneau@unil.ch
  * Version       : 1.1.1
  * Run date      : 04/10/2024 - 02:32:49

Files:
  * Input VCF      : [HPRCAOU.chr6.filter_and_concat.vcf.gz]
  * Chromosome     : [chr6]
  * Output file    : [chr6.chunks.tsv]

Parameters:
  * Seed             : 15052011
  * #Threads   : 4
  * Min. Window size : 5000000bp / 1000 variants
  * Min. Buffer size : 500000bp / 100 variants

Reading input files
  * Main      : [HPRCAOU.chr6.filter_and_concat.vcf.gz]
  * #variants = 1009520 (323.05s)

Splitting data into chunks and writting to [chr6.chunks.tsv]
  * Internal window [chr6:72241-170744989] / L=170672749bp / C=1009520
  * Internal window [chr6:72241-78630557] / L=78558317bp / C=504761
  * Internal window [chr6:72241-32641877] / L=32569637bp / C=252381
  * Internal window [chr6:72241-19099133] / L=19026893bp / C=126191
  * Terminal window [0] -buffer:[chr6:72241-8411080] / +buffer:[chr6:72241-8911239] / L=8338840bp / C=63096
  * Internal window [chr6:8411530-19099133] / L=10687604bp / C=63095
  * Terminal window [1] -buffer:[chr6:8411530-13529316] / +buffer:[chr6:7911474-14029318] / L=5117787bp / C=31548
  * Terminal window [2] -buffer:[chr6:13529476-19099133] / +buffer:[chr6:13026927-19599242] / L=5569658bp / C=31547
  * Terminal window [3] -buffer:[chr6:19099170-32641877] / +buffer:[chr6:18577569-33141881] / L=13542708bp / C=126190
  * Internal window [chr6:32641878-78630557] / L=45988680bp / C=252380
  * Internal window [chr6:32641878-54941581] / L=22299704bp / C=126191
  * Terminal window [4] -buffer:[chr6:32641878-42755864] / +buffer:[chr6:32070923-43280008] / L=10113987bp / C=63096
  * Internal window [chr6:42755911-54941581] / L=12185671bp / C=63095
  * Terminal window [5] -buffer:[chr6:42755911-48786404] / +buffer:[chr6:42255897-49306975] / L=6030494bp / C=31548
  * Terminal window [6] -buffer:[chr6:48786493-54941581] / +buffer:[chr6:48259536-55441622] / L=6155089bp / C=31547
  * Internal window [chr6:54941691-78630557] / L=23688867bp / C=126189
  * Internal window [chr6:54941691-67331114] / L=12389424bp / C=63095
  * Terminal window [7] -buffer:[chr6:54941691-61954841] / +buffer:[chr6:54441197-62483380] / L=7013151bp / C=31551
  * Terminal window [8] -buffer:[chr6:61954846-67331114] / +buffer:[chr6:61454804-67831170] / L=5376269bp / C=31544
  * Internal window [chr6:67331412-78630557] / L=11299146bp / C=63094
  * Terminal window [9] -buffer:[chr6:67331412-73228807] / +buffer:[chr6:66796360-73751975] / L=5897396bp / C=31548
  * Terminal window [10] -buffer:[chr6:73229192-78630557] / +buffer:[chr6:72728849-79162564] / L=5401366bp / C=31546
  * Internal window [chr6:78630635-170744989] / L=92114355bp / C=504759
  * Internal window [chr6:78630635-132352709] / L=53722075bp / C=252380
  * Internal window [chr6:78630635-104071527] / L=25440893bp / C=126191
  * Internal window [chr6:78630635-91152068] / L=12521434bp / C=63096
  * Terminal window [11] -buffer:[chr6:78630635-84669766] / +buffer:[chr6:78130532-85169973] / L=6039132bp / C=31550
  * Terminal window [12] -buffer:[chr6:84669877-91152068] / +buffer:[chr6:84168043-91652103] / L=6482192bp / C=31546
  * Internal window [chr6:91152292-104071527] / L=12919236bp / C=63095
  * Terminal window [13] -buffer:[chr6:91152292-97518137] / +buffer:[chr6:90652274-98065574] / L=6365846bp / C=31548
  * Terminal window [14] -buffer:[chr6:97518139-104071527] / +buffer:[chr6:97018004-104577123] / L=6553389bp / C=31547
  * Internal window [chr6:104071528-132352709] / L=28281182bp / C=126189
  * Internal window [chr6:104071528-118372982] / L=14301455bp / C=63095
  * Terminal window [15] -buffer:[chr6:104071528-110454782] / +buffer:[chr6:103566858-110954891] / L=6383255bp / C=31548
  * Terminal window [16] -buffer:[chr6:110454783-118372982] / +buffer:[chr6:109925816-118885012] / L=7918200bp / C=31547
  * Internal window [chr6:118373301-132352709] / L=13979409bp / C=63094
  * Terminal window [17] -buffer:[chr6:118373301-125099407] / +buffer:[chr6:117873097-125599569] / L=6726107bp / C=31548
  * Terminal window [18] -buffer:[chr6:125099514-132352709] / +buffer:[chr6:124599490-132988895] / L=7253196bp / C=31546
  * Internal window [chr6:132352735-170744989] / L=38392255bp / C=252379
  * Internal window [chr6:132352735-157241583] / L=24888849bp / C=126190
  * Internal window [chr6:132352735-146564675] / L=14211941bp / C=63098
  * Terminal window [19] -buffer:[chr6:132352735-138960888] / +buffer:[chr6:131821158-139482591] / L=6608154bp / C=31550
  * Terminal window [20] -buffer:[chr6:138960971-146564675] / +buffer:[chr6:138460884-147064995] / L=7603705bp / C=31548
  * Internal window [chr6:146564679-157241583] / L=10676905bp / C=63092
  * Terminal window [21] -buffer:[chr6:146564679-151991440] / +buffer:[chr6:146055131-152491564] / L=5426762bp / C=31547
  * Terminal window [22] -buffer:[chr6:151991503-157241583] / +buffer:[chr6:151491219-157741653] / L=5250081bp / C=31545
  * Terminal window [23] -buffer:[chr6:157241819-170744989] / +buffer:[chr6:156680356-170744989] / L=13503171bp / C=126189
  * #chunks = 24

Total running time = 323 seconds

chr6:72241-8911239
chr6:7911474-14029318
chr6:13026927-19599242
chr6:18577569-33141881
chr6:32070923-43280008
chr6:42255897-49306975
chr6:48259536-55441622
chr6:54441197-62483380
chr6:61454804-67831170
chr6:66796360-73751975
chr6:72728849-79162564
chr6:78130532-85169973
chr6:84168043-91652103
chr6:90652274-98065574
chr6:97018004-104577123
chr6:103566858-110954891
chr6:109925816-118885012
chr6:117873097-125599569
chr6:124599490-132988895
chr6:131821158-139482591
chr6:138460884-147064995
chr6:146055131-152491564
chr6:151491219-157741653
chr6:156680356-170744989

UPDATE: Well, that didn't help, since variant count didn't drop in that failing shard---most likely this is HLA. Just went back to the original 13 shards and cranked up to 128GB. It might be a good idea to have a strategy for tuning runtimes to a particular sharding scheme at some point, if OOM retry continues to be so unreliable.

samuelklee commented 9 hours ago

Once these runs complete, I think we have the following cohorts for comparison on 40 HPRC samples, which should provide a basis for plots of summary statistics and accuracy vs. dipcall: