Open samuelklee opened 1 month ago
Updates:
Thanks for the update! Two things:
1) Note that one flaw of the cost-estimation script is that it doesn’t actually give the real cost. It doesn’t account for preemption and cost rates for various resources are provided as fixed parameters. But hopefully we can roughly trust the distribution it reports.
2) If so, this suggests that with the per-sample bcftools view
, at best we can hope for about a 50-50 split between subsetting and HiPhase. This means we will end up paying at least $500 just to subset the joint VCF! Surely a bcftools split
strategy can beat that?
Will add notes on submissions for chr6 evaluations against dipcall truth here:
40 HPRC GATK-SV, w/o realign flags, 5 samples failing: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/16dd0d83-52da-4e48-b81e-fe51a87a90f7 40 HPRC GATK-SV, w/ realign flags, 1 sample failing: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/5aa1a8f0-8235-4e39-b11f-85266bc02bde
40 HPRC SR vs. full HPRC+AoU1, w/o realign flags: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/afd5183c-cc35-4949-a5bc-6f4f61b9e585 40 HPRC SR vs. full HPRC+AoU1, w/ realign flags: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/c642a3ca-afeb-40b8-a030-f46464867a3d
InputPhasedPanelEvaluation construction/evaluation of HPRC+AoU1-leave-out-40-HPRC, w/ realign flags: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/eb8b7b5c-4962-4b4f-a7d4-42ec72dd0cbe
40 HPRC LR HPRC+AoU1 filter+concat, w/ realign flags: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/128c0302-7110-43f7-93e5-95633875ea0d 40 HPRC LR HPRC+AoU1 panel, w/ realign flags: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/ef9cf97c-6342-42e6-a6e7-4772b9882e39
Note that the realign flags seem to make a minimal impact, at least from spot checking; I'll double-check with plots across all samples.
Given maintained accuracy seen in https://github.com/broadinstitute/lrma-aou1-panel-creation/issues/38#issuecomment-239236454, we can proceed with a sharded run with AF>=0.01 + 10kb SV windowing over all of chr6.
To determine more appropriate shards, we can use a run linked in that comment, for which we only ran 2 manually specified 10Mb Shapeit4 shards: https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/6e2c5ed2-a102-466e-9dbc-8eae7bae1021
This run contains an filtered+windowed+FilterAndConcat (where FilterAndConcat refers to singleton filtering and short+SV concatenation) VCF over all of chr6, not just the 2 10Mb shards, which is the input to Shapeit4: gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/6e2c5ed2-a102-466e-9dbc-8eae7bae1021/PhasedPanelEvaluation/bf009a41-e06b-446e-a62b-03373f65dc8c/call-FilterAndConcatVcfs/HPRCAOU.chr6.filter_and_concat.vcf.gz
We can use the GLIMPSE1 chunk tool on this VCF to generate better shards than our naive 10Mb shards:
./GLIMPSE_chunk_static -I HPRCAOU.chr6.filter_and_concat.vcf.gz --region chr6 --window-size 10000000 --buffer-size 500000 -O chr6.chunks.tsv
[GLIMPSE] Split chromosomes into chunks
* Author : Simone RUBINACCI & Olivier DELANEAU, University of Lausanne
* Contact : simone.rubinacci@unil.ch & olivier.delaneau@unil.ch
* Version : 1.1.1
* Run date : 03/10/2024 - 20:59:14
Files:
* Input VCF : [HPRCAOU.chr6.filter_and_concat.vcf.gz]
* Chromosome : [chr6]
* Output file : [chr6.chunks.tsv]
Parameters:
* Seed : 15052011
* #Threads : 1
* Min. Window size : 10000000bp / 1000 variants
* Min. Buffer size : 500000bp / 100 variants
Reading input files
* Main : [HPRCAOU.chr6.filter_and_concat.vcf.gz]
* #variants = 1009520 (385.52s)
Splitting data into chunks and writting to [chr6.chunks.tsv]
* Internal window [chr6:72241-170744989] / L=170672749bp / C=1009520
* Internal window [chr6:72241-78630557] / L=78558317bp / C=504761
* Internal window [chr6:72241-32641877] / L=32569637bp / C=252381
* Terminal window [0] -buffer:[chr6:72241-19099133] / +buffer:[chr6:72241-19599242] / L=19026893bp / C=126191
* Terminal window [1] -buffer:[chr6:19099170-32641877] / +buffer:[chr6:18577569-33141881] / L=13542708bp / C=126190
* Internal window [chr6:32641878-78630557] / L=45988680bp / C=252380
* Internal window [chr6:32641878-54941581] / L=22299704bp / C=126191
* Terminal window [2] -buffer:[chr6:32641878-42755864] / +buffer:[chr6:32070923-43280008] / L=10113987bp / C=63096
* Terminal window [3] -buffer:[chr6:42755911-54941581] / +buffer:[chr6:42255897-55441622] / L=12185671bp / C=63095
* Internal window [chr6:54941691-78630557] / L=23688867bp / C=126189
* Terminal window [4] -buffer:[chr6:54941691-67331114] / +buffer:[chr6:54441197-67831170] / L=12389424bp / C=63095
* Terminal window [5] -buffer:[chr6:67331412-78630557] / +buffer:[chr6:66796360-79162564] / L=11299146bp / C=63094
* Internal window [chr6:78630635-170744989] / L=92114355bp / C=504759
* Internal window [chr6:78630635-132352709] / L=53722075bp / C=252380
* Internal window [chr6:78630635-104071527] / L=25440893bp / C=126191
* Terminal window [6] -buffer:[chr6:78630635-91152068] / +buffer:[chr6:78130532-91652103] / L=12521434bp / C=63096
* Terminal window [7] -buffer:[chr6:91152292-104071527] / +buffer:[chr6:90652274-104577123] / L=12919236bp / C=63095
* Internal window [chr6:104071528-132352709] / L=28281182bp / C=126189
* Terminal window [8] -buffer:[chr6:104071528-118372982] / +buffer:[chr6:103566858-118885012] / L=14301455bp / C=63095
* Terminal window [9] -buffer:[chr6:118373301-132352709] / +buffer:[chr6:117873097-132988895] / L=13979409bp / C=63094
* Internal window [chr6:132352735-170744989] / L=38392255bp / C=252379
* Internal window [chr6:132352735-157241583] / L=24888849bp / C=126190
* Terminal window [10] -buffer:[chr6:132352735-146564675] / +buffer:[chr6:131821158-147064995] / L=14211941bp / C=63098
* Terminal window [11] -buffer:[chr6:146564679-157241583] / +buffer:[chr6:146055131-157741653] / L=10676905bp / C=63092
* Terminal window [12] -buffer:[chr6:157241819-170744989] / +buffer:[chr6:156680356-170744989] / L=13503171bp / C=126189
* #chunks = 13
Total running time = 385 seconds
Cutting the regions from the resulting file yields:
cut -f 3 chr6.chunks.tsv
chr6:72241-19599242
chr6:18577569-33141881
chr6:32070923-43280008
chr6:42255897-55441622
chr6:54441197-67831170
chr6:66796360-79162564
chr6:78130532-91652103
chr6:90652274-104577123
chr6:103566858-118885012
chr6:117873097-132988895
chr6:131821158-147064995
chr6:146055131-157741653
chr6:156680356-170744989
Copied to gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/scratch/slee/chr6.chunks.region.tsv
.
Did this on a VM manually, we should put it into the workflow before kicking off WG. Probably doesn't make much of a difference since our panel is probably plenty dense, but it's the sort of thing you're supposed to do, so we might as well.
Kicked off the PhasedPanelEvaluationFromHiPhase workflow leaving out 40 HPRC samples here: https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/913b51d5-df0b-4919-a6cb-6db485d6c1b9
If the numbers look good, we should run InputPhasedPanelEvaluation using the Shapeit4 result (which is on the full panel) here to generate the full KAGE+GLIMPSE panel, rather than the leave-out. Then we can reimpute MAGE and get feedback on whether this reduced chr6 panel with fewer short variants is acceptable for eQTLs. If so, then we can proceed to WG. Alternatively, if the costs start looking more reasonable without HiPhase, then we can go ahead with the unfiltered/windowed panel.
UPDATE: The second shard consistently fails with this scheme, even going up to a very underutilized 96GB (see https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/080bbe5c-ef87-4fd7-955c-62bd1ec3ba21 and https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/bd097085-f637-4579-a787-5720d27ae440). This shard has the most variants at ~140k, although a couple of others have ~120-130k and succeed. Others have ~60k.
Bumping down the min shard size to 5Mb yields:
./GLIMPSE_chunk_static -I HPRCAOU.chr6.filter_and_concat.vcf.gz --region chr6 --window-size 5000000 --buffer-size 500000 -O chr6.chunks.tsv --thread 4
[GLIMPSE] Split chromosomes into chunks
* Author : Simone RUBINACCI & Olivier DELANEAU, University of Lausanne
* Contact : simone.rubinacci@unil.ch & olivier.delaneau@unil.ch
* Version : 1.1.1
* Run date : 04/10/2024 - 02:32:49
Files:
* Input VCF : [HPRCAOU.chr6.filter_and_concat.vcf.gz]
* Chromosome : [chr6]
* Output file : [chr6.chunks.tsv]
Parameters:
* Seed : 15052011
* #Threads : 4
* Min. Window size : 5000000bp / 1000 variants
* Min. Buffer size : 500000bp / 100 variants
Reading input files
* Main : [HPRCAOU.chr6.filter_and_concat.vcf.gz]
* #variants = 1009520 (323.05s)
Splitting data into chunks and writting to [chr6.chunks.tsv]
* Internal window [chr6:72241-170744989] / L=170672749bp / C=1009520
* Internal window [chr6:72241-78630557] / L=78558317bp / C=504761
* Internal window [chr6:72241-32641877] / L=32569637bp / C=252381
* Internal window [chr6:72241-19099133] / L=19026893bp / C=126191
* Terminal window [0] -buffer:[chr6:72241-8411080] / +buffer:[chr6:72241-8911239] / L=8338840bp / C=63096
* Internal window [chr6:8411530-19099133] / L=10687604bp / C=63095
* Terminal window [1] -buffer:[chr6:8411530-13529316] / +buffer:[chr6:7911474-14029318] / L=5117787bp / C=31548
* Terminal window [2] -buffer:[chr6:13529476-19099133] / +buffer:[chr6:13026927-19599242] / L=5569658bp / C=31547
* Terminal window [3] -buffer:[chr6:19099170-32641877] / +buffer:[chr6:18577569-33141881] / L=13542708bp / C=126190
* Internal window [chr6:32641878-78630557] / L=45988680bp / C=252380
* Internal window [chr6:32641878-54941581] / L=22299704bp / C=126191
* Terminal window [4] -buffer:[chr6:32641878-42755864] / +buffer:[chr6:32070923-43280008] / L=10113987bp / C=63096
* Internal window [chr6:42755911-54941581] / L=12185671bp / C=63095
* Terminal window [5] -buffer:[chr6:42755911-48786404] / +buffer:[chr6:42255897-49306975] / L=6030494bp / C=31548
* Terminal window [6] -buffer:[chr6:48786493-54941581] / +buffer:[chr6:48259536-55441622] / L=6155089bp / C=31547
* Internal window [chr6:54941691-78630557] / L=23688867bp / C=126189
* Internal window [chr6:54941691-67331114] / L=12389424bp / C=63095
* Terminal window [7] -buffer:[chr6:54941691-61954841] / +buffer:[chr6:54441197-62483380] / L=7013151bp / C=31551
* Terminal window [8] -buffer:[chr6:61954846-67331114] / +buffer:[chr6:61454804-67831170] / L=5376269bp / C=31544
* Internal window [chr6:67331412-78630557] / L=11299146bp / C=63094
* Terminal window [9] -buffer:[chr6:67331412-73228807] / +buffer:[chr6:66796360-73751975] / L=5897396bp / C=31548
* Terminal window [10] -buffer:[chr6:73229192-78630557] / +buffer:[chr6:72728849-79162564] / L=5401366bp / C=31546
* Internal window [chr6:78630635-170744989] / L=92114355bp / C=504759
* Internal window [chr6:78630635-132352709] / L=53722075bp / C=252380
* Internal window [chr6:78630635-104071527] / L=25440893bp / C=126191
* Internal window [chr6:78630635-91152068] / L=12521434bp / C=63096
* Terminal window [11] -buffer:[chr6:78630635-84669766] / +buffer:[chr6:78130532-85169973] / L=6039132bp / C=31550
* Terminal window [12] -buffer:[chr6:84669877-91152068] / +buffer:[chr6:84168043-91652103] / L=6482192bp / C=31546
* Internal window [chr6:91152292-104071527] / L=12919236bp / C=63095
* Terminal window [13] -buffer:[chr6:91152292-97518137] / +buffer:[chr6:90652274-98065574] / L=6365846bp / C=31548
* Terminal window [14] -buffer:[chr6:97518139-104071527] / +buffer:[chr6:97018004-104577123] / L=6553389bp / C=31547
* Internal window [chr6:104071528-132352709] / L=28281182bp / C=126189
* Internal window [chr6:104071528-118372982] / L=14301455bp / C=63095
* Terminal window [15] -buffer:[chr6:104071528-110454782] / +buffer:[chr6:103566858-110954891] / L=6383255bp / C=31548
* Terminal window [16] -buffer:[chr6:110454783-118372982] / +buffer:[chr6:109925816-118885012] / L=7918200bp / C=31547
* Internal window [chr6:118373301-132352709] / L=13979409bp / C=63094
* Terminal window [17] -buffer:[chr6:118373301-125099407] / +buffer:[chr6:117873097-125599569] / L=6726107bp / C=31548
* Terminal window [18] -buffer:[chr6:125099514-132352709] / +buffer:[chr6:124599490-132988895] / L=7253196bp / C=31546
* Internal window [chr6:132352735-170744989] / L=38392255bp / C=252379
* Internal window [chr6:132352735-157241583] / L=24888849bp / C=126190
* Internal window [chr6:132352735-146564675] / L=14211941bp / C=63098
* Terminal window [19] -buffer:[chr6:132352735-138960888] / +buffer:[chr6:131821158-139482591] / L=6608154bp / C=31550
* Terminal window [20] -buffer:[chr6:138960971-146564675] / +buffer:[chr6:138460884-147064995] / L=7603705bp / C=31548
* Internal window [chr6:146564679-157241583] / L=10676905bp / C=63092
* Terminal window [21] -buffer:[chr6:146564679-151991440] / +buffer:[chr6:146055131-152491564] / L=5426762bp / C=31547
* Terminal window [22] -buffer:[chr6:151991503-157241583] / +buffer:[chr6:151491219-157741653] / L=5250081bp / C=31545
* Terminal window [23] -buffer:[chr6:157241819-170744989] / +buffer:[chr6:156680356-170744989] / L=13503171bp / C=126189
* #chunks = 24
Total running time = 323 seconds
chr6:72241-8911239
chr6:7911474-14029318
chr6:13026927-19599242
chr6:18577569-33141881
chr6:32070923-43280008
chr6:42255897-49306975
chr6:48259536-55441622
chr6:54441197-62483380
chr6:61454804-67831170
chr6:66796360-73751975
chr6:72728849-79162564
chr6:78130532-85169973
chr6:84168043-91652103
chr6:90652274-98065574
chr6:97018004-104577123
chr6:103566858-110954891
chr6:109925816-118885012
chr6:117873097-125599569
chr6:124599490-132988895
chr6:131821158-139482591
chr6:138460884-147064995
chr6:146055131-152491564
chr6:151491219-157741653
chr6:156680356-170744989
UPDATE: Well, that didn't help, since variant count didn't drop in that failing shard---most likely this is HLA. Just went back to the original 13 shards and cranked up to 128GB. It might be a good idea to have a strategy for tuning runtimes to a particular sharding scheme at some point, if OOM retry continues to be so unreliable.
Completed: PhasedPanelEvaluationFromHiPhase (only failed input short evaluation): https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/3063a8e4-93a8-424a-81c4-520699d4c941 InputPhasedPanelEvaluation continuation: https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/73371f89-5a5b-4b5b-bafd-57f9ab5de853
Once these runs complete, I think we have the following cohorts for comparison on 40 HPRC samples, which should provide a basis for plots of summary statistics and accuracy vs. dipcall:
Notes for resolving GATK-SV symbolic alleles:
python ~/Downloads/resolve_light.py 1KGP_3202.gatksv_svtools_novelins.freeze_V3.wAF.chr6.vcf.gz /mnt/4AB658D7B658C4DB/working/ref/Homo_sapiens_assembly38.fasta | bgzip -c > 1KGP_3202.gatksv_svtools_novelins.freeze_V3.wAF.chr6.resolve_light_insseq.vcf.gz
bbcftools query -f '%CHROM\t%POS\t%ID\t%REFSEQ\t%INSSEQ\n' 1KGP_3202.gatksv_svtools_novelins.freeze_V3.wAF.chr6.resolve_light_insseq.vcf.gz | bgzip -c > 1KGP_3202.gatksv_svtools_novelins.freeze_V3.wAF.chr6.resolve_light_insseq.annot.tsv.gz
tabix -s1 -b2 -e2 1KGP_3202.gatksv_svtools_novelins.freeze_V3.wAF.chr6.resolve_light_insseq.annot.tsv.gz
echo "##INFO=<ID=INSSEQ,Number=.,Type=String,Description=\"Sequence of insertion\">" > header.txt
echo "##INFO=<ID=REFSEQ,Number=.,Type=String,Description="Original reference sequence">" >> header.txt
bcftools annotate 1kGP_high_coverage_Illumina.chr6.filtered.SNV_INDEL_SV_phased_panel.vcf.gz -a 1KGP_3202.gatksv_svtools_novelins.freeze_V3.wAF.chr6.resolve_light_insseq.annot.tsv.gz -c CHROM,POS,~ID,REFSEQ,INSSEQ -h header.txt | bgzip -c > 1kGP_high_coverage_Illumina.chr6.filtered.SNV_INDEL_SV_phased_panel.annot.vcf.gz
python ~/Downloads/copy_seq.py 1kGP_high_coverage_Illumina.chr6.filtered.SNV_INDEL_SV_phased_panel.annot.vcf.gz /mnt/4AB658D7B658C4DB/working/ref/Homo_sapiens_assembly38.fasta | bgzip -c > 1kGP_high_coverage_Illumina.chr6.filtered.SNV_INDEL_SV_phased_panel.resolved.vcf.gz
Preliminary version of plot revealed only short-variant alleles are resolved, yielding ~10% SV recall: https://docs.google.com/presentation/d/1z5TFuvydlCBbskWCsGzLnloHJKkiF1U9AIr8RfxM1TQ/edit?usp=sharing
40 HPRC GATK-SV resolved, w/ realign flags: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/30beaa70-385c-4ef8-ba9b-05c70446732f
More preliminary figures here, showing Byrska-Bishop phased panel recall ~40% after resolving DELs and INSs as above: https://docs.google.com/presentation/d/1z5TFuvydlCBbskWCsGzLnloHJKkiF1U9AIr8RfxM1TQ/edit#slide=id.p
Currently rerunning with the MHC (chr6:28,510,120-33,480,577) masked out of the non-TR/homopolymer BED with bedtools:
LR filter+concat (unimputed): https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/9d7cb577-02ea-4962-86ac-c97c32967829
LR panel (imputed): https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/69b8bce4-2877-4470-9f6e-c1824e88f5a1
SR vs. leave-out-40: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/c73a6547-b851-4ebd-b6c4-4e29b288fb63
LR filtered/windowed filter+concat (unimputed): https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/5d8afec7-0e8a-4e7c-925c-4c9905e517bf
LR filtered/windowed panel (imputed) and SR vs. filtered/windowed leave-out-40: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/0d02de3d-5e94-4dcd-a503-5e2146ddd2cd
HPRC + AoU 1074 sample whole genome hiphased vcfs have been completed and pasted to the terra work table. The cost varies from 0.1-0.9$ per sample. Currently merging it to a multi-sample filtered vcfs.
Will come back and write a status update now that physical+statistical phasing is complete, but see status slide from today.
Hit a snafu when building the KAGE panel. The KAGE obgraph package didn't like a site at chr7:57336990
and threw an exception:
chr7 57336989 . T TG . PASS ID=chr7-57336989-allele618843-2
chr7 57336990 . GT GG . PASS ID=chr7-57336991-allele618845-1
Surprised this is the only instance in ~20M variants. Not quite sure why it's problematic, but it may have something to do with the fact that it's not atomized---in which case, not quite sure why the upstream bcftools norm -a
in PanGeniePanelCreation didn't take care of it.
Will manually remove and rerun for now.
@hangsuUNC Run HiPhase on whole genome using current parameters with no filtering for all samples:
@samuelklee Experiment with reducing short-variant count for Shapeit4 while maintaining SV imputation performance (Iris is focusing on AF > 0.05 for eQTL analyses):
@hangsuUNC Statistical phasing:
Goal is for phasing to be complete done within the first week or so. Hopefully, HiPhase can be done in the next day or two, at the very least; Shapeit4 could stand some tinkering with cutting variant count, but at some point we should just bite the bullet. Perhaps after running the sharded workflow on a single chromosome and confirming the cost. What do you think, @hangsuUNC?
@samuelklee Main figure (w/ chr6 first, WG when ready):
@samuelklee Generate inputs for eQTL:
@samuelklee Supplementary figures:
@kvg anything missing?
EDIT: Rather than leave-out-all-HPRC, let's do leave-out-40-HPRC---these are the 40 in TGP with readily available SR.