Closed theferrit32 closed 6 months ago
Ran this with the current args at the bottom of main.py and it finished the ~2.8 million variants in ~21 minutes, having skipped 47 variants which took longer than 10 seconds.
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_1.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280054
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_2.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_3.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_4.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_5.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_6.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_7.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_8.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_9.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_10.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Output written to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Output uploaded to gs://clinvar-gk-pilot/2024-04-07/dev/vi-output.json.gz
python clinvar_gk_pilot/main.py 2>&1 5730.73s user 2977.07s system 629% cpu 23:02.91 total
tee log 0.00s user 0.01s system 0% cpu 23:02.96 total
errors due to task timeout:
zgrep -rn "errors" output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz | grep "did not complete" | wc -l
5
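The same count can be taken in Python, which also makes it easy to list the affected variation_id values. This is a minimal sketch, assuming each output line is a JSON object and that timed-out records carry an "errors" string containing "did not complete" plus a "line" field holding the original input line as JSON text (the shape of the error records in this output). The function name timed_out_variation_ids is hypothetical and not part of main.py.

```python
import gzip
import json


def timed_out_variation_ids(path):
    """Return variation_ids of records whose "errors" mention a timeout.

    Assumes gzipped NDJSON where failed records look like
    {"errors": "Task did not complete in N seconds.", "line": "<input JSON>"}.
    """
    ids = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for raw in f:
            raw = raw.strip()
            if not raw:
                continue  # skip blank lines
            record = json.loads(raw)
            errors = record.get("errors")
            if isinstance(errors, str) and "did not complete" in errors:
                # "line" holds the original input line, itself JSON text.
                original = json.loads(record["line"])
                ids.append(original["variation_id"])
    return ids
```

Calling len(timed_out_variation_ids("output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz")) should then agree with the zgrep count, provided the records have the assumed shape.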
I identified four variants whose long unknown-region processing times were causing "runaway" processing. 47 seems too large a number.
{"variation_id":"1687628","name":"Single allele","assembly_version":"37","accession":"NC_000015.9","vrs_class":"Allele","range_copies":[],"fmt":"hgvs","source":"NC_000015.9:g.2831_2832dup","precedence":"4","variation_type":"Duplication","subclass_type":"SimpleAllele","cytogenetic":"15p13","chr":"15","mappings":[]}
{"variation_id":"1687107","name":"Single allele","assembly_version":"37","accession":"NC_000015.9","vrs_class":"Allele","range_copies":[],"fmt":"hgvs","source":"NC_000015.9:g.5185del","precedence":"4","variation_type":"Deletion","subclass_type":"SimpleAllele","cytogenetic":"15p13","chr":"15","mappings":[]}
{"variation_id":"1691679","name":"Single allele","assembly_version":"37","accession":"NC_000015.9","vrs_class":"Allele","range_copies":[],"fmt":"hgvs","source":"NC_000015.9:g.2897_2953del","precedence":"4","variation_type":"Deletion","subclass_type":"SimpleAllele","cytogenetic":"15p13","chr":"15","mappings":[]}
{"variation_id":"1691680","name":"Single allele","assembly_version":"37","accession":"NC_000015.9","vrs_class":"Allele","range_copies":[],"fmt":"hgvs","source":"NC_000015.9:g.7211_7214del","precedence":"4","variation_type":"Deletion","subclass_type":"SimpleAllele","cytogenetic":"15p13","chr":"15","mappings":[]}
Thanks for the info on those, @toneillbroad. With a 1 minute timeout I got those same 4, which is good validation, plus 1 other one, variation_id 11668.
2565051:{"errors": "Task did not complete in 60 seconds.", "line": "{\"variation_id\":\"11668\",\"name\":\"NM_004586.3(RPS6KA3):c.1444_1959dup (p.Val482_Lys653dup)\",\"accession\":\"NG_007488.1\",\"vrs_class\":\"Allele\",\"range_copies\":[],\"fmt\":\"hgvs\",\"source\":\"NG_007488.1:g.103742_114797dup\",\"precedence\":\"5\",\"variation_type\":\"Duplication\",\"subclass_type\":\"SimpleAllele\",\"cytogenetic\":\"Xp22.2-p22.1\",\"mappings\":[]}\n"}
I'm not sure why this one took longer than a minute; the reference sequence is only 515 bases.
id values from each line of the NDJSON file.
gs:// files. Default is buckets/<bucket>/<blob-prefix>/<blob-basename>. e.g. "gs://clinvar-gk-pilot/2024-04-07/dev/vi.json.gz" gets cached to ./buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
gs:// file if it doesn't already exist in the default local cache directory.
output directory, with the same relative path under there as the input file. e.g. gs://clinvar-gk-pilot/2024-04-07/dev/vi.json.gz gets cached to ./buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz and the output gets written to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
--parallelism
--parallelism is not 0, also runs each task (e.g. process_line(line) for each line of input) in a separate process which can be interrupted after some timeout. This lets us stop normalization of variants that take too long because they are nonsensical (e.g. deleting an N inside a huge N region of the genomic reference sequence; see https://github.com/ga4gh/vrs-python/issues/397).
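
The per-task timeout described above can be sketched with the standard multiprocessing module. This is a minimal illustration under stated assumptions, not the code in main.py: run_with_timeout, _worker, and the start_method parameter are hypothetical names, and process_line here is a stand-in that only sleeps. The error text mirrors the "Task did not complete in 60 seconds." records seen in the output.

```python
import multiprocessing as mp
import queue
import time


def _worker(func, arg, out):
    # Runs in the child process; send the result (or the failure) to the parent.
    try:
        out.put(func(arg))
    except Exception as exc:
        out.put({"errors": repr(exc), "line": arg})


def run_with_timeout(func, arg, timeout_s, start_method=None):
    """Run func(arg) in its own process; give up after timeout_s seconds.

    start_method=None uses the platform default ("fork", "spawn", ...).
    """
    ctx = mp.get_context(start_method)
    out = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(func, arg, out))
    proc.start()
    try:
        # Read the result before joining so a large result cannot block the child.
        result = out.get(timeout=timeout_s)
    except queue.Empty:
        proc.terminate()  # interrupt the runaway task
        result = {
            "errors": f"Task did not complete in {timeout_s} seconds.",
            "line": arg,
        }
    proc.join()
    return result


def process_line(line):
    # Stand-in for the real per-line work (hypothetical body):
    # sleeps to imitate a variant that takes too long to normalize.
    time.sleep(5 if "slow" in line else 0)
    return {"line": line, "ok": True}


if __name__ == "__main__":
    print(run_with_timeout(process_line, "fast variant", 2))
    print(run_with_timeout(process_line, "slow variant", 2))
```

Starting a fresh process per task keeps the sketch short but is expensive; over millions of lines it would likely be cheaper to keep long-lived workers and replace one only after it times out.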