clingen-data-model / clinvar-gk-python

Project for reading and normalizing ClinVar variants into GA4GH GKS forms
MIT License

Multiprocessing and timeouts #8

Closed theferrit32 closed 6 months ago

theferrit32 commented 7 months ago

Ran this with the current args at the bottom of main.py and it finished the ~2.8 million variants in ~21 minutes, having skipped 47 variants that took longer than 10 seconds.
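For readers unfamiliar with per-task timeouts, here is a minimal sketch of one way to skip a record that runs too long. It assumes a Unix system and that the call happens in the main thread of a worker process; `run_with_timeout`, `TaskTimeout`, and the `func` argument are hypothetical names, and main.py may implement its timeout differently. The error record mirrors the shape that appears in the output file.

```python
import signal


class TaskTimeout(Exception):
    """Raised inside the task when the interval timer fires."""


def _raise_timeout(signum, frame):
    raise TaskTimeout()


def run_with_timeout(func, line, timeout_s):
    """Run func(line); on timeout, return an error record instead of a result.

    Assumption: Unix only (SIGALRM), and called from the main thread.
    """
    previous = signal.signal(signal.SIGALRM, _raise_timeout)
    signal.setitimer(signal.ITIMER_REAL, timeout_s)
    try:
        return func(line)
    except TaskTimeout:
        return {
            "errors": f"Task did not complete in {timeout_s} seconds.",
            "line": line,
        }
    finally:
        # Cancel any pending alarm and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)
```

The signal interrupts the task in place, so this only works for code that returns to the Python interpreter often enough to receive it; a task stuck inside a long C call would need to be run in a separate process and terminated instead.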

theferrit32 commented 7 months ago
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_1.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280054
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_2.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_3.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_4.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_5.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_6.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_7.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_8.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_9.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_10.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Output written to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Output uploaded to gs://clinvar-gk-pilot/2024-04-07/dev/vi-output.json.gz
python clinvar_gk_pilot/main.py 2>&1  5730.73s user 2977.07s system 629% cpu 23:02.91 total
tee log  0.00s user 0.01s system 0% cpu 23:02.96 total
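The per-part line counts in the log are consistent with an even split of 2,800,531 lines across 10 parts, with the one leftover line going to the first part. A minimal sketch of that arithmetic follows; `part_sizes` is a hypothetical helper and not necessarily how main.py divides the input.

```python
def part_sizes(total_lines, n_parts):
    """Split total_lines into n_parts sizes that differ by at most one.

    Any remainder is handed out one line at a time to the first parts.
    """
    base, extra = divmod(total_lines, n_parts)
    return [base + 1 if i < extra else base for i in range(n_parts)]


sizes = part_sizes(2800531, 10)
print(sizes[0], sizes[1], sum(sizes))
```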

errors due to task timeout:

zgrep -rn "errors" output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz | grep "did not complete" | wc -l
5
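The same count can be taken by parsing the records rather than grepping, which avoids matching the word "errors" inside an embedded input line. This is a sketch that assumes the output file is gzipped JSON Lines with a string `errors` field on failed records, as in the sample shown in this thread; `count_timeout_errors` is a hypothetical helper.

```python
import gzip
import json


def count_timeout_errors(path):
    """Count records whose top-level "errors" field reports a task timeout."""
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for raw in f:
            raw = raw.strip()
            if not raw:
                continue  # tolerate blank lines
            record = json.loads(raw)
            if not isinstance(record, dict):
                continue
            errors = record.get("errors")
            if isinstance(errors, str) and "did not complete" in errors:
                count += 1
    return count
```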

toneillbroad commented 7 months ago

I identified four variants with long unknown-region processing times that were causing "runaway" processing. 47 seems too large a number.

{"variation_id":"1687628","name":"Single allele","assembly_version":"37","accession":"NC_000015.9","vrs_class":"Allele","range_copies":[],"fmt":"hgvs","source":"NC_000015.9:g.2831_2832dup","precedence":"4","variation_type":"Duplication","subclass_type":"SimpleAllele","cytogenetic":"15p13","chr":"15","mappings":[]}
{"variation_id":"1687107","name":"Single allele","assembly_version":"37","accession":"NC_000015.9","vrs_class":"Allele","range_copies":[],"fmt":"hgvs","source":"NC_000015.9:g.5185del","precedence":"4","variation_type":"Deletion","subclass_type":"SimpleAllele","cytogenetic":"15p13","chr":"15","mappings":[]}
{"variation_id":"1691679","name":"Single allele","assembly_version":"37","accession":"NC_000015.9","vrs_class":"Allele","range_copies":[],"fmt":"hgvs","source":"NC_000015.9:g.2897_2953del","precedence":"4","variation_type":"Deletion","subclass_type":"SimpleAllele","cytogenetic":"15p13","chr":"15","mappings":[]}
{"variation_id":"1691680","name":"Single allele","assembly_version":"37","accession":"NC_000015.9","vrs_class":"Allele","range_copies":[],"fmt":"hgvs","source":"NC_000015.9:g.7211_7214del","precedence":"4","variation_type":"Deletion","subclass_type":"SimpleAllele","cytogenetic":"15p13","chr":"15","mappings":[]}

theferrit32 commented 7 months ago

Thanks for the info on those, @toneillbroad. With a 1 minute timeout I got those same 4, which is good validation, plus 1 other one, variation_id 11668.

2565051:{"errors": "Task did not complete in 60 seconds.", "line": "{\"variation_id\":\"11668\",\"name\":\"NM_004586.3(RPS6KA3):c.1444_1959dup (p.Val482_Lys653dup)\",\"accession\":\"NG_007488.1\",\"vrs_class\":\"Allele\",\"range_copies\":[],\"fmt\":\"hgvs\",\"source\":\"NG_007488.1:g.103742_114797dup\",\"precedence\":\"5\",\"variation_type\":\"Duplication\",\"subclass_type\":\"SimpleAllele\",\"cytogenetic\":\"Xp22.2-p22.1\",\"mappings\":[]}\n"}

I'm not sure why this one took longer than a minute; the reference sequence is only 515 bases.
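As a quick size check on the records in this thread, the number of positions covered by the `g.` range in a `source` expression can be computed directly. This is only a sketch: `genomic_span_length` is a hypothetical helper that handles the simple `dup`/`del` forms seen above, it measures the span named in the expression rather than the length of any reference sequence, and it does not by itself explain the slow processing.

```python
import re

# Matches the simple forms in this thread, e.g. "...:g.5185del" or "...:g.2897_2953del".
_RANGE = re.compile(r":g\.(\d+)(?:_(\d+))?(?:dup|del)$")


def genomic_span_length(source):
    """Return the number of positions in a g. range, or None if unparsed."""
    match = _RANGE.search(source)
    if match is None:
        return None
    start = int(match.group(1))
    # A single position such as g.5185del has no explicit end.
    end = int(match.group(2) or match.group(1))
    return end - start + 1
```

For `NG_007488.1:g.103742_114797dup` this returns 11056, and for `NC_000015.9:g.2897_2953del` it returns 57.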