Multiprocessing and timeouts

theferrit32 commented 7 months ago

Adds catvar_combiner.py (which can be adapted and made generic later) to combine several NDJSON files into a single file containing one JSON document, keyed by the id value from each line of the NDJSON files.
Adds logic to generate a local relative path for caching gs:// files. The default is buckets/<bucket>/<blob-prefix>/<blob-basename>, e.g. "gs://clinvar-gk-pilot/2024-04-07/dev/vi.json.gz" is cached to ./buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Adds logic to download a gs:// file only if it doesn't already exist in the default local cache directory.
Writes output files to the output directory, under the same relative path as the cached input file, e.g. gs://clinvar-gk-pilot/2024-04-07/dev/vi.json.gz is cached to ./buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz and its output is written to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Adds optional parallelism. Partitions the input file into N files with approximately equal numbers of lines and runs one process over each partition. The number of partitions is set with the CLI arg --parallelism
When --parallelism is not 0, also runs each task (e.g. process_line(line) for each line of input) in a separate process that can be interrupted after a timeout. This lets us stop normalization of variants that take too long because they are nonsensical (e.g. deleting an N inside a huge N region of the genomic reference sequence; see https://github.com/ga4gh/vrs-python/issues/397)
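
For readers who want a picture of the per-task timeout described in the last item, here is a minimal sketch. It is not taken from main.py and may differ from what the PR actually does. The names run_with_timeout and _worker are hypothetical, and the sketch assumes a POSIX system where the "fork" start method is available. The error record mirrors the "errors"/"line" shape of the timeout record quoted later in this thread.

```python
import multiprocessing as mp

# Assumption: the POSIX-only "fork" start method, so the task function does
# not need to be importable by name. main.py may use a different approach.
_CTX = mp.get_context("fork")


def _worker(func, arg, conn):
    """Run func(arg) in the child process and send back a result or an error."""
    try:
        conn.send({"result": func(arg)})
    except Exception as exc:  # report task failures as an error record
        conn.send({"errors": str(exc), "line": arg})
    finally:
        conn.close()


def run_with_timeout(func, arg, timeout_s):
    """Run func(arg) in its own process and terminate it after timeout_s seconds."""
    recv_conn, send_conn = _CTX.Pipe(duplex=False)
    proc = _CTX.Process(target=_worker, args=(func, arg, send_conn))
    proc.start()
    send_conn.close()  # the parent keeps only the receiving end
    try:
        if recv_conn.poll(timeout_s):
            try:
                return recv_conn.recv()
            except EOFError:  # the child exited before sending anything
                return {"errors": "Task exited without a result.", "line": arg}
        proc.terminate()  # give up on the runaway task
        return {
            "errors": f"Task did not complete in {timeout_s} seconds.",
            "line": arg,
        }
    finally:
        proc.join()
        recv_conn.close()
```

Starting one process per task costs more than a thread would, but a process can be killed from outside, which is what makes it possible to abandon a normalization that never returns.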

theferrit32 commented 7 months ago

Ran this with the current args at the bottom of main.py and it finished the ~2.8 million variants in ~21 minutes, having skipped 47 variants which took longer than 10 seconds.

theferrit32 commented 7 months ago

Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_1.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280054
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_2.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_3.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_4.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_5.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_6.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_7.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_8.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_9.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_10.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Output written to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Output uploaded to gs://clinvar-gk-pilot/2024-04-07/dev/vi-output.json.gz
python clinvar_gk_pilot/main.py 2>&1  5730.73s user 2977.07s system 629% cpu 23:02.91 total
tee log  0.00s user 0.01s system 0% cpu 23:02.96 total
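
The part sizes in the log above (one part of 280054 lines, nine of 280053) are what you get when lines are split as evenly as possible and the remainder goes to the first parts. A small sketch of that arithmetic follows. The function names are hypothetical and this is not necessarily how main.py partitions the file.

```python
def partition_sizes(total_lines, parts):
    """Split total_lines into `parts` sizes that differ by at most one line."""
    base, extra = divmod(total_lines, parts)
    # The first `extra` parts each take one additional line.
    return [base + 1 if i < extra else base for i in range(parts)]


def partition_lines(lines, parts):
    """Slice a list of lines into consecutive chunks of the sizes above."""
    chunks, start = [], 0
    for size in partition_sizes(len(lines), parts):
        chunks.append(lines[start:start + size])
        start += size
    return chunks
```

With 2,800,531 input lines and 10 parts, partition_sizes returns 280054 followed by nine values of 280053, matching the counts logged above.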

errors due to task timeout:

zgrep -rn "errors" output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz | grep "did not complete" | wc -l
5

toneillbroad commented 7 months ago

I identified four variants in unknown (N) regions whose long processing times were causing "runaway" processing. 47 seems too large a number.

{"variation_id":"1687628","name":"Single allele","assembly_version":"37","accession":"NC_000015.9","vrs_class":"Allele","range_copies":[],"fmt":"hgvs","source":"NC_000015.9:g.2831_2832dup","precedence":"4","variation_type":"Duplication","subclass_type":"SimpleAllele","cytogenetic":"15p13","chr":"15","mappings":[]}
{"variation_id":"1687107","name":"Single allele","assembly_version":"37","accession":"NC_000015.9","vrs_class":"Allele","range_copies":[],"fmt":"hgvs","source":"NC_000015.9:g.5185del","precedence":"4","variation_type":"Deletion","subclass_type":"SimpleAllele","cytogenetic":"15p13","chr":"15","mappings":[]}
{"variation_id":"1691679","name":"Single allele","assembly_version":"37","accession":"NC_000015.9","vrs_class":"Allele","range_copies":[],"fmt":"hgvs","source":"NC_000015.9:g.2897_2953del","precedence":"4","variation_type":"Deletion","subclass_type":"SimpleAllele","cytogenetic":"15p13","chr":"15","mappings":[]}
{"variation_id":"1691680","name":"Single allele","assembly_version":"37","accession":"NC_000015.9","vrs_class":"Allele","range_copies":[],"fmt":"hgvs","source":"NC_000015.9:g.7211_7214del","precedence":"4","variation_type":"Deletion","subclass_type":"SimpleAllele","cytogenetic":"15p13","chr":"15","mappings":[]}

theferrit32 commented 7 months ago

Thanks for the info on those, @toneillbroad. With a 1 minute timeout I got those same 4, which is good validation, plus one other, variation_id 11668.

2565051:{"errors": "Task did not complete in 60 seconds.", "line": "{\"variation_id\":\"11668\",\"name\":\"NM_004586.3(RPS6KA3):c.1444_1959dup (p.Val482_Lys653dup)\",\"accession\":\"NG_007488.1\",\"vrs_class\":\"Allele\",\"range_copies\":[],\"fmt\":\"hgvs\",\"source\":\"NG_007488.1:g.103742_114797dup\",\"precedence\":\"5\",\"variation_type\":\"Duplication\",\"subclass_type\":\"SimpleAllele\",\"cytogenetic\":\"Xp22.2-p22.1\",\"mappings\":[]}\n"}

I'm not sure why this one took longer than a minute; the reference sequence is only 515 bases.
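
To list which variants timed out rather than only counting them, records like the one quoted above can be parsed directly. This is a rough sketch. It assumes every line of the gzipped output is a JSON object and that timeout records carry the "errors" and "line" keys shown above. The function name is hypothetical.

```python
import gzip
import json


def timed_out_variation_ids(path):
    """Return the variation_id of every record whose task hit the timeout."""
    ids = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for raw in handle:
            if not raw.strip():
                continue  # skip blank lines
            record = json.loads(raw)
            error = record.get("errors") if isinstance(record, dict) else None
            if isinstance(error, str) and "did not complete" in error:
                # "line" holds the original input line as a JSON string.
                original = json.loads(record["line"])
                ids.append(original["variation_id"])
    return ids
```

Called on output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz, it would return the IDs behind the zgrep count shown earlier in the thread.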

clingen-data-model / clinvar-gk-python

Multiprocessing and timeouts #8