Ran the evidence generation on a small set using cProfile. I think the time-consuming part of the iteration is just validating each evidence string, which makes sense as there are no external queries in this part of the pipeline.
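For reference, a profile like the one below can be produced roughly as follows. This is only a sketch: the launch_pipeline arguments and input paths are placeholders, not the real signature.

```python
import cProfile
import pstats

import clinvar_to_evidence_strings  # module from the profile below

# Hypothetical inputs: a small subset of the ClinVar release.
clinvar_xml = 'clinvar_subset.xml.gz'
output_dir = 'out'

# cProfile.run execs its command string, which is why builtins.exec and
# <string>:1(<module>) top the cumulative-time listing below.
cProfile.run(
    'clinvar_to_evidence_strings.launch_pipeline(clinvar_xml, output_dir)',
    'evidence_gen.prof',
)
pstats.Stats('evidence_gen.prof').sort_stats('cumulative').print_stats(20)
```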
93779765 function calls (85532202 primitive calls) in 53.396 seconds

   Ordered by: cumulative time

           ncalls  tottime  percall  cumtime  percall filename:lineno(function)
                1    0.000    0.000   53.414   53.414 {built-in method builtins.exec}
                1    0.001    0.001   53.414   53.414 <string>:1(<module>)
                1    0.000    0.000   53.413   53.413 clinvar_to_evidence_strings.py:125(launch_pipeline)
                1    0.027    0.027   53.400   53.400 clinvar_to_evidence_strings.py:139(clinvar_to_evidence_strings)
             1131    0.007    0.000   50.904    0.045 clinvar_to_evidence_strings.py:111(validate_evidence_string)
             1131    0.009    0.000   50.895    0.045 validators.py:871(validate)
     3360626/2262    9.173    0.000   50.420    0.022 validators.py:296(iter_errors)
    3319938/96243    2.262    0.000   49.885    0.001 validators.py:343(descend)
     714792/58812    3.118    0.000   49.205    0.001 _validators.py:276(properties)
    934931/170375    2.104    0.000   47.595    0.000 _validators.py:252(ref)
             1131    0.003    0.000   45.929    0.041 validators.py:291(check_schema)
   <snip>
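Side note from the profile: check_schema alone accounts for ~46 of the 53 seconds, and it runs once per evidence string (1131 calls). That pattern is what you get from calling jsonschema's module-level validate(), which re-checks the schema on every call; constructing a single validator up front avoids it. A minimal sketch, assuming jsonschema is the validation library in play (the schema path and draft version are assumptions):

```python
import json
import jsonschema

# Hypothetical schema path; the draft version is also an assumption.
with open('opentargets_evidence.schema.json') as schema_file:
    schema = json.load(schema_file)

jsonschema.Draft7Validator.check_schema(schema)  # schema checked once, up front
validator = jsonschema.Draft7Validator(schema)

def validate_evidence_string(evidence_string):
    # iter_errors validates the instance without re-checking the schema
    return list(validator.iter_errors(evidence_string))
```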
Given that, we should be able to run through the data once just to count the records, and then provide ranges in e.g. 10 evenly sized chunks to run the evidence generation in parallel (@tcezard's suggestion; see the sketch below). As long as we don't actually generate evidence strings, iterating through the data each time will have some overhead but should still be fast enough. (And presumably the overhead would be less than that of actually splitting the file.)
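Roughly, the counting-plus-chunking idea could look like this. The range arithmetic is self-contained; iterate_rcv_records and generate_evidence_for_range are hypothetical names standing in for the existing iteration and generation code:

```python
def chunk_ranges(total_records, n_chunks=10):
    """Split [0, total_records) into n_chunks near-equal half-open ranges."""
    base, extra = divmod(total_records, n_chunks)
    ranges, start = [], 0
    for i in range(n_chunks):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges

clinvar_xml = 'ClinVarFullRelease.xml.gz'  # hypothetical path

# Counting pass: cheap, since no evidence strings are generated.
total = sum(1 for _ in iterate_rcv_records(clinvar_xml))

# Each range could become a separate (resumable) Nextflow task that skips
# records outside [start, end) while iterating.
for start, end in chunk_ranges(total, n_chunks=10):
    generate_evidence_for_range(clinvar_xml, start, end)
```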
Evidence string generation time (i.e. without the consequence prediction steps) for the entire ClinVar release is now in excess of 24 hours and can only be expected to increase. Furthermore, when validation errors are found, the process crashes (see here) and needs to be restarted from scratch.

We should try to parallelise or otherwise optimise the evidence generation step. For example, splitting the file by RCV should allow us to leverage Nextflow's parallelism and resumability. We could also consider handling validation errors as is currently done in the PGKB pipeline (see here), which would let us detect all validation errors in a single run while still failing the process at the end; a sketch follows below.
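In that collect-then-fail style, the loop might look something like this. It is only a sketch: generate_evidence_string, validate_evidence_string, and record.accession are hypothetical stand-ins, not the pipeline's actual API:

```python
import logging

logger = logging.getLogger(__name__)

def generate_with_deferred_failure(records):
    invalid_count = 0
    for record in records:
        evidence = generate_evidence_string(record)   # hypothetical
        errors = validate_evidence_string(evidence)   # hypothetical, returns a list
        if errors:
            invalid_count += 1
            for error in errors:
                logger.error('%s: %s', record.accession, error.message)
            continue  # keep going so every validation error gets reported
        yield evidence
    if invalid_count:
        # Crash only once all records have been seen and all errors logged.
        raise RuntimeError(f'{invalid_count} evidence strings failed validation')
```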
We should also look for ways to be more proactive about detecting changes in the data that may require schema or code changes; one possible shape for such a check is sketched below.
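For instance, a pre-flight pass over a new release could flag field values the pipeline has never seen before. This is only an illustration; the field name and the known-value set are made up for the example:

```python
# Known values are illustrative (and elided); in practice they would be
# whatever the evidence generation code currently handles.
KNOWN_REVIEW_STATUSES = {
    'criteria provided, single submitter',
    'reviewed by expert panel',
    # ...
}

def unseen_values(records, field='review_status'):
    """Return field values present in the release but unknown to the pipeline."""
    seen = {getattr(record, field) for record in records}
    return seen - KNOWN_REVIEW_STATUSES
```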