EBIvariation / CMAT

ClinVar Mapping and Annotation Toolkit
Apache License 2.0

Scale up evidence string generation #416

Closed apriltuesday closed 8 months ago

apriltuesday commented 9 months ago

Evidence string generation time (i.e. without the consequence prediction steps) for the entire ClinVar release is now in excess of 24 hours and can only be expected to increase. Furthermore, when validation errors are found, the process crashes (see here) and needs to be restarted from scratch.

We should try to parallelise or otherwise optimise the evidence generation step. For example, splitting the file by RCV should allow us to leverage Nextflow parallelism and resumability. We could also consider handling validation errors as is currently done in the PGKB pipeline (see here), which would let us detect all validation errors in a single run while still crashing the process in the end.
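For illustration, the collect-then-fail error handling could look something like the sketch below (hypothetical `generate_evidence_strings` helper and a `ValueError` stand-in for the real validation exception; the actual PGKB pipeline code differs):

```python
def generate_evidence_strings(records, validate):
    """Generate evidence strings for all records, collecting validation
    errors instead of crashing on the first failure."""
    evidence_strings, errors = [], []
    for i, record in enumerate(records):
        try:
            evidence_strings.append(validate(record))
        except ValueError as e:  # stand-in for the real validation error type
            errors.append((i, str(e)))
    if errors:
        # Report every failure from this run, then crash at the end
        # so the pipeline still registers the batch as failed.
        for i, msg in errors:
            print(f'Record {i}: {msg}')
        raise RuntimeError(f'{len(errors)} validation errors found')
    return evidence_strings
```

This way a single run surfaces all validation errors at once, rather than one per restart.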

We should also look for ways to be more proactive about detecting changes in the data that may require schema or code changes.

apriltuesday commented 8 months ago

Ran the evidence generation on a small set using cProfile; I think the time-consuming part of the iteration is just validating each evidence string. This makes sense, as there are no external queries in this part of the pipeline.

   93779765 function calls (85532202 primitive calls) in 53.396 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   53.414   53.414 {built-in method builtins.exec}
        1    0.001    0.001   53.414   53.414 <string>:1(<module>)
        1    0.000    0.000   53.413   53.413 clinvar_to_evidence_strings.py:125(launch_pipeline)
        1    0.027    0.027   53.400   53.400 clinvar_to_evidence_strings.py:139(clinvar_to_evidence_strings)
     1131    0.007    0.000   50.904    0.045 clinvar_to_evidence_strings.py:111(validate_evidence_string)
     1131    0.009    0.000   50.895    0.045 validators.py:871(validate)
3360626/2262    9.173    0.000   50.420    0.022 validators.py:296(iter_errors)
3319938/96243    2.262    0.000   49.885    0.001 validators.py:343(descend)
714792/58812    3.118    0.000   49.205    0.001 _validators.py:276(properties)
934931/170375    2.104    0.000   47.595    0.000 _validators.py:252(ref)
     1131    0.003    0.000   45.929    0.041 validators.py:291(check_schema)

<snip>
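For reference, output like the above can be reproduced with `cProfile` and `pstats` along these lines (the workload below is a stand-in; the real entry point is `launch_pipeline` in `clinvar_to_evidence_strings.py`):

```python
import cProfile
import pstats

def launch_pipeline():
    # Stand-in workload; replace with the actual pipeline call.
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
launch_pipeline()
profiler.disable()

# Sort by cumulative time, as in the output above.
stats = pstats.Stats(profiler).sort_stats('cumulative')
stats.print_stats(10)  # show the top 10 entries
```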

This means we should be able to run through the file once to count the records, and then provide ranges in e.g. 10 evenly sized chunks to run the evidence generation in parallel (@tcezard's suggestion). As long as we don't actually generate evidence strings, iterating through the data each time will have some overhead but should still be fast enough. (And presumably the overhead would be less than actually splitting the file.)
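Computing those evenly sized ranges is straightforward; a possible helper (hypothetical name, Nextflow wiring not shown) might look like:

```python
def chunk_ranges(total_records, n_chunks):
    """Split total_records into n_chunks contiguous (start, end) ranges,
    end-exclusive and as evenly sized as possible."""
    base, extra = divmod(total_records, n_chunks)
    ranges, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks absorb the remainder, one record each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges
```

Each range could then be passed to a separate Nextflow task, with the process skipping records outside its assigned range.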