eqasim-org / synpp

Synthetic population pipeline code for eqasim
http://www.eqasim.org
MIT License

Avoid monolithic pipeline state file #45

Open sebhoerl opened 4 years ago

sebhoerl commented 4 years ago

Currently, the pipeline.json file containing the state of all stages is one big monolithic file. This causes problems when the pipeline is run multiple times in parallel, for instance with different random seeds: one process may update pipeline.json while another is reading it, which leads to race conditions.

Ideally, meta information about stages could be distributed in the relevant folders of the stages.

ainar commented 1 year ago

I suggest storing the hash digest of the module and all its dependencies, along with the serialized validation output if needed, in the cache file and directory names, as is already done for the configuration hash digest. The devalidation (should we say invalidation?) based on the module hash digest would then be implicit, like the configuration check. As a consequence, the other devalidation steps would no longer be needed, nor would the pipeline.json file, because they are all replaced by a check for an existing cache file.
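To make the naming idea concrete, here is a minimal sketch of how a cache name could combine the configuration digest with a digest of the module source and its dependencies. All function names, the `__` separator, and the 16-character truncation are hypothetical choices for illustration, not synpp's actual implementation.

```python
import hashlib
import json
from typing import Iterable


def config_digest(config: dict) -> str:
    # Hash the configuration in a key-order-independent way.
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def module_digest(source: str, dependency_digests: Iterable[str] = ()) -> str:
    # Hash the stage source together with the digests of its dependencies,
    # so that a change anywhere upstream changes this digest as well.
    h = hashlib.sha256()
    h.update(source.encode("utf-8"))
    for digest in sorted(dependency_digests):
        h.update(digest.encode("utf-8"))
    return h.hexdigest()[:16]


def cache_name(stage_name: str, config: dict, source: str,
               dependency_digests: Iterable[str] = ()) -> str:
    # Hypothetical naming scheme: <stage>__<config digest>__<module digest>
    return "{}__{}__{}".format(
        stage_name,
        config_digest(config),
        module_digest(source, dependency_digests),
    )
```

With such a scheme, editing a stage (or any of its dependencies) produces a different cache name, so the old cache is simply never found and no explicit devalidation pass is required.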

To handle simultaneous runs whose devalidated stages overlap, we could write to a temporary cache file during execution and check, before each stage execution, whether the corresponding cache file already exists (and whether its parent caches are older), to avoid re-running a stage that a simultaneous process has just finished.
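A rough sketch of this check-then-write pattern, assuming pickled stage results and hypothetical helper names (`cache_is_fresh`, `run_stage_once`); it is not synpp code. The result is written to a temporary file in the same directory and then moved into place with `os.replace`, so a concurrent reader never sees a half-written cache.

```python
import os
import pickle
import tempfile


def cache_is_fresh(cache_path, parent_paths=()):
    # A cache counts as fresh if it exists and no parent cache is newer.
    if not os.path.exists(cache_path):
        return False
    cache_mtime = os.path.getmtime(cache_path)
    return all(
        os.path.exists(p) and os.path.getmtime(p) <= cache_mtime
        for p in parent_paths
    )


def run_stage_once(cache_path, compute, parent_paths=()):
    # Returns (result, executed); executed is False when the cache was reused.
    if cache_is_fresh(cache_path, parent_paths):
        with open(cache_path, "rb") as f:
            return pickle.load(f), False

    result = compute()

    # Write to a temporary file first, then rename it atomically.
    directory = os.path.dirname(cache_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(result, f)
        os.replace(tmp_path, cache_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return result, True
```

Note that this is not a lock: two processes starting at the same moment may still both compute the stage. The rename only guarantees that whichever file ends up at `cache_path` is complete.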

What do you think?

sebhoerl commented 1 year ago

Yes, sounds good :)

ainar commented 1 year ago

I forgot about the "info". It could be stored in a separate file for each step in the "cache" folder. I'll try a PoC.