Avoid monolithic pipeline state file

sebhoerl commented 4 years ago

Currently, the pipeline.json file containing the state of all stages is a single monolithic file. This causes problems when one wants to run the pipeline several times in parallel, for instance with different random seeds: one process may update pipeline.json while another one is reading it, which leads to race conditions.

Ideally, meta information about stages could be distributed in the relevant folders of the stages.

ainar commented 1 year ago

I suggest storing the hash digest of the module and all its dependencies (and the serialized validation output, if needed) in the cache file and directory names, as is already done for the configuration hash digest. Devalidation (should we say invalidation?) based on the module hash digest would then be implicit, like the configuration check. Consequently, neither the other devalidation steps nor the pipeline.json file would be needed anymore, because all of them are replaced by a check for an existing cache file:

"Devalidate if parent has been updated": the parent has been updated if its code or configuration has changed, which is already tracked by the existence of the cache file. The only exception is if we manually require a stage (say, stage A) and then, in a second run, require a grand-descendant stage (say, C). To devalidate the stage between A and C, say stage B, we could then check whether the stage B cache is older than the stage A cache. The same check would propagate the devalidation along the chain if there are more stages between A and C.
"Devalidate if parents are not the same anymore": if the list of dependencies is not the same, the module hash digest changes, because it covers the code of all the dependencies.
"Devalidate descendants of devalidated stages": this step does not rely on the metadata, but I wonder if it would still be needed. We would expect the descendants of devalidated stages to have a different configuration and/or code digest, so they would be devalidated by the cache check anyway.

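As a rough illustration of the three checks above, here is a minimal sketch of how a cache file name could carry both digests and how the "older than the parent" rule could be tested. The helper names (module_digest, config_digest, cache_path, is_stale), the ".p" suffix and the layout of the file name are assumptions made for this example, not the existing synpp code.

```python
import hashlib
import json
from pathlib import Path
from typing import Iterable


def module_digest(sources: Iterable[str]) -> str:
    """Hash the source code of a stage and of all its dependencies (hypothetical helper)."""
    h = hashlib.sha256()
    for source in sources:
        h.update(source.encode("utf-8"))
        h.update(b"\0")  # separator, so that ["ab", "c"] and ["a", "bc"] differ
    return h.hexdigest()[:12]


def config_digest(config: dict) -> str:
    """Hash the configuration; keys are sorted so that dict ordering does not matter."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def cache_path(cache_dir, stage: str, config: dict, sources: Iterable[str]) -> Path:
    """Build a cache file name that changes whenever the code or the configuration changes."""
    name = f"{stage}__{config_digest(config)}__{module_digest(sources)}.p"
    return Path(cache_dir) / name


def is_stale(cache_file: Path, parent_files: Iterable[Path]) -> bool:
    """A stage is stale if its cache is missing or older than any existing parent cache."""
    if not cache_file.exists():
        return True
    own_mtime = cache_file.stat().st_mtime
    return any(
        parent.exists() and parent.stat().st_mtime > own_mtime
        for parent in parent_files
    )
```

With such names, a change in the code of any dependency yields a new file name, so the old cache is simply never found again. The modification time comparison is only needed for the "stage A was manually required" case, and applying it stage by stage from A towards C would propagate the devalidation.
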
To handle simultaneous runs whose devalidated stages overlap, we could write to a temporary cache file during execution, and check before each stage execution whether the corresponding cache file already exists (and whether its parent caches are older), to avoid re-running a stage that a simultaneous process has just finished.
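
A minimal sketch of that idea, assuming the stage result is pickled into a single cache file: the result is first written to a temporary file in the same directory and then renamed, so that another process sees either no cache file or a complete one. The names run_stage_once and is_fresh are hypothetical and only serve the example.

```python
import os
import pickle
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def is_fresh(cache_file: Path, parent_files: Iterable[Path]) -> bool:
    """The cache is usable if it exists and no existing parent cache is newer."""
    if not cache_file.exists():
        return False
    own_mtime = cache_file.stat().st_mtime
    return all(
        not parent.exists() or parent.stat().st_mtime <= own_mtime
        for parent in parent_files
    )


def run_stage_once(cache_file: Path, parent_files: Iterable[Path], execute: Callable):
    """Run `execute` unless a fresh cache file is already there (hypothetical helper)."""
    parent_files = list(parent_files)

    # Check right before execution: another process may have just finished this stage.
    if is_fresh(cache_file, parent_files):
        with open(cache_file, "rb") as f:
            return pickle.load(f)

    result = execute()

    # Write to a temporary file in the same directory, then rename it into place.
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=cache_file.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(result, f)
        os.replace(tmp_name, cache_file)
    except BaseException:
        # Do not leave a half-written temporary file behind.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return result
```

Note that this sketch does not prevent two processes from executing the same stage at the same time if both start before either has finished; it only ensures that a reader never sees a partially written cache file and that a finished stage is not run again. On POSIX systems the rename done by os.replace is atomic as long as the temporary file and the target are on the same file system, which is why the temporary file is created next to the target.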

What do you think?

sebhoerl commented 1 year ago

Yes, sounds good :)

ainar commented 1 year ago

I forgot about the "info". It could be stored in a separate file for each step in the "cache" folder. I will try a PoC.
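
For illustration only, a per-step info file could look like the sketch below. It assumes that the info of a step is a JSON-serializable dict and that each step has a unique identifier that is usable as a file name; info_path, write_info, read_info and the ".info.json" suffix are hypothetical.

```python
import json
from pathlib import Path


def info_path(cache_dir, stage_id: str) -> Path:
    """One info file per step, stored next to the cache files (assumed layout)."""
    return Path(cache_dir) / f"{stage_id}.info.json"


def write_info(cache_dir, stage_id: str, info: dict) -> None:
    """Store the info of a single step without touching the files of other steps."""
    path = info_path(cache_dir, stage_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(info, sort_keys=True), encoding="utf-8")


def read_info(cache_dir, stage_id: str) -> dict:
    """Return the stored info of a step, or an empty dict if there is none yet."""
    path = info_path(cache_dir, stage_id)
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))
```

If the step identifier included the configuration digest, two parallel runs with different random seeds would write to different files and would no longer share a single state file.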

eqasim-org / synpp

Avoid monolithic pipeline state file #45