Visualisation of multiple runs of the same workflow with checkpointing

benclifford commented 4 years ago

Is your feature request related to a problem? Please describe.

Sometimes a Parsl Python program is run several times, unchanged and with checkpointing turned on, to drive towards a final, complete set of outputs.

The present visualization code gives little insight into what happened across those runs: it shows each run as a separate workflow run that can be visualised on its own, but it cannot show an integrated view of the whole series.

Describe the solution you'd like

There are different things that could happen here. In many of the time-based plots, it might make sense to concatenate the graphs for each of a series of runs along the x-axis.
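As a rough illustration of the concatenation idea, the sketch below shifts each run's samples so that the runs sit back to back on a single x-axis. The input shape (a mapping from run id to `(timestamp, value)` samples) and the names `concatenate_runs` and `gap` are assumptions made for this example, not part of the existing visualization code.

```python
from typing import Dict, List, Tuple

# Hypothetical input shape: run_id -> list of (timestamp_seconds, value) samples.
Series = List[Tuple[float, float]]


def concatenate_runs(
    runs: Dict[str, Series], gap: float = 0.0
) -> List[Tuple[float, float, str]]:
    """Place runs back to back on one x-axis, ordered by each run's first sample.

    Returns (x, value, run_id) tuples, where x is seconds since the start of
    the first run with the idle time between runs removed.
    """
    # Sort samples within each run, then order runs by their first timestamp.
    ordered = sorted(
        ((run_id, sorted(samples)) for run_id, samples in runs.items() if samples),
        key=lambda item: item[1][0][0],
    )
    combined = []
    offset = 0.0
    for run_id, samples in ordered:
        start = samples[0][0]
        end = samples[-1][0]
        for t, value in samples:
            combined.append((offset + (t - start), value, run_id))
        # The next run starts where this one ended, plus an optional visual gap.
        offset += (end - start) + gap
    return combined
```

The returned tuples keep the run id, so a plot could still colour or label each segment by the run it came from.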

@tomglanzman has experimented a bit with using the task_hashsum field to tie together information about app invocations across multiple runs: for example, given a cached app invocation in the current run, go back through previous runs to find the original execution information. This addresses questions like "give me a histogram of the execution time of the runs of each command", discarding failed attempts and memoized attempts, which take almost no time.
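A minimal sketch of that lookup, assuming the invocation records have already been read out of the monitoring database into memory. Only `task_hashsum` comes from the discussion above; the `Invocation` record, its other fields, and the status labels are hypothetical stand-ins for whatever the monitoring records actually provide.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional


class Invocation(NamedTuple):
    # Only task_hashsum is named in the discussion; the remaining fields are
    # hypothetical and chosen for illustration.
    run_order: int               # position of the run in the series (0 = first)
    func_name: str               # app name
    task_hashsum: Optional[str]  # None when the invocation was not hashed
    status: str                  # assumed labels: "done", "failed", "memoized"
    duration: float              # execution time in seconds


def original_durations(invocations: Iterable[Invocation]) -> Dict[str, List[float]]:
    """Per app, collect one duration per task_hashsum: the earliest real execution."""
    first_real: Dict[str, Invocation] = {}
    for inv in sorted(invocations, key=lambda i: i.run_order):
        if inv.task_hashsum is None or inv.status != "done":
            continue  # skip unhashed, failed and memoized invocations
        # Keep only the first successful execution seen for each hash.
        first_real.setdefault(inv.task_hashsum, inv)

    by_app: Dict[str, List[float]] = defaultdict(list)
    for inv in first_real.values():
        by_app[inv.func_name].append(inv.duration)
    return dict(by_app)
```

Each list in the result could then be passed to a histogram routine, one histogram per app.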

Additional context

This is in the context of visualizing large runs for LSST DESC DM work.

ZhuozhaoLi commented 4 years ago

So for each task in a workflow, if its task_hashsum is not None, we want to trace back through previous runs for its info (e.g., status and ).

I can imagine situations where some of that info is missing: for example, if a user removes the monitoring db but does not remove runinfo, the checkpoints still exist but the monitoring db has no records for them.

TomGlanzman commented 4 years ago

Hi @ZhuozhaoLi. Yes, the user could remove monitoring.db without removing runinfo, and that would be bad. Would it make sense to store monitoring.db inside the runinfo/ directory? That way there would be a better chance of all-or-nothing.
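To make the suggestion concrete, here is a small sketch that assumes the monitoring database location is given as a SQLAlchemy-style SQLite URL. The helper names `monitoring_db_url` and `monitoring_db_missing` are hypothetical and are not Parsl functions; the second helper detects the mismatch described above, where runinfo is present but monitoring.db is not.

```python
from pathlib import Path


def monitoring_db_url(run_dir: str = "runinfo") -> str:
    """SQLite URL (relative path form) for a monitoring.db kept inside run_dir."""
    return f"sqlite:///{Path(run_dir).as_posix()}/monitoring.db"


def monitoring_db_missing(run_dir: str = "runinfo") -> bool:
    """True when run_dir exists (so checkpoints may too) but has no monitoring.db."""
    root = Path(run_dir)
    return root.is_dir() and not (root / "monitoring.db").is_file()
```

With the database under runinfo/, deleting the directory removes checkpoints and monitoring records together, which is the all-or-nothing behaviour asked for here.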

Parsl/parsl issue #1725