aiidateam / aiida-workgraph

Efficiently design and manage flexible workflows with AiiDA, featuring an interactive GUI, checkpoints, provenance tracking, and remote execution capabilities.
https://aiida-workgraph.readthedocs.io/en/latest/
MIT License

Restore non-AiiDA process task from checkpoint #242

Closed superstar54 closed 2 months ago

superstar54 commented 2 months ago

If the daemon stops while a non-AiiDA process task is running, the running task hangs after the daemon is restarted.

Non-AiiDA process tasks include:

superstar54 commented 2 months ago

Submitting a while loop, then stopping and restarting the daemon, produces this error:

2024-08-19 20:12:44 [169487 | REPORT]: [118860|WorkGraphEngine|continue_workgraph]: Continue workgraph.
2024-08-19 20:12:44 [169488 | REPORT]: [118860|WorkGraphEngine|continue_workgraph]: tasks ready to run: while3
2024-08-19 20:12:44 [169489 | REPORT]: [118860|WorkGraphEngine|run_tasks]: Run task: while3, type: WHILE
2024-08-19 20:12:44 [169490 | REPORT]: [118860|WorkGraphEngine|run_tasks]: While Task while3: Condition not fullilled, task finished. Skip all its children.
2024-08-19 20:12:46 [169491 | REPORT]: [118860|WorkGraphEngine|continue_workgraph]: Continue workgraph.
2024-08-19 20:12:46 [169492 | REPORT]: [118860|WorkGraphEngine|continue_workgraph]: tasks ready to run: add12
2024-08-19 20:12:46 [169493 | REPORT]: [118860|WorkGraphEngine|run_tasks]: Run task: add12, type: CALCFUNCTION
2024-08-19 20:12:47 [169494 | REPORT]: [118860|WorkGraphEngine|update_task_state]: Task: add12 finished.
2024-08-19 20:12:47 [169495 | REPORT]: [118860|WorkGraphEngine|update_while_task_state]: Wihle Task while1: this iteration finished. Try to reset for the next iteration.
2024-08-19 20:12:48 [169496 | REPORT]: [118860|WorkGraphEngine|continue_workgraph]: Continue workgraph.
2024-08-19 20:12:48 [169497 | REPORT]: [118860|WorkGraphEngine|continue_workgraph]: tasks ready to run: compare1
2024-08-19 20:12:48 [169498 | REPORT]: [118860|WorkGraphEngine|run_tasks]: Run task: compare1, type: CALCFUNCTION
2024-08-19 20:13:17 [169499 | REPORT]: [118860|WorkGraphEngine|continue_workgraph]: Continue workgraph.
2024-08-19 20:13:18 [169500 | REPORT]: [118860|WorkGraphEngine|on_except]: Traceback (most recent call last):
  File "/home/xing/miniconda3/envs/aiida/lib/python3.11/site-packages/plumpy/process_states.py", line 228, in execute
    result = self.run_fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xing/repos/superstar54/aiida-workgraph/aiida_workgraph/engine/workgraph.py", line 308, in _do_step
    self.continue_workgraph()
  File "/home/xing/repos/superstar54/aiida-workgraph/aiida_workgraph/engine/workgraph.py", line 645, in continue_workgraph
    if ready and self.task_should_run(name):
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xing/repos/superstar54/aiida-workgraph/aiida_workgraph/engine/workgraph.py", line 913, in task_should_run
    index = [i for i, item in enumerate(name_and_uuids) if item[1] == uuid][0]
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
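
The last frame of the traceback indexes `[0]` into a list comprehension that can come back empty. A minimal sketch of that failure mode, assuming `name_and_uuids` is a list of `(name, uuid)` pairs as the `item[1] == uuid` comparison suggests; the helper names and sample data are hypothetical and not taken from aiida-workgraph:

```python
from typing import List, Optional, Tuple

Pairs = List[Tuple[str, str]]


def find_index_strict(name_and_uuids: Pairs, uuid: str) -> int:
    # Same pattern as the line in the traceback: raises IndexError
    # when no entry carries the requested uuid.
    return [i for i, item in enumerate(name_and_uuids) if item[1] == uuid][0]


def find_index_safe(name_and_uuids: Pairs, uuid: str) -> Optional[int]:
    # Hypothetical defensive variant: returns None instead of raising,
    # so the caller can decide how to treat an unknown uuid.
    return next(
        (i for i, item in enumerate(name_and_uuids) if item[1] == uuid),
        None,
    )


# Made-up entries; the uuid looked up below is absent, as it might be
# if the restored state no longer matches the stored list.
entries: Pairs = [("add12", "uuid-a"), ("compare1", "uuid-b")]

strict_error = None
try:
    find_index_strict(entries, "uuid-missing")
except IndexError as exc:
    strict_error = str(exc)

safe_result = find_index_safe(entries, "uuid-missing")
```

Returning `None` only moves the decision to the caller; it does not explain why the uuid is missing after the restart, which is the actual question in this issue.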

superstar54 commented 2 months ago

Here is more about checkpointing and restoring. One cannot restore from the point where the WorkGraph engine failed; instead, we restore from the last checkpoint.

Case 1: A while task starts running and its execution_count is increased by one, then the daemon stops. In the checkpoint, the while task is not yet running and its execution_count has not been increased. When the daemon restarts, the while task is ready to run and runs again, so we don't need to modify the execution_count.
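
Case 1 can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyEngine`, `WhileTaskState`, and the state names `PLANNED` and `RUNNING` are hypothetical stand-ins, not the WorkGraph engine's real classes or checkpoint format.

```python
import copy
from dataclasses import dataclass


@dataclass
class WhileTaskState:
    # Hypothetical state names, chosen for illustration only.
    state: str = "PLANNED"
    execution_count: int = 0


class ToyEngine:
    """Toy stand-in for an engine that persists state only at checkpoints."""

    def __init__(self) -> None:
        self.task = WhileTaskState()
        self._checkpoint = copy.deepcopy(self.task)

    def save_checkpoint(self) -> None:
        # Persist a snapshot of the current in-memory state.
        self._checkpoint = copy.deepcopy(self.task)

    def run_while_task(self) -> None:
        # In-memory update only; nothing is persisted here.
        self.task.state = "RUNNING"
        self.task.execution_count += 1

    def restore(self) -> None:
        # Daemon restart: everything after the last checkpoint is lost.
        self.task = copy.deepcopy(self._checkpoint)


engine = ToyEngine()
engine.save_checkpoint()   # checkpoint taken before the while task runs
engine.run_while_task()    # in memory: RUNNING, execution_count == 1
engine.restore()           # daemon stopped and restarted
restored = (engine.task.state, engine.task.execution_count)

engine.run_while_task()    # the task is ready again and simply reruns
after_rerun = (engine.task.state, engine.task.execution_count)
```

In this model the restored task is back at its checkpointed values, and the rerun brings execution_count to 1 rather than 2, which is why no manual correction of the counter is needed.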