aiidateam / aiida-workgraph

Efficiently design and manage flexible workflows with AiiDA, featuring an interactive GUI, checkpoints, provenance tracking, and remote execution capabilities.
https://aiida-workgraph.readthedocs.io/en/latest/
MIT License

Feature: scheduler #275

Open superstar54 opened 3 months ago

superstar54 commented 3 months ago

Background

When running a workflow (such as a WorkChain or WorkGraph), each workflow is associated with a corresponding process. This process launches and waits for the child processes (e.g., CalcJob processes). In nested workflows like the PwBandsWorkChain, you may encounter multiple Workflow processes in a waiting state, with only one CalcJob process actively running. These waiting Workflow processes can be seen as inefficient resource usage.

In a WorkChain, the workflow logic is encapsulated within the new WorkChain class, making it challenging to eliminate these waiting processes at the moment. However, in a WorkGraph, the logic is more explicitly defined, and it has strict rules on who can execute this logic.

Besides, it is undesirable to run task processes and WorkGraph processes in the same runner.

Proposal

To address this, I propose a Scheduler for WorkGraph in this PR. The Scheduler handles the following:

Let's compare the process count for the PwBands case. Suppose we launch 100 PwBands WorkGraphs:

The benefit is clear: the new approach significantly reduces the number of active processes. Moreover, the Scheduler runs in a separate daemon that does not listen to process launching tasks, thereby eliminating the possibility of deadlocks that could occur with the old approach.

This is also related to these issues:

Note: this scheduler is designed for WorkGraph only. For WorkChain, this will not work.

Usage

https://aiida-workgraph--275.org.readthedocs.build/en/275/howto/scheduler.html

Scheduler

Add a daemon runner for scheduler:

Keep provenance

Use one scheduler process or scale the number of processes when needed.

While a single scheduler suffices for most use cases, adding more schedulers may help when the number of task workers (created by verdi daemon start) grows significantly. As a general rule, keep the ratio below 5 workers per scheduler.
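As a rough illustration of this rule of thumb, the sketch below computes the smallest number of schedulers that keeps the worker-to-scheduler ratio strictly below 5. The function name `schedulers_needed` and its parameters are hypothetical and not part of aiida-workgraph.

```python
def schedulers_needed(n_workers: int, max_ratio: int = 5) -> int:
    """Return the smallest scheduler count s with n_workers / s < max_ratio.

    Hypothetical helper: only encodes the "fewer than 5 workers per
    scheduler" guideline; it does not start or stop any daemon.
    """
    if n_workers < 0:
        raise ValueError("n_workers must be non-negative")
    if max_ratio <= 0:
        raise ValueError("max_ratio must be positive")
    # n_workers / s < max_ratio  <=>  s > n_workers / max_ratio,
    # so the smallest valid integer is floor(n_workers / max_ratio) + 1.
    return n_workers // max_ratio + 1


if __name__ == "__main__":
    for workers in (1, 4, 5, 9, 10, 20):
        print(workers, "workers ->", schedulers_needed(workers), "scheduler(s)")
```

With the default threshold, up to 4 workers need one scheduler, and 5 workers already call for a second one because the ratio must stay strictly below 5.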

Circus

Similar to the worker daemon, we use circus to manage the scheduler daemon.

command

Todo

checkpoint

How do we save the checkpoint? Instead of saving all data every time, it would be better to update only the context related to the affected workgraph.

solution 1

Save the ctx data of a workgraph to the extras of that workgraph.
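A minimal sketch of this idea, using a stand-in object instead of a real AiiDA node so that it runs without a profile. `FakeNode`, `CTX_EXTRA_KEY`, `save_ctx` and `load_ctx` are all hypothetical names. In aiida-core 2.x the same pattern would presumably go through `node.base.extras`, and the sketch assumes that extras must be JSON-serializable.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict

CTX_EXTRA_KEY = "workgraph_ctx"  # hypothetical extras key, not defined by aiida-workgraph


@dataclass
class FakeNode:
    """Stand-in for a WorkGraph process node; it only models the extras dict."""

    pk: int
    extras: Dict[str, Any] = field(default_factory=dict)


def save_ctx(node: FakeNode, ctx: Dict[str, Any]) -> None:
    """Store the context of one workgraph on that workgraph's own node."""
    # Round-trip through JSON so non-serializable values fail early
    # (assumption: extras are stored as JSON).
    node.extras[CTX_EXTRA_KEY] = json.loads(json.dumps(ctx))


def load_ctx(node: FakeNode) -> Dict[str, Any]:
    """Read the context back; an empty dict means no checkpoint was saved."""
    return dict(node.extras.get(CTX_EXTRA_KEY, {}))


if __name__ == "__main__":
    wg_a, wg_b = FakeNode(pk=1), FakeNode(pk=2)
    save_ctx(wg_a, {"finished_tasks": ["scf"], "iteration": 2})
    print(load_ctx(wg_a))  # only workgraph 1 was written
    print(load_ctx(wg_b))  # workgraph 2 is untouched
```

Because each workgraph writes only to its own node, updating one context does not require rewriting the data of the other workgraphs handled by the same scheduler.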

Submit calcfunction

In my tests, a calcfunction can be submitted if it is defined inside a package, because the daemon can then load it back using importlib.import_module. A calcfunction defined on the fly raises an error.
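The difference can be sketched with plain Python functions. This is not the loading code used in the PR: `load_callable` and `is_reloadable` are hypothetical helpers that only show why a function living in an importable module can be recovered from its module and name, while a function created on the fly cannot.

```python
import importlib
import math
from typing import Any, Callable


def load_callable(module_name: str, func_name: str) -> Callable[..., Any]:
    """Import a module by name and fetch one of its top-level attributes."""
    module = importlib.import_module(module_name)
    return getattr(module, func_name)


def is_reloadable(func: Callable[..., Any]) -> bool:
    """Check whether func can be found again from its module and qualified name."""
    try:
        return load_callable(func.__module__, func.__qualname__) is func
    except (ImportError, AttributeError):
        # Module not importable, or the name is not a top-level attribute
        # (e.g. a nested function whose qualname contains "<locals>").
        return False


def make_on_the_fly() -> Callable[[float], float]:
    """Create a function at runtime; it has no importable top-level name."""

    def double(x: float) -> float:
        return 2 * x

    return double


if __name__ == "__main__":
    print(load_callable("math", "sqrt")(9.0))
    print(is_reloadable(math.sqrt))
    print(is_reloadable(make_on_the_fly()))
```

A separate process such as a daemon only receives a reference to the function, so it needs the function to be importable in its own environment in order to run it.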

Other features after this PR

codecov-commenter commented 3 months ago

Codecov Report

Attention: Patch coverage is 16.40431% with 1009 lines in your changes missing coverage. Please review.

Project coverage is 67.30%. Comparing base (5937b88) to head (8299fe5). Report is 60 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| aiida_workgraph/engine/scheduler/scheduler.py | 12.00% | 814 Missing :warning: |
| aiida_workgraph/engine/scheduler/client.py | 25.16% | 116 Missing :warning: |
| aiida_workgraph/engine/override.py | 21.87% | 25 Missing :warning: |
| tests/conftest.py | 22.22% | 14 Missing :warning: |
| aiida_workgraph/tasks/test.py | 23.07% | 10 Missing :warning: |
| aiida_workgraph/engine/utils.py | 61.90% | 8 Missing :warning: |
| aiida_workgraph/workgraph.py | 42.85% | 8 Missing :warning: |
| tests/test_scheduler.py | 52.94% | 8 Missing :warning: |
| aiida_workgraph/utils/control.py | 33.33% | 6 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #275      +/-   ##
==========================================
- Coverage   75.75%   67.30%   -8.45%
==========================================
  Files          70       70
  Lines        4615     6123    +1508
==========================================
+ Hits         3496     4121     +625
- Misses       1119     2002     +883
```

| [Flag](https://app.codecov.io/gh/aiidateam/aiida-workgraph/pull/275/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aiidateam) | Coverage Δ | |
|---|---|---|
| [python-3.11](https://app.codecov.io/gh/aiidateam/aiida-workgraph/pull/275/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aiidateam) | `67.23% <16.40%> (-8.43%)` | :arrow_down: |
| [python-3.12](https://app.codecov.io/gh/aiidateam/aiida-workgraph/pull/275/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aiidateam) | `67.22% <16.40%> (?)` | |
| [python-3.9](https://app.codecov.io/gh/aiidateam/aiida-workgraph/pull/275/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aiidateam) | `67.24% <16.33%> (-8.50%)` | :arrow_down: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aiidateam#carryforward-flags-in-the-pull-request-comment) to find out more.
