Quantum-Accelerators / quacc

quacc is a flexible platform for computational materials science and quantum chemistry that is built for the big data era.
https://quantum-accelerators.github.io/quacc/
BSD 3-Clause "New" or "Revised" License

Support Prefect 3 #2221

Open · Andrew-S-Rosen opened this issue 3 weeks ago

Andrew-S-Rosen commented 3 weeks ago


What is the issue?

When using Prefect 3.x, most dynamic workflows in quacc break.

How can we easily reproduce the issue?

pip install quacc[dev]
pip install prefect==3.0.0rc2

from quacc import flow, job, subflow

@job
def add(a, b):
    return a + b

@job
def make_more(val):
    return [val] * 3

@subflow
def add_distributed(vals, c):
    # Dynamic step: the number of add() calls depends on the runtime length of vals
    return [add(val, c) for val in vals]

@flow
def dynamic_workflow(a, b, c):
    result1 = add(a, b)
    result2 = make_more(result1)
    return add_distributed(result2, c)

# Each entry should resolve to (1 + 2) + 3 = 6
assert [r.result() for r in dynamic_workflow(1, 2, 3)] == [6, 6, 6]

Alternatively, you can run pytest tests/prefect/test_syntax.py to find more examples.

When running the above minimal example, you'll find that it either stalls completely (if run via pytest) or produces an endless loop of errors (if run in an IPython console or REPL).

Any Ideas Why?

The root cause is the following monkeypatch:

https://github.com/Quantum-Accelerators/quacc/blob/797602a16a3685741a8db767f19e150a17360d39/src/quacc/__init__.py#L47-L60

If we comment out PrefectFuture.__getitem__ = _patched_getitem, the minimal example will run to completion, albeit at the expense of other tests failing since we have removed the implicit deferral of __getitem__ calls.
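
For reference, the patch in question is roughly of the following form (a simplified sketch with illustrative helper names, not the exact quacc source):

from prefect.futures import PrefectFuture

from quacc import job

@job
def _resolve_getitem(obj, index):
    # By the time this task body runs, Prefect has resolved the upstream future
    # passed in as obj, so ordinary indexing works here
    return obj[index]

def _patched_getitem(self, index):
    # Defer indexing into a not-yet-resolved future by turning it into a task call
    return _resolve_getitem(self, index)

PrefectFuture.__getitem__ = _patched_getitem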

I believe the problem is likely related to the fact that there are now several new Prefect future classes, such as PrefectConcurrentFuture, among others. However, I tried playing around with them and wasn't able to make much progress in a quick attempt.
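
For anyone picking this up, one (untested) direction would be to apply the patch to the concrete future subclasses as well as the PrefectFuture base class, in case any of them shadow the base-class patch; this is purely a sketch of the idea, reusing the _patched_getitem helper from the sketch above:

from prefect.futures import PrefectFuture

# Patch the base class plus its direct subclasses (e.g., PrefectConcurrentFuture)
# so indexing is deferred regardless of which concrete future type Prefect returns
for cls in [PrefectFuture, *PrefectFuture.__subclasses__()]:
    cls.__getitem__ = _patched_getitem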

Tagging @zulissimeta as a heads-up (and in case you're feeling adventurous...).

Andrew-S-Rosen commented 2 weeks ago

The entire test suite runs to completion with one small change: swapping @job for @task on line 49 below.

https://github.com/Quantum-Accelerators/quacc/blob/56a923af49ca3c4f0ae3e20672de41c3e383c9ff/src/quacc/__init__.py#L44-L56

In other words, it works perfectly fine if we don't .submit() the __getitem__ call to the concurrent task runner, which the @job decorator does automatically. However, this is likely not a mergeable change, because with @task the __getitem__ call will run server-side. That has two potential drawbacks: 1) the server needs the same dependencies as the remote machine; 2) if many __getitem__ calls are made, they could become a bottleneck. I'm not 100% sure either is a problem in practice, though. At least with the SLURM-based DaskTaskRunner, I'm pretty sure the login node and the compute nodes both need the dependencies anyway, and relying on __getitem__ running on the login node isn't a huge issue: if the user decides to submit one Slurm job per task, they wouldn't want such trivial calls to go to Slurm anyway. There is still the question of what happens with a remote server setup, though.
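
For concreteness, the workaround amounts to something like the following (again schematic, with illustrative names, rather than the exact diff):

from prefect import task
from prefect.futures import PrefectFuture

# With quacc's @job, this helper gets .submit()-ed to the concurrent task runner;
# with a plain Prefect @task, it runs inline in the flow process ("server-side")
@task
def _resolve_getitem(obj, index):
    return obj[index]

def _patched_getitem(self, index):
    return _resolve_getitem(self, index)

PrefectFuture.__getitem__ = _patched_getitem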