MolSSI / QCFractal

A distributed compute and database platform for quantum chemistry.
https://molssi.github.io/QCFractal/
BSD 3-Clause "New" or "Revised" License

In-server, policy-based error cycling #682

Closed (dotsdl closed this issue 12 months ago)

dotsdl commented 3 years ago

Is your feature request related to a problem? Please describe.

Errors, both systematic and random, are an inevitable reality of distributed computing, and QCFractal's current mechanism for handling them relies on a human touchpoint: a user with task management permissions can restart errored tasks using FractalClient.modify_tasks(operation="restart", base_result='<record_id>').

However, when managing many collections with many different error modes across a variety of compute resources, handling errors effectively by hand becomes difficult. External automation, such as the "Error Cycling" automation used by OpenFF, has the drawback of needing to wrap failures in extensive external logic to handle them gracefully, and any logic that requires state (e.g. a fixed number of retries for particular error types) needs external state storage of its own.

Describe the solution you'd like

Instead of requiring human intervention or extensive wrapping automation for error cycling, one alternative is to make error cycling a feature of the server itself. Policies for what errors to restart, how often, and how many times could be a feature of Collections, and could be mutable by task administrators. Restarts would be fast, would require zero API communication (saving web server resources for actual data requests by users), and would require no external logic in principle.
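As a rough illustration of what such a per-collection policy might look like, here is a minimal Python sketch. Everything in it is hypothetical: RestartPolicy, its fields, should_restart, and the error-type strings are names invented for this example and are not part of QCFractal's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class RestartPolicy:
    """Hypothetical per-collection restart policy (not a QCFractal class)."""

    error_types: FrozenSet[str]  # which error categories may be restarted
    min_interval: timedelta      # how often: minimum wait between restarts
    max_restarts: int            # how many times a task may be restarted

    def should_restart(
        self,
        error_type: str,
        restart_count: int,
        last_restart: Optional[datetime],
        now: datetime,
    ) -> bool:
        """Return True if the server could restart this task automatically."""
        if error_type not in self.error_types:
            return False  # this kind of error is not covered by the policy
        if restart_count >= self.max_restarts:
            return False  # restart budget exhausted
        if last_restart is not None and now - last_restart < self.min_interval:
            return False  # restarted too recently; wait for the next cycle
        return True


# Example: restart "random_error" failures at most 3 times, 10 minutes apart.
policy = RestartPolicy(
    error_types=frozenset({"random_error"}),
    min_interval=timedelta(minutes=10),
    max_restarts=3,
)
```

Because the restart count and the last restart time would live alongside the task record on the server, a check like this needs no external state storage.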

To avoid abuse, a default policy for restart frequency and count could be applied to all tasks, with an upper limit for each set in the server config. Tasks that have exceeded their restart count could still be restarted manually with the client, as before. This scheme would mean the human touchpoint needs to be exercised far less often.
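The default-plus-cap idea can be sketched as a simple clamp. This is only an illustration under assumed names: RestartLimits, DEFAULT_LIMITS, SERVER_CAP, effective_limits, and needs_manual_restart are hypothetical, and the numeric values are placeholders rather than suggested settings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RestartLimits:
    """Hypothetical restart limits (not a QCFractal class)."""

    max_restarts: int          # restart count limit
    min_interval_seconds: int  # restart frequency, as a minimum wait in seconds


# Placeholder values; in this sketch the cap would come from the server config.
DEFAULT_LIMITS = RestartLimits(max_restarts=2, min_interval_seconds=300)
SERVER_CAP = RestartLimits(max_restarts=5, min_interval_seconds=60)


def effective_limits(
    requested: Optional[RestartLimits] = None,
    default: RestartLimits = DEFAULT_LIMITS,
    cap: RestartLimits = SERVER_CAP,
) -> RestartLimits:
    """Use the default when no policy is set, and never exceed the server cap."""
    chosen = requested if requested is not None else default
    return RestartLimits(
        # Count is capped from above (and never negative).
        max_restarts=max(0, min(chosen.max_restarts, cap.max_restarts)),
        # An upper limit on frequency is a lower bound on the wait interval.
        min_interval_seconds=max(chosen.min_interval_seconds, cap.min_interval_seconds),
    )


def needs_manual_restart(restart_count: int, limits: RestartLimits) -> bool:
    """True once automatic restarts are exhausted and a human must step in."""
    return restart_count >= limits.max_restarts
```

A collection that asks for more than the server allows is silently clamped here; rejecting such a request with an error would be an equally reasonable design.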

bennybp commented 12 months ago

Basically implemented! There is always room for improvement, and it would benefit from better error categorization in qcengine, but the basics are there.

See https://github.com/MolSSI/QCFractal/blob/main/qcfractal/qcfractal/components/tasks/reset_logic.py