OpenFreeEnergy / alchemiscale-fah

protocols and compute service for using alchemiscale with Folding@Home
MIT License

Add stop mechanism when `FahAsynchronousComputeService` exhausts all RUNs in a PROJECT, or all CLONEs in a RUN #15


dotsdl commented 2 months ago

We currently don't have an explicit stop mechanism in place for the `FahAsynchronousComputeService` for when it exhausts all $2^{16}$ RUNs in a PROJECT, or all $2^{16}$ CLONEs in a RUN. It remains an assumption that the work server will refuse to create a new RUN or CLONE if it cannot, but I suspect this is not something we should depend on. As it stands, the behavior of `FahAsynchronousComputeService` under these conditions is undefined.
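For a sense of scale, the limits above work out to:

$$2^{16} = 65{,}536 \ \text{RUNs per PROJECT}, \qquad 2^{16} \times 2^{16} = 2^{32} \approx 4.3 \times 10^{9} \ \text{CLONEs per PROJECT}$$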

Possible behaviors we could implement for the FahAsynchronousComputeService in the case of any exhaustion include:

  1. The entire service could halt, indicating that it cannot continue with its current set of PROJECTs as configured, requiring administrator intervention.
  2. A cascading approach to using what it can, up to a limit (a rough sketch is included below):
    • If the CLONEs within a RUN are exhausted, the service could create a new RUN in the same PROJECT and start populating it. This complicates the current model of a RUN corresponding to a Transformation, since that mapping would no longer be one-to-one but potentially many-to-one, requiring changes to how the service maintains its index.
    • If the RUNs within a PROJECT are exhausted, the next-closest PROJECT with a configuration suitable for the given Task could be used. This has the downside that, over time, the points offered by a PROJECT will drift from the effort required for the Tasks it services, with the variance in effort across the remaining PROJECTs widening until they are all exhausted.
    • If all PROJECTs configured for the service have exhausted their RUNs, the service should halt, indicating that it cannot continue, requiring administrator intervention.
  3. ...

There may be additional alternatives.
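To make option 2 concrete, here is a minimal sketch of the cascading fallback. Everything here (the `projects` structure, `allocate_clone`, the exception name) is a hypothetical stand-in for illustration, not the service's actual internals:

```python
# Sketch of the option 2 cascade; data structures and names are hypothetical.
MAX_RUNS_PER_PROJECT = 2**16
MAX_CLONES_PER_RUN = 2**16


class AllProjectsExhausted(Exception):
    """No configured PROJECT can accept more work; the service should halt."""


def allocate_clone(projects: list[dict]) -> tuple[int, int, int]:
    """Return a (project_id, run_index, clone_index) slot for a new Task.

    ``projects`` is assumed ordered by suitability for the Task; each entry
    looks like ``{"id": ..., "clone_counts": [c0, c1, ...]}``, where
    ``clone_counts[i]`` is the number of CLONEs already created in RUN ``i``.
    """
    for project in projects:
        counts = project["clone_counts"]
        # Reuse the newest RUN while it still has CLONE indices free.
        if counts and counts[-1] < MAX_CLONES_PER_RUN:
            counts[-1] += 1
            return project["id"], len(counts) - 1, counts[-1] - 1
        # CLONEs exhausted (or no RUN yet): open a new RUN in this PROJECT.
        # This is where the one-to-one RUN <-> Transformation mapping breaks.
        if len(counts) < MAX_RUNS_PER_PROJECT:
            counts.append(1)
            return project["id"], len(counts) - 1, 0
        # RUN indices exhausted too: cascade to the next-closest PROJECT.
    raise AllProjectsExhausted("administrator intervention required")
```

The intent is just to show the order of fallbacks: new CLONE, then new RUN, then the next PROJECT, with a hard failure only once everything is exhausted.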

@jchodera, @sukritsingh, @jcoffland: do you have insights as to what may be most appropriate here, or ideas for a third alternative?

sukritsingh commented 2 months ago

I'm a fan of "simpler is better" with stuff like this - some of my initial thoughts below:

  1. $2^{16}$ is a massive number of RUNs or CLONEs for any single project. Assuming each RUN is a unique Transformation, do we foresee this becoming an issue rapidly (i.e., within a few months of deployment)?
  2. I think moving away from the one-to-one mapping of each RUN corresponding to a Transformation has the potential to introduce a lot of confusion for a user, so I'd want to see more detail on it before being convinced about that as a viable option.
  3. Migrating RUNs between PROJECTs with different point calculations is likely to draw complaints from testers about inconsistent effort for the same project ID; I'd rather not deal with that kind of complaint, and I'm sure others feel the same, so I'd want to avoid it as much as possible.
  4. What about just migrating to a new project ID with the same point value? I suppose automatically creating new project IDs could be dangerous, so maybe the safe move here is to "halt" the service until an administrator gets involved. $2^{16}$ is such a large number that I think it'd be good to know how often an administrator would need to spin up a new project....
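As a rough back-of-envelope on that last question (assuming, hypothetically, one RUN per Transformation and a steady submission rate, neither of which is measured from a real deployment): a PROJECT holds $2^{16} = 65{,}536$ RUNs, so at 1,000 new Transformations per day it would be exhausted in about 65 days, while at 100 per day it would last nearly two years.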

My brain is pulled in a few directions by faculty applications and other writing tasks right now, so I will percolate on this further!

dotsdl commented 1 week ago

I've added hard-stop guardrails (option 1) in 847decb. This should at least avoid potential disaster, and will allow us to explore more sophisticated solutions later.
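For readers following along, the general shape of such a guardrail might look like the following. This is only a sketch of the idea under stated assumptions, not the actual change in 847decb; the function and exception names are hypothetical:

```python
# Sketch of an option 1 hard stop; names are illustrative, not from 847decb.
MAX_INDEX = 2**16  # RUN indices within a PROJECT, CLONE indices within a RUN


class IndexSpaceExhausted(Exception):
    """Raised before requesting an out-of-range RUN or CLONE from the work server."""


def guard_next_indices(next_run_index: int, next_clone_index: int) -> None:
    """Halt the service rather than rely on the work server refusing the request."""
    if next_run_index >= MAX_INDEX:
        raise IndexSpaceExhausted(
            "PROJECT has no RUN indices left; administrator intervention required"
        )
    if next_clone_index >= MAX_INDEX:
        raise IndexSpaceExhausted(
            "RUN has no CLONE indices left; administrator intervention required"
        )
```

Checking the index space on the service side, before any request is made, is what removes the dependence on the work server's refusal behavior discussed above.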