OpenAssetIO / OpenAssetIO

An open-source interoperability standard for tools and content management systems used in media production.
Apache License 2.0
286 stars 31 forks source link

"Fail fast" `BatchElementError` behaviour #859

Open feltech opened 1 year ago

feltech commented 1 year ago

What

Following on from #848, design the behaviour around aborting when encountering a BatchElementError - see https://github.com/OpenAssetIO/OpenAssetIO/pull/827

Why

This may not be desirable by hosts, so a "fail fast" option should be added, probably in the Context, which will modify this behaviour. In particular, the variant-return behaviour should be tweakable. The throwing signatures may also need to change?

Acceptance Criteria

Notes

feltech commented 1 year ago

When I wrote this I had totally forgot about #827. Perhaps this issue should be closed pending that?

foundrytom commented 1 year ago

Thats just a DR though, so will be closed once we ratify it - so we can use this to track the work to be done in the conveniences.

foundrytom commented 11 months ago

We previously wrote a DR, but we have since moved away from using the Context as a dumping ground for options that affect specific API calls. Instead, reserving it for "things that categorize the workflow the context represents" (https://github.com/OpenAssetIO/OpenAssetIO/issues/1054). As such its conclusion may no longer be valid:

DR016 Hinting batch failure strategy

Background

OpenAssetIO is a batch-first API. Presently the manager is required to provide a result for each element of the batch - the data, or an error.

Results are provided by an element-wise callback mechanism, regardless of how data is retrieved internally.

There are many scenarios in which a host may treat a failure for any element in the batch as a failure of the whole request. For example, when determining the inputs to a render.

Requiring the manager to continue to process remaining elements can result in significant redundant work to being performed, potentially impacting service availability.

At scale, hinting to the manager that a batch is permitted to "fail fast" provides the opportunity to reduce pressure on back-end systems.

This failure mode can not be assumed, as the inverse scenario where individual element failures are expected also exist. When querying the availability of an entity in different scopes, for example.

As the behavour is determined by the host, we need a mechanism to express the desired operational mode as part of an API request.

Relevant data

The scenario I have in mind is a resolver daemon that's implemented as a shared-nothing cluster of processes that each cache positive lookups... cache misses result in a query against a canonical database.

If I send a render to the farm and 500 Nuke processes start up concurrently and each send a batch request of 1000 URIs to the resolver daemon, it might be useful to say - if you fail to resolve a URI, don't bother with the others because either

  • I already know they'll fail too.
  • I can't do useful work unless I get 100% of my resolves back.

Alternatively my Nuke render might DDOS my database, because we'll get 500 * 1000 resolves hitting it concurrently. Admittedly those numbers would have to be higher to be truly worrying but renderfarms gonna renderfarm. It's the kind of thing that might happen when you send a job to the wrong datacenter and your render says "give me paths local to this datacenter for these 1000 assets" and they haven't been synced there yet.

Options considered

Option 1 - No change

Do nothing, as it's not a significant enough concern.

Pros

Cons

Option 2 - Add parameter

Add parameter to relevant methods bool abortOnBatchElementError. See the appendix for a description of its effect.

Pros

Cons

Option 3 - Extend Context

Add a new field to the Context struct bool abortOnBatchElementError. See the appendix for a description of its effect.

Pros

Cons

Outcome

We will adopt Option 3, and extend the apiComplianceSuite to ensure that manger implementations correctly satisfy the resulting API contract.

Appendix - abortOnBatchElementError

When true, the manager should abort at the first element error.

However, any given manager implementation may, or may not be able to batch queries to its back-end. The batch size to the back end may also not match the request batch size. The upshot of this is that when an element error is encountered, there may already be additional sucessful elements already processed.

For reasons of flexibility and performance, The manager is not required to call callbacks in element order.

So what does 'abort' mean in terms of callbacks? The abort mechanism is entirely motivated by peformance and the reduction of redundant work. So, we define the required manager behaviour as:

[^1]: Note that a proper transaction mechanism is scheduled for future addition to the API.

elliotcmorris commented 6 months ago

Currently, we believe a "skipped" BatchElementError will be zero initialized, which is a problem as it'll fall through a switch statement of all the batch element errors as there is no 0 value. Adds more necessity to this.