Open feltech opened 1 year ago
When I wrote this I had totally forgot about #827. Perhaps this issue should be closed pending that?
That's just a DR though, so it will be closed once we ratify it. We can use this issue to track the work to be done in the conveniences.
We previously wrote a DR, but we have since moved away from using the Context as a dumping ground for options that affect specific API calls, instead reserving it for "things that categorize the workflow the context represents" (https://github.com/OpenAssetIO/OpenAssetIO/issues/1054). As such, its conclusion may no longer be valid:
OpenAssetIO is a batch-first API. Presently the manager is required to provide a result for each element of the batch - the data, or an error.
Results are provided by an element-wise callback mechanism, regardless of how data is retrieved internally.
There are many scenarios in which a host may treat a failure for any element in the batch as a failure of the whole request. For example, when determining the inputs to a render.
Requiring the manager to continue processing the remaining elements can result in significant redundant work being performed, potentially impacting service availability.
At scale, hinting to the manager that a batch is permitted to "fail fast" provides the opportunity to reduce pressure on back-end systems.
This failure mode cannot be assumed, as the inverse scenario, where individual element failures are expected, also exists. When querying the availability of an entity in different scopes, for example.
As the behaviour is determined by the host, we need a mechanism to express the desired operational mode as part of an API request.
The scenario I have in mind is a resolver daemon that's implemented as a shared-nothing cluster of processes that each cache positive lookups... cache misses result in a query against a canonical database.
If I send a render to the farm and 500 Nuke processes start up concurrently and each send a batch request of 1000 URIs to the resolver daemon, it might be useful to say - if you fail to resolve a URI, don't bother with the others because either
- I already know they'll fail too.
- I can't do useful work unless I get 100% of my resolves back.
Alternatively my Nuke render might DDOS my database, because we'll get 500 * 1000 resolves hitting it concurrently. Admittedly those numbers would have to be higher to be truly worrying but renderfarms gonna renderfarm. It's the kind of thing that might happen when you send a job to the wrong datacenter and your render says "give me paths local to this datacenter for these 1000 assets" and they haven't been synced there yet.
All OpenAssetIO requests are stateless, and parameterised with all the data required to service the request. This includes a description of the calling environment via the Context object. The Context contains fields that describe the nature of the host's intent, including access (read or write) and the lifetime of query responses (transient, persisted, etc).
The manager must consider the host intent defined by the Context and adjust its behaviour accordingly (e.g. erroring write access to a read-only entity).
The manager is not required to invoke the result callbacks in the same order as the input data. It is only required to make a callback for each element before the method returns.
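To make the callback contract concrete, here is a minimal sketch of an element-wise batch call. Everything in it (`resolve_batch`, `resolve_all`, `FAKE_DB`, the callback signatures) is a hypothetical stand-in for illustration, not the OpenAssetIO API. It assumes each callback carries the element's input index, which is what lets the host reassemble results when the manager reports them out of order.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-ins for illustration; not the OpenAssetIO API.
SuccessCallback = Callable[[int, str], None]
ErrorCallback = Callable[[int, str], None]

# Toy back end mapping references to paths.
FAKE_DB: Dict[str, str] = {"asset://a": "/mnt/a.exr", "asset://c": "/mnt/c.exr"}


def resolve_batch(refs: List[str], on_success: SuccessCallback,
                  on_error: ErrorCallback) -> None:
    """Invoke exactly one callback per element before returning."""
    # Iterate in reverse to emphasise that callback order need not
    # match input order.
    for index in reversed(range(len(refs))):
        path = FAKE_DB.get(refs[index])
        if path is None:
            on_error(index, f"unknown entity: {refs[index]}")
        else:
            on_success(index, path)


def resolve_all(refs: List[str]) -> List[Optional[Tuple[str, str]]]:
    """Host-side helper: slot each result back into input order by index."""
    results: List[Optional[Tuple[str, str]]] = [None] * len(refs)

    def on_success(index: int, path: str) -> None:
        results[index] = ("ok", path)

    def on_error(index: int, message: str) -> None:
        results[index] = ("error", message)

    resolve_batch(refs, on_success, on_error)
    return results
```

Because every element gets a callback, `resolve_all` never returns a `None` slot in this sketch. That guarantee is exactly what a fail-fast mode would relax.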
The `openassetio.test.manager` harness provides an `apiComplianceSuite` test framework that helps plugin developers ensure that they fulfill the API contract, and its edge cases.
Do nothing, as it's not a significant enough concern.
Add parameter to relevant methods `bool abortOnBatchElementError`. See the appendix for a description of its effect.
Add a new field to the Context struct `bool abortOnBatchElementError`. See the appendix for a description of its effect.
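As a rough illustration of how the last two options differ in API shape, the sketch below places the flag either on the call or on the Context. The `Context` dataclass, its `access` field, and both `resolve_*` functions are hypothetical names invented for this example. The functions only return the flag value they would act on, to show where the setting travels.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Context:
    """Hypothetical, heavily reduced stand-in for the Context struct."""
    access: str = "read"
    # Only used by the Context-field option.
    abort_on_batch_element_error: bool = False


def resolve_with_parameter(refs: List[str], context: Context,
                           abort_on_batch_element_error: bool = False) -> bool:
    """Per-method parameter: the flag is chosen per call."""
    return abort_on_batch_element_error


def resolve_with_context_field(refs: List[str], context: Context) -> bool:
    """Context field: the flag follows the Context into every call using it."""
    return context.abort_on_batch_element_error
```

With the per-method parameter, each call site decides independently. With the Context field, one setting applies to every call made with that Context, which is the coupling the later comment about the Context being a "dumping ground" is concerned with.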
We will adopt Option 3, and extend the `apiComplianceSuite` to ensure that manager implementations correctly satisfy the resulting API contract.
abortOnBatchElementError
When `true`, the manager should abort at the first element error.
However, any given manager implementation may or may not be able to batch queries to its back-end. The batch size to the back-end may also not match the request batch size. The upshot of this is that when an element error is encountered, additional successful elements may already have been processed.
For reasons of flexibility and performance, the manager is not required to call callbacks in element order.
So what does 'abort' mean in terms of callbacks? The abort mechanism is entirely motivated by performance and the reduction of redundant work. So, we define the required manager behaviour as:
For read operations, no more callbacks should be made after the first element error is encountered, regardless of any other available success elements that have not yet had their callback invoked. This ensures that a minimum number of callbacks are made, and provides the simplest code path for the manager's implementation.
For write operations, all successfully processed elements should have their callback invoked before the first error element callback is made. After this, no further callbacks should be made. This attempts to minimize the possibility of dangling handles for new entities[^1], whilst still avoiding unnecessary work in the back-end. We believe this concern justifies the additional callback ordering overhead in the manager's implementation.
[^1]: Note that a proper transaction mechanism is scheduled for future addition to the API.
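The read and write rules above can be sketched with a toy manager that talks to its back-end in chunks. This is only an illustration under stated assumptions: `process_batch`, `run`, `VALID_REFS` and the chunking scheme are all hypothetical, and it assumes the first error itself is still reported through the error callback before the manager stops.

```python
from typing import Callable, List, Set, Tuple

# Hypothetical toy back end: the references it can process successfully.
VALID_REFS: Set[str] = {"a", "b", "d", "e"}

Callback = Callable[[int], None]


def process_batch(refs: List[str], on_success: Callback, on_error: Callback, *,
                  abort_on_error: bool, is_write: bool,
                  chunk_size: int = 2) -> None:
    for start in range(0, len(refs), chunk_size):
        indices = range(start, min(start + chunk_size, len(refs)))
        # Simulated back-end round trip: the whole chunk is processed at
        # once, so successes may already exist when an error is found.
        succeeded = [i for i in indices if refs[i] in VALID_REFS]
        failed = [i for i in indices if refs[i] not in VALID_REFS]
        if abort_on_error and failed:
            if is_write:
                # Write: report everything already processed successfully
                # before the error, to avoid dangling new entities.
                for i in succeeded:
                    on_success(i)
            # Read: pending successes in this chunk are dropped.
            on_error(failed[0])
            return  # No further callbacks or back-end work.
        for i in indices:
            if refs[i] in VALID_REFS:
                on_success(i)
            else:
                on_error(i)


def run(refs: List[str], *, abort_on_error: bool,
        is_write: bool) -> List[Tuple[str, int]]:
    """Record callbacks in the order they were invoked."""
    events: List[Tuple[str, int]] = []
    process_batch(refs,
                  lambda i: events.append(("ok", i)),
                  lambda i: events.append(("error", i)),
                  abort_on_error=abort_on_error, is_write=is_write)
    return events
```

For the input `["a", "b", "missing", "d", "e"]`, element 3 shares a chunk with the failing element 2. In read mode its success is never reported; in write mode it is reported before the error. In both modes element 4 is never sent to the back-end, which is where the saving comes from.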
Currently, we believe a "skipped" BatchElementError will be zero initialized, which is a problem as it'll fall through a switch statement of all the batch element errors as there is no 0 value. Adds more necessity to this.
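A small sketch of the fall-through concern, transposed to Python for brevity. The `ErrorCode` enum and its numeric values are hypothetical and chosen only so that no member equals 0; `describe` mimics a switch over the known codes with no default case.

```python
from enum import IntEnum


class ErrorCode(IntEnum):
    """Hypothetical error codes; note that no member has the value 0."""
    UNKNOWN = 128
    INVALID_ENTITY_REFERENCE = 129


def describe(raw_code: int) -> str:
    """Mimic a switch over all known codes with no default branch."""
    if raw_code == ErrorCode.UNKNOWN:
        return "unknown error"
    if raw_code == ErrorCode.INVALID_ENTITY_REFERENCE:
        return "invalid entity reference"
    # A zero-initialized "skipped" element ends up here unnoticed.
    return "unhandled"
```

Here `describe(0)` returns `"unhandled"`: a zero code matches no known case, so host code that only handles the enumerated errors would silently ignore skipped elements.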
What
Following on from #848, design the behaviour around aborting when encountering a `BatchElementError` - see https://github.com/OpenAssetIO/OpenAssetIO/pull/827
Why
`Manager` member functions that throw an exception on encountering a `BatchElementError` currently throw inside the `BatchElementErrorCallback` instance, propagating to the `ManagerInterface` that called it, then back out to the `Manager` and through to the host.
`variant`-return signatures currently gather all `BatchElementError`s, rather than abort on the first encountered.
This may not be desirable by hosts, so a "fail fast" option should be added, probably in the `Context`, which will modify this behaviour. In particular, the `variant`-return behaviour should be tweakable. The throwing signatures may also need to change?
Acceptance Criteria
Notes
- `BatchElementError` default construction have a `kSkipped` (or similar) status.
- `BatchElementError` the default type in `variant` returns.
- `std::vector<std::variant<BatchElementError, ...>>` will be a vector of "skipped" states, to be replaced if possible with "real" values.
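A minimal sketch of the "skipped by default" idea in these notes, written in Python rather than C++ for brevity. The `ErrorCode` values, the `BatchElementError` dataclass and `resolve_fail_fast` are hypothetical illustrations, with a `Union` standing in for the `std::variant`.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, List, Union


class ErrorCode(IntEnum):
    """Hypothetical codes; SKIPPED is the default-constructed state."""
    SKIPPED = 0
    UNKNOWN = 128


@dataclass
class BatchElementError:
    """Hypothetical stand-in whose default construction means "skipped"."""
    code: ErrorCode = ErrorCode.SKIPPED
    message: str = ""


# Stand-in for std::variant<BatchElementError, ...>.
Result = Union[BatchElementError, str]


def resolve_fail_fast(refs: List[str], known: Dict[str, str]) -> List[Result]:
    # Start with every slot "skipped", then replace with real values.
    results: List[Result] = [BatchElementError() for _ in refs]
    for index, ref in enumerate(refs):
        if ref not in known:
            results[index] = BatchElementError(
                ErrorCode.UNKNOWN, f"cannot resolve {ref}")
            break  # Fail fast: later slots keep their skipped state.
        results[index] = known[ref]
    return results
```

With this shape the host can tell three outcomes apart per element: a real value, a real error, and an element that was never attempted because the batch aborted.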