Open johann-petrak opened 6 years ago
While you can put these kinds of guarantees in something like GCP, where we know all the duplicates are going to be created in one go up front, I think it would be a mistake to mandate them as part of the ControllerAwarePR interface. Consider situations like a web app with a pool of controllers that grows and shrinks dynamically to match the server load. I suppose they can be documented as "best practice" at least.
Alternatively, it may be that what you're trying to achieve here would be better modelled as an OutputHandler rather than as a PR in the pipeline, as an output handler is a singleton by definition and does have a well-defined lifecycle of init called in one thread, then lots of outputDocument calls in the parallel processing threads, then one close call in a single thread at the end.
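To make that lifecycle concrete, here is a minimal sketch in plain Java. The interface `SketchOutputHandler` and the classes `CountingHandler` and `HandlerLifecycleDemo` are hypothetical stand-ins made up for illustration, not the real GCP OutputHandler API; only the three phases (one init, many concurrent outputDocument calls, one close) are taken from the description above.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an output handler; NOT the real GCP interface.
interface SketchOutputHandler {
  void init();                      // called once, in a single thread
  void outputDocument(String doc);  // called concurrently by worker threads
  void close();                     // called once, after all workers are done
}

// Accumulates a whole-corpus statistic (total character count) across threads.
class CountingHandler implements SketchOutputHandler {
  private final AtomicInteger totalChars = new AtomicInteger();
  private volatile boolean open = false;
  private int result = -1;

  @Override public void init() { totalChars.set(0); open = true; }

  @Override public void outputDocument(String doc) {
    if (!open) throw new IllegalStateException("handler not initialised");
    totalChars.addAndGet(doc.length());  // thread-safe accumulation
  }

  @Override public void close() { open = false; result = totalChars.get(); }

  int result() { return result; }
}

class HandlerLifecycleDemo {
  // Drives the lifecycle: init, parallel outputDocument calls, then close.
  static int run(List<String> docs, int threads) {
    CountingHandler handler = new CountingHandler();
    handler.init();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (String doc : docs) {
      pool.submit(() -> handler.outputDocument(doc));
    }
    pool.shutdown();
    try {
      if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
        throw new IllegalStateException("workers did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for workers", e);
    }
    handler.close();  // single-threaded again at this point
    return handler.result();
  }

  public static void main(String[] args) {
    System.out.println(run(List.of("one", "two", "three"), 2));
  }
}
```

Because there is exactly one handler instance, the only concurrency concern is inside outputDocument, which is why an atomic counter is enough here.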
This is mainly about processing patterns that require processing over the whole corpus AND should work properly with GCP or other tools that use duplication and concurrent processing. Examples are the CorpusStats plugin (similar to TermRaider, but able to run with GCP) and the LearningFramework (where training in most cases happens after all documents have been processed).
With those, it would be good to have some guarantee or at least convention about what can or cannot run in parallel when the controller callbacks are invoked and maybe also some predictable behaviour about when those callbacks are invoked.
But I can see your point about having this going on in a web/REST server with dynamic allocation of controllers. How do the services currently handle this? Now that it is possible to deliberately suppress the controller started/finished callbacks when execute() is invoked and instead use the invokeControllerExecutionXXX methods when wanted, how does the service do this?
The reason I thought some guarantees or conventions would be good is that it is not easy to come up with a generic pattern that works correctly no matter what kind of concurrency and ordering occurs among the callbacks and execute calls. But maybe it is unavoidable.
BTW, the output-handler approach is, I think, not usable for most of these situations above, because they all process and hence need access to data that got collected by the PR (collectively by all duplicates, using some shared data structure), while often ignoring the documents themselves.
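A rough sketch of the "shared data structure" pattern, in plain Java. Everything here (`SharedCorpusState`, `CountingPr`, `SharedStateDemo`, and the method names `started`, `execute`, `finished`) is hypothetical and only mimics the shape of the callbacks under discussion; it is not the GATE ControllerAwarePR interface. The whole-corpus step is run by whichever instance finishes last, so it does not depend on the order of the finished callbacks. It does, however, assume that every started callback has been delivered before the first finished callback arrives, which is exactly the kind of guarantee being asked about.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;

// State shared by a template PR and all of its duplicates (hypothetical).
class SharedCorpusState {
  final Map<String, LongAdder> termCounts = new ConcurrentHashMap<>();
  final AtomicInteger running = new AtomicInteger();  // started but not yet finished
  volatile Map<String, Long> finalCounts;             // set by the last instance to finish
}

// Hypothetical PR-like class; NOT the GATE ControllerAwarePR interface.
class CountingPr {
  private final SharedCorpusState shared;

  CountingPr(SharedCorpusState shared) { this.shared = shared; }

  // A duplicate simply receives the same shared-state reference.
  CountingPr duplicate() { return new CountingPr(shared); }

  void started() { shared.running.incrementAndGet(); }

  // Per-document work; safe to call from several threads at once.
  void execute(String document) {
    for (String token : document.split("\\s+")) {
      if (!token.isEmpty()) {
        shared.termCounts.computeIfAbsent(token, k -> new LongAdder()).increment();
      }
    }
  }

  // The whole-corpus step runs once, in the last instance to finish,
  // regardless of the order in which the finished callbacks arrive.
  // ASSUMPTION: all started() calls happen before the first finished() call.
  void finished() {
    if (shared.running.decrementAndGet() == 0) {
      Map<String, Long> result = new ConcurrentHashMap<>();
      shared.termCounts.forEach((term, n) -> result.put(term, n.sum()));
      shared.finalCounts = result;
    }
  }
}

class SharedStateDemo {
  static Map<String, Long> run() {
    SharedCorpusState state = new SharedCorpusState();
    CountingPr template = new CountingPr(state);
    CountingPr dup = template.duplicate();
    template.started();
    dup.started();
    template.execute("a b a");
    dup.execute("b c");
    template.finished();  // not the last one: no whole-corpus step yet
    dup.finished();       // last one: triggers the whole-corpus step
    return state.finalCounts;
  }
}
```

With a pool that grows and shrinks dynamically, the counter could reach zero while more instances are still to come, so this sketch would need a different trigger for the whole-corpus step in that setting.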
I guess if guarantees are needed by some PR at all, then they would be these:
In the case of a web/REST service these should not be too hard to follow; in the case of GCP or similar programs it would be trivial.
For reference, the current behaviour (not particularly planned this way, but that's how it works out) does guarantee that the started callback will go to the template before the duplicates, but the order of the finished callbacks is not deterministic, as it depends on the order in which the controllers were last given back to the pool after processing their final documents.
Currently the controllerExecutionStarted callback is invoked first on the original, "template" controller, then in order on all the duplicates that were added to the pool, in a single thread, one after the other.
However, the controllerExecutionFinished/Aborted callbacks can occur in any order because the iteration happens over the queue as it stands at the time of termination. It may be useful to invoke these callbacks either in the same order or, maybe even better, in reverse order, so that the one for the original template controller gets invoked last. This should be easy to implement since there is a list which contains the controllers in the order in which they were created.
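The ordering proposed above could look roughly like the following sketch. `SketchController` and `CallbackOrdering` are hypothetical names, and the strings merely record which callback would be invoked on which controller; the actual GCP pool code is not shown in this thread, so this only illustrates iterating a creation-order list forwards for the started callbacks and backwards for the finished ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ListIterator;

// Hypothetical minimal controller; NOT the GATE Controller API.
class SketchController {
  final String name;
  SketchController(String name) { this.name = name; }
}

class CallbackOrdering {
  // Started callbacks: template first, then duplicates in creation order.
  static List<String> startedOrder(List<SketchController> creationOrder) {
    List<String> calls = new ArrayList<>();
    for (SketchController c : creationOrder) {
      calls.add("started:" + c.name);
    }
    return calls;
  }

  // Finished callbacks: walk the same list backwards so the template is last.
  static List<String> finishedOrder(List<SketchController> creationOrder) {
    List<String> calls = new ArrayList<>();
    ListIterator<SketchController> it =
        creationOrder.listIterator(creationOrder.size());
    while (it.hasPrevious()) {
      calls.add("finished:" + it.previous().name);
    }
    return calls;
  }
}
```

Iterating the creation-order list, rather than the pool queue, is what makes the order independent of when each duplicate was last returned to the pool.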
The details and guarantees related to the controllerExecutionXXX callbacks should get documented here and also as requirements for other GCP-like tools in the ControllerAwarePR interface.