Deterministic, documented order of the controller started, finished and aborted callbacks.

johann-petrak commented 6 years ago

Currently the controllerExecutionStarted callback is invoked first on the original, "template" controller, then in order on all the duplicates that were added to the pool, in a single thread, one after the other.

However, the controllerExecutionFinished/Aborted callbacks can occur in any order because the iteration happens over the queue at the time of termination. It may be useful to invoke these callbacks either in the same order or, maybe even better, in reverse order, so that the one for the original template controller gets invoked last. This should be easy to implement since there is a list which contains the controllers in the order in which they were created.
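To illustrate the reverse-order idea, here is a minimal sketch. It does not use the real GATE or GCP classes: `PooledController`, `ReverseOrderFinish` and `makeControllers` are hypothetical stand-ins, and the only assumption taken from the description above is that a list of controllers in creation order (template first) is available.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a pooled controller; NOT the GATE Controller API.
interface PooledController {
    String name();
    void executionFinished();
}

final class ReverseOrderFinish {
    // Invokes the finished callback in reverse creation order, so that the
    // template controller (index 0) is notified last.
    static void finishAll(List<? extends PooledController> creationOrder) {
        for (int i = creationOrder.size() - 1; i >= 0; i--) {
            creationOrder.get(i).executionFinished();
        }
    }

    // Demo helper: builds a template plus `duplicates` copies; each one
    // appends its own name to `log` when its finished callback runs.
    static List<PooledController> makeControllers(int duplicates, List<String> log) {
        List<PooledController> all = new ArrayList<>();
        for (int i = 0; i <= duplicates; i++) {
            final String name = (i == 0) ? "template" : "duplicate-" + i;
            all.add(new PooledController() {
                @Override public String name() { return name; }
                @Override public void executionFinished() { log.add(name); }
            });
        }
        return all;
    }
}
```

Because the loop walks the creation-order list rather than the pool's queue, the order no longer depends on which controller happened to be returned to the pool last.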

The details and guarantees related to the controllerExecutionXXX callbacks should get documented here and also as requirements for other GCP-like tools in the ControllerAwarePR interface.

ianroberts commented 6 years ago

While you can put these kinds of guarantees in something like GCP where we know all the duplicates are going to be created in one go up front, I think it would be a mistake to mandate them as part of the ControllerAwarePR interface. Consider situations like a web app with a pool of controllers that grows and shrinks dynamically to match the server load. I suppose they can be documented as "best practice" at least.

Alternatively, it may be that what you're trying to achieve here would be better modelled as an OutputHandler rather than as a PR in the pipeline, as an output handler is a singleton by definition and does have a well-defined lifecycle of init called in one thread, then lots of outputDocument calls in the parallel processing threads, then one close call in a single thread at the end.
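The lifecycle described above (one init call in a single thread, many outputDocument calls in parallel, one close call in a single thread) could look roughly like the following sketch. `CountingHandler` and `HandlerLifecycleDemo` are hypothetical stand-ins written for illustration and are not GCP's actual OutputHandler interface; the method names only mirror the ones mentioned in the comment.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a singleton output handler; NOT the GCP interface.
final class CountingHandler {
    private AtomicInteger seen;
    private int finalCount = -1;

    // Called once, in a single thread, before any document is processed.
    void init() { seen = new AtomicInteger(); }

    // Called from many worker threads, so it must be thread-safe.
    void outputDocument(String docName) { seen.incrementAndGet(); }

    // Called once, in a single thread, after all workers have finished.
    void close() { finalCount = seen.get(); }

    int finalCount() { return finalCount; }
}

final class HandlerLifecycleDemo {
    static int run(int threads, int docsPerThread) {
        CountingHandler handler = new CountingHandler();
        handler.init();

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int d = 0; d < docsPerThread; d++) {
                    handler.outputDocument("doc-" + d);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }

        handler.close();
        return handler.finalCount();
    }
}
```

The point of the pattern is that only outputDocument needs to be thread-safe; init and close run while no worker is active.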

johann-petrak commented 6 years ago

This is mainly about processing patterns that require whole-corpus processing AND should work properly with GCP or other tools that use duplication and concurrent processing, like the CorpusStats plugin (similar to TermRaider, but can run with GCP) or the LearningFramework (where training in most cases happens after all documents have been processed).

With those, it would be good to have some guarantee or at least convention about what can or cannot run in parallel when the controller callbacks are invoked and maybe also some predictable behaviour about when those callbacks are invoked.

But I can see your point about having this going on in a web/rest server with dynamic allocation of controllers. How is this handled in the current code for services? Now that it is possible to deliberately suppress the controller started/finished callbacks when execute() is invoked and instead use the invokeControllerExecutionXXX methods when wanted, how does the service do this?

The reason I thought some guarantees or conventions would be good is that it is not easy to come up with a generic pattern that works correctly when any kind of concurrency and ordering of the callbacks and execute calls can occur. But maybe it is unavoidable.

BTW, the output-handler approach is, I think, not usable for most of these situations above, because they all process and hence need access to data that got collected by the PR (collectively by all duplicates, using some shared data structure), while often ignoring the documents themselves.

johann-petrak commented 6 years ago

I guess if guarantees are needed by some PR at all, then it would be these:

The controllerExecutionStarted() callback for the first/template controller is invoked before any other started, execute or finished method is invoked. Reason: this is where shared data that gets updated for all the documents seen by all the duplicates gets initialized. This does not happen in init() because, in that case, repeated execution of the pipeline within the GATE GUI would not re-initialize the data structures.
The controllerExecutionFinished() callback for the first/template controller is invoked after any other finished/aborted callback or execute(). Reason: if any overall processing is necessary, it should happen after everything else has completed, whatever the meaning of "everything else" is, e.g. in a web service it may mean "the last controller that was still there is about to get chucked out".

In the case of a web/rest service these should not be too hard to follow; in the case of GCP or similar programs it would be trivial.
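As a rough illustration of how a PR might rely on the two proposed guarantees, the sketch below initializes shared state in the template's started callback and aggregates it in the template's finished callback. Everything here is an assumption made for the example: `CountingPr`, `CallbackOrderDemo`, and the way duplicates share state are hypothetical and do not reflect the real ControllerAwarePR interface or GATE's duplication mechanism.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical corpus-level PR; NOT the GATE ControllerAwarePR API.
final class CountingPr {
    // Holder shared by the template and all of its duplicates.
    private final AtomicReference<Map<String, Integer>> shared;
    private final boolean isTemplate;
    private int total = -1;

    CountingPr() { this(new AtomicReference<>(), true); }

    private CountingPr(AtomicReference<Map<String, Integer>> shared, boolean isTemplate) {
        this.shared = shared;
        this.isTemplate = isTemplate;
    }

    CountingPr duplicate() { return new CountingPr(shared, false); }

    // Guarantee 1: the template's started callback runs before anything else,
    // so it can (re)initialize the shared data on every run.
    void started() {
        if (isTemplate) shared.set(new ConcurrentHashMap<>());
    }

    // May run concurrently on the template and all duplicates.
    void execute(String key) { shared.get().merge(key, 1, Integer::sum); }

    // Guarantee 2: the template's finished callback runs after everything
    // else, so it can safely aggregate what all duplicates collected.
    void finished() {
        if (isTemplate) {
            total = shared.get().values().stream().mapToInt(Integer::intValue).sum();
        }
    }

    int total() { return total; }
}

final class CallbackOrderDemo {
    static int run(int duplicates, int docsPerController) {
        CountingPr template = new CountingPr();
        List<CountingPr> all = new ArrayList<>();
        all.add(template);
        for (int i = 0; i < duplicates; i++) all.add(template.duplicate());

        // Started callbacks: single thread, template first.
        for (CountingPr pr : all) pr.started();

        ExecutorService pool = Executors.newFixedThreadPool(all.size());
        for (CountingPr pr : all) {
            pool.submit(() -> {
                for (int d = 0; d < docsPerController; d++) pr.execute("doc");
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }

        // Finished callbacks: single thread, reverse order, template last.
        for (int i = all.size() - 1; i >= 0; i--) all.get(i).finished();
        return template.total();
    }
}
```

Calling `started()` again on the template before a second run replaces the shared map, which mirrors the reason given above for not doing this initialization in init().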

ianroberts commented 5 years ago

For reference, the current behaviour (not particularly planned this way, but that's how it works out) does guarantee that the started callback will go to the template before the duplicates, but the order of the finished callbacks is not deterministic, as it depends on the order in which the controllers were last given back to the pool after processing their final documents.

GateNLP / gcp

Deterministic, documented order of the controller started, finished and aborted callbacks. #2