jcjimenez opened this issue 6 years ago
I took a look at the project and agree that the inverted orchestration will likely be a more broadly usable approach.
There are several limitations with the current approach:
Having to add new trainers/plugins as sub-modules to this repository will lead to a proliferation of forks, which makes it difficult to distribute changes to the main orchestrator code to users. Similarly, the orchestrator's environment would have to contain the dependencies of every trainer/plugin, and those dependencies may conflict with one another. If we move the trainers/plugins into Docker containers, we avoid both problems.
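As a rough illustration of the containerized approach, here is a minimal sketch of how the orchestrator might launch a trainer image, passing the training parameters as environment variables. The environment variable names, mount paths, and the convention that the trainer writes its model to /output are all hypothetical, not part of any existing plugin contract:

```python
import subprocess

def run_training_container(message: dict) -> int:
    """Launch the trainer image named in a training-start message.

    Assumes (hypothetically) that the image reads its inputs from
    environment variables and writes the trained model to /output,
    and that the Docker version supports the --gpus flag.
    """
    cmd = [
        "docker", "run", "--rm", "--gpus", "all",
        "-e", f"TRAINING_IMAGES_CONTAINER={message['trainingImagesContainer']}",
        "-e", f"BASE_MODEL_BLOB_ID={message['trainingBaseModelBlobId']}",
        "-e", f"REQUEST_ID={message['requestId']}",
        "-v", "/mnt/output:/output",
        message["trainingImage"],  # e.g. "cwolff/my_keras_model"
    ]
    return subprocess.run(cmd, check=True).returncode
```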
There is also a simple economic argument: the current architecture requires training to happen on the same machine as the queue monitor. Since most of the time we won't be retraining, keeping a GPU machine running just to watch the queue wastes money. If we invert the control and have a separate orchestrator/queue monitor that checks for a retraining message and then schedules the training on a separate machine, we can spin up the GPU only on demand.
We could do something like this to invert the control:
The messages on the training start queue could look something like this:
```json
{
  "trainingImagesContainer": "images-3-15-2018",
  "trainingVmSku": "ND6",
  "trainingImage": "cwolff/my_keras_model",
  "trainingBaseModelBlobId": "models/model-3-14-2018",
  "requestId": "c3feb8b2-287e-11e8-b467-0ed5f89f718b"
}
```
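To make the inversion concrete, here is a minimal sketch of what the orchestrator's poll loop might look like, assuming Azure Storage Queues (the azure-storage-queue SDK), a queue named "training-start", and a provision_training_vm helper standing in for the azure-mgmt-compute calls that would actually create the GPU VM; none of these names are settled:

```python
import json
import time

from azure.storage.queue import QueueClient

# Placeholder values; in practice these would come from configuration.
CONNECTION_STRING = "<storage-account-connection-string>"
START_QUEUE_NAME = "training-start"


def provision_training_vm(sku: str, image: str, request_id: str) -> None:
    """Placeholder for the VM-management calls that would create a GPU VM
    of the requested SKU and hand it the trainer image plus request id."""
    raise NotImplementedError


def poll_start_queue() -> None:
    """Watch the training-start queue and spin up a GPU VM per request."""
    queue = QueueClient.from_connection_string(CONNECTION_STRING, START_QUEUE_NAME)
    while True:
        for raw in queue.receive_messages():
            request = json.loads(raw.content)
            provision_training_vm(
                sku=request["trainingVmSku"],    # e.g. "ND6"
                image=request["trainingImage"],  # trainer Docker image
                request_id=request["requestId"],
            )
            queue.delete_message(raw)
        time.sleep(30)
```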
The messages on the training end queue could look something like this:
```json
{
  "trainingVmId": "aee41e5e-287e-11e8-b467-0ed5f89f718b",
  "requestId": "c3feb8b2-287e-11e8-b467-0ed5f89f718b"
}
```
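And a minimal sketch of the other end: once training finishes, the worker on the GPU VM could enqueue the end message so the orchestrator knows it can tear the VM down. Again, the queue name and connection string are assumptions:

```python
import json

from azure.storage.queue import QueueClient

CONNECTION_STRING = "<storage-account-connection-string>"  # placeholder
END_QUEUE_NAME = "training-end"                            # assumed queue name


def report_training_done(training_vm_id: str, request_id: str) -> None:
    """Tell the orchestrator that this VM has finished its training request."""
    queue = QueueClient.from_connection_string(CONNECTION_STRING, END_QUEUE_NAME)
    queue.send_message(json.dumps({
        "trainingVmId": training_vm_id,
        "requestId": request_id,
    }))
```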
If we want more progress insight, we can introduce a shared Table Storage table into which the various pieces of the pipeline report their progress against the shared requestId, e.g. "trainingVmCreated", "imagesDownloaded", "modelIteration1Done", "trainingSucceededWithImages", "trainingVmDestroyed", etc.
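For example, a small progress-reporting helper could look roughly like this, using the azure-data-tables SDK; the table name and the choice of requestId as partition key are assumptions:

```python
from datetime import datetime, timezone

from azure.data.tables import TableClient

CONNECTION_STRING = "<storage-account-connection-string>"  # placeholder
PROGRESS_TABLE = "trainingprogress"                        # assumed table name


def report_progress(request_id: str, status: str) -> None:
    """Record a pipeline milestone (e.g. "trainingVmCreated") for a request.

    Assumes the progress table has already been created.
    """
    table = TableClient.from_connection_string(CONNECTION_STRING, table_name=PROGRESS_TABLE)
    table.upsert_entity({
        "PartitionKey": request_id,
        "RowKey": status,
        "reportedAt": datetime.now(timezone.utc).isoformat(),
    })
```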
Thoughts, @jcjimenez @hxlnt @michaelperel?
Definition of done: We arrive at a decision on whether traind.py calls implementation-specific plugins like retinanet/plugin.py (as it does today) or if implementation-specific workers embed dispatch logic (via a Python egg + separate Docker images, depending on what kind of training is to be done).

Background: A traind.py worker may be asked to do training for