CatalystCode / VoTT-worker

VoTT training and prediction queue consumer
Apache License 2.0

Consider re-organizing/control-inverting orchestrator and plugin #5

Open · jcjimenez opened this issue 6 years ago

jcjimenez commented 6 years ago

Definition of done: We arrive at a decision on whether traind.py keeps calling implementation-specific plugins such as retinanet/plugin.py (as it does today), or whether implementation-specific workers embed the dispatch logic themselves (via a Python egg plus separate Docker images, depending on what kind of training is to be done).

Background: A traind.py worker may be asked to do training for more than one kind of model implementation.
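To make the trade-off concrete, here is a minimal sketch of the plugin-call style described above; the module layout and the train() signature are assumptions for illustration, not the actual retinanet/plugin.py interface:

import importlib

def run_training(plugin_name, images_dir, base_model_path):
    # Hypothetical dispatch as traind.py might do it today:
    # "retinanet" -> import retinanet.plugin and call its train() function.
    plugin = importlib.import_module("{}.plugin".format(plugin_name))
    return plugin.train(images_dir=images_dir, base_model_path=base_model_path)

Under the inverted alternative, that dispatch would presumably move out of traind.py, with each implementation-specific Docker image shipping its own training entry point.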

c-w commented 6 years ago

I took a look at the project and agree that inverting the orchestration will likely be the more broadly usable approach.

There are several limitations with the current approach:

We could do something like this to invert the control:

[Architecture diagram]

The messages on the training start queue could look something like this:

{
  "trainingImagesContainer": "images-3-15-2018",
  "trainingVmSku": "ND6",
  "trainingImage": "cwolff/my_keras_model",
  "trainingBaseModelBlobId": "models/model-3-14-2018",
  "requestId": "c3feb8b2-287e-11e8-b467-0ed5f89f718b"
}
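As a rough sketch of what a consumer of that message might do (the provision_training_vm helper below is a placeholder, not part of VoTT-worker; the field names simply mirror the example above):

import json
import uuid

def provision_training_vm(sku, docker_image, images_container,
                          base_model_blob_id, request_id):
    # Placeholder: the real orchestrator would create a VM of the
    # requested SKU and start the training Docker image on it.
    print("provisioning {} for request {}".format(sku, request_id))
    return str(uuid.uuid4())

def handle_training_start(raw_message):
    # Parse one training-start message and kick off a training VM.
    msg = json.loads(raw_message)
    return provision_training_vm(
        sku=msg["trainingVmSku"],
        docker_image=msg["trainingImage"],
        images_container=msg["trainingImagesContainer"],
        base_model_blob_id=msg["trainingBaseModelBlobId"],
        request_id=msg["requestId"],
    )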

The messages on the training end queue could look something like this:

{
  "trainingVmId": "aee41e5e-287e-11e8-b467-0ed5f89f718b",
  "requestId": "c3feb8b2-287e-11e8-b467-0ed5f89f718b"
}
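Correspondingly, a sketch of an end-queue consumer that tears the VM back down (again with placeholder logic):

import json

def handle_training_end(raw_message):
    msg = json.loads(raw_message)
    # Placeholder teardown: the real orchestrator would deallocate and
    # delete the VM identified by trainingVmId.
    print("tearing down VM {} for request {}".format(
        msg["trainingVmId"], msg["requestId"]))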

If we want more progress insights, we can introduce a shared Table Storage table into which the various pieces of the pipeline report their progress against the shared requestId, e.g. "trainingVmCreated", "imagesDownloaded", "modelIteration1Done", "trainingSucceededWithImages", "trainingVmDestroyed", etc.
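For example, each component could upsert a row keyed by the shared requestId; a rough sketch using the azure-data-tables SDK (the table name and entity shape are assumptions, and the table is assumed to already exist):

from datetime import datetime, timezone
from azure.data.tables import TableClient

def report_progress(connection_string, request_id, status):
    # One row per (requestId, status) pair,
    # e.g. ("c3feb8b2-...", "trainingVmCreated").
    client = TableClient.from_connection_string(
        connection_string, table_name="trainingProgress")
    client.upsert_entity({
        "PartitionKey": request_id,
        "RowKey": status,
        "reportedAt": datetime.now(timezone.utc).isoformat(),
    })

Reading back all rows for a given requestId would then give a timeline of how far that training request has progressed.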

Thoughts, @jcjimenez @hxlnt @michaelperel?