botpress / nlu

This repo contains every ML/NLU related code written by Botpress in the NodeJS environment. This includes the Botpress Standalone NLU Server.

feat(nlu-server): distributed training queue to scale training nlu horizontally #72

Closed franklevasseur closed 3 years ago

franklevasseur commented 3 years ago

Distributed Training queue

Description

Prior to this PR, the NLU server couldn't be deployed behind a load balancer in a multi-cluster fashion. Because every training state was kept in memory, every request meant to read or alter a training state had to be routed to the exact machine that started the training.

This PR enables such deployments by storing training state in a database when desired and by broadcasting requests to cancel a training to all instances.
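A minimal sketch of the idea, assuming a hypothetical `TrainingRepository` abstraction (the interface and all names here are illustrative, not the PR's actual code): a single instance can use an in-memory implementation, while a database-backed implementation of the same interface lets any instance behind the load balancer read or alter a training's state.

```typescript
// Illustrative shape of a training's state; the real columns live in the PR's schema.
type TrainingState = {
  status: string // e.g. 'training', 'done'
  progress: number // 0..1
}

// The abstraction both storage backends would implement.
interface TrainingRepository {
  get(trainingId: string): Promise<TrainingState | undefined>
  set(trainingId: string, state: TrainingState): Promise<void>
}

// Single-instance backend: state lives in process memory, so only the
// machine that started the training can see it.
class InMemoryTrainingRepo implements TrainingRepository {
  private states = new Map<string, TrainingState>()

  public async get(trainingId: string): Promise<TrainingState | undefined> {
    return this.states.get(trainingId)
  }

  public async set(trainingId: string, state: TrainingState): Promise<void> {
    this.states.set(trainingId, state)
  }
}

// A database-backed implementation would have the same interface but persist
// each state to a shared table, making it visible to every instance.
```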

Solves #62

How it works

State machine / behavior

This chart gives an overview:

[image: state machine overview chart]
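The chart itself is an image in the PR, but the general shape of such a training state machine can be sketched as a transition table. The status names below are assumptions for illustration, not necessarily the ones in the chart:

```typescript
// Hypothetical status names; the authoritative set is in the PR's chart.
type TrainingStatus = 'training-pending' | 'training' | 'done' | 'errored' | 'canceled'

// Allowed transitions: a pending training starts or is canceled; a running
// training finishes, fails, or is canceled; terminal states go nowhere.
const transitions: Record<TrainingStatus, TrainingStatus[]> = {
  'training-pending': ['training', 'canceled'],
  training: ['done', 'errored', 'canceled'],
  done: [],
  errored: [],
  canceled: []
}

function canTransition(from: TrainingStatus, to: TrainingStatus): boolean {
  return transitions[from].includes(to)
}
```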

Storage

When no database URL is given to the NLU server, training states and models are kept locally (in memory and on the file system), as before.

When a database URL is provided, training states and models are stored in the database instead. Training states live in the nlu_trainings table; there's a column for the training set, which is stringified and compressed before being written to the table. Models go in the ghost tables.
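The stringify-and-compress step can be sketched with Node's built-in `zlib`. The training-set shape below is an assumption for illustration; only the serialize/gzip roundtrip mirrors what the PR describes:

```typescript
import * as zlib from 'zlib'

// Hypothetical training-set shape; the real layout is defined by the NLU server.
const trainSet = {
  language: 'en',
  intents: [{ name: 'greet', utterances: ['hello', 'hi there'] }],
  entities: [] as unknown[]
}

// Writing the row: stringify, then gzip the bytes before they go in the column.
const compressed: Buffer = zlib.gzipSync(Buffer.from(JSON.stringify(trainSet), 'utf8'))

// Reading the row back: gunzip, then parse.
const restored = JSON.parse(zlib.gunzipSync(compressed).toString('utf8'))
```

Compressing pays off here because training sets are highly repetitive JSON, so the column stays small even for large bots.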

What's left before merging