botpress / nlu

This repo contains every ML/NLU related code written by Botpress in the NodeJS environment. This includes the Botpress Standalone NLU Server.

feat(nlu-server): distributed training queue to scale training nlu horizontally #72

Closed franklevasseur closed 3 years ago

franklevasseur commented 3 years ago

Distributed Training queue

Description

Prior to this PR, the NLU server couldn't be deployed behind a load balancer in a multi-cluster fashion. Because every training state was kept in memory, every request meant to read or alter a training state had to be routed to the exact machine that started the training.

This PR enables such deployments by storing training state in a database when desired and by broadcasting requests to cancel a training to all instances.
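A minimal sketch of the idea, assuming a hypothetical `TrainingRepository` abstraction (the interface and all names here are illustrative, not the PR's actual code): a single instance can use an in-memory implementation, while a database-backed implementation of the same interface lets any instance behind the load balancer read or alter a training's state.

```typescript
// Illustrative shape of a training's state; the real columns live in the PR's schema.
type TrainingState = {
  status: string // e.g. 'training', 'done'
  progress: number // 0..1
}

// The abstraction both storage backends would implement.
interface TrainingRepository {
  get(trainingId: string): Promise<TrainingState | undefined>
  set(trainingId: string, state: TrainingState): Promise<void>
}

// Single-instance backend: state lives in process memory, so only the
// machine that started the training can see it.
class InMemoryTrainingRepo implements TrainingRepository {
  private states = new Map<string, TrainingState>()

  public async get(trainingId: string): Promise<TrainingState | undefined> {
    return this.states.get(trainingId)
  }

  public async set(trainingId: string, state: TrainingState): Promise<void> {
    this.states.set(trainingId, state)
  }
}

// A database-backed implementation would have the same interface but persist
// each state to a shared table, making it visible to every instance.
```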

Solves #62

How it works

State machine / behavior

This chart gives an overview:

[image: state machine overview chart]
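The chart itself is an image in the PR, but the general shape of such a training state machine can be sketched as a transition table. The status names below are assumptions for illustration, not necessarily the ones in the chart:

```typescript
// Hypothetical status names; the authoritative set is in the PR's chart.
type TrainingStatus = 'training-pending' | 'training' | 'done' | 'errored' | 'canceled'

// Allowed transitions: a pending training starts or is canceled; a running
// training finishes, fails, or is canceled; terminal states go nowhere.
const transitions: Record<TrainingStatus, TrainingStatus[]> = {
  'training-pending': ['training', 'canceled'],
  training: ['done', 'errored', 'canceled'],
  done: [],
  errored: [],
  canceled: []
}

function canTransition(from: TrainingStatus, to: TrainingStatus): boolean {
  return transitions[from].includes(to)
}
```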

Storage

When no database URL is given to the NLU server, training states and models are kept locally (in memory and on the file system), as before.

When a database URL is provided, training states and models are stored in the database instead. Training states live in the nlu_trainings table; there's a column for the training set, which is stringified and compressed before being written to the table. Models go in the ghost tables.
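The stringify-and-compress step can be sketched with Node's built-in `zlib`. The training-set shape below is an assumption for illustration; only the serialize/gzip roundtrip mirrors what the PR describes:

```typescript
import * as zlib from 'zlib'

// Hypothetical training-set shape; the real layout is defined by the NLU server.
const trainSet = {
  language: 'en',
  intents: [{ name: 'greet', utterances: ['hello', 'hi there'] }],
  entities: [] as unknown[]
}

// Writing the row: stringify, then gzip the bytes before they go in the column.
const compressed: Buffer = zlib.gzipSync(Buffer.from(JSON.stringify(trainSet), 'utf8'))

// Reading the row back: gunzip, then parse.
const restored = JSON.parse(zlib.gunzipSync(compressed).toString('utf8'))
```

Compressing pays off here because training sets are highly repetitive JSON, so the column stays small even for large bots.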

What's left before merging