Rewrite the NLU Training State Machine in Studio
Description
Bugs DEV-2258 and DEV-2259 were hard to reproduce and I couldn't figure out why they were occurring, so I decided to wipe out the code with a bulldozer. You can say this PR is an attempt to:
fix some bugs
make the code simpler
take some ownership in the studio by writing code I actually want to work with
Reviewing
There is little to no value in reviewing the code on GitHub. I strongly suggest reading the code in VS Code instead.
The following files contain the core logic of this PR (state-machine)
packages/studio-be/src/studio/nlu/bot/index.ts: core logic of the training state-machine
packages/studio-be/src/studio/nlu/bot/bot-state.ts
The following files are worth a quick look (entry points)
packages/studio-be/src/studio/nlu/index.ts
packages/studio-be/src/studio/nlu/nlu-router.ts: HTTP API of the NLU in Studio Backend
packages/studio-be/src/studio/nlu/nlu-service.ts: entry point of the business logic
How it works
start training
When training starts, studio-be stores a training entry in its local DB. A training entry maps a botId and language to a modelId and a definition hash.
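To make the mapping concrete, here is a minimal sketch of what a training entry and its store could look like. The field names, the TrainingEntryStore class and the in-memory Map are assumptions for illustration only; the actual code persists entries in the local DB.

```typescript
// Hypothetical shape of a training entry; field names are assumptions.
interface TrainingEntry {
  botId: string;
  language: string;
  modelId: string;
  definitionHash: string;
}

// In-memory stand-in for the local DB table, keyed by (botId, language).
class TrainingEntryStore {
  private readonly entries = new Map<string, TrainingEntry>();

  // JSON-encode the pair so that no botId/language combination can collide.
  private key(botId: string, language: string): string {
    return JSON.stringify([botId, language]);
  }

  upsert(entry: TrainingEntry): void {
    this.entries.set(this.key(entry.botId, entry.language), entry);
  }

  get(botId: string, language: string): TrainingEntry | undefined {
    return this.entries.get(this.key(botId, language));
  }

  delete(botId: string, language: string): boolean {
    return this.entries.delete(this.key(botId, language));
  }
}
```

Keying on both botId and language means a bot can have one training in flight per language without the entries overwriting each other.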
Once training starts, studio-be polls the training state and sends it through the web socket. The polled function is syncAndGetState(), the exact same function called when studio-ui gets the training/model state. Studio-be stops polling when training stops.
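The polling described above could be sketched roughly as follows. This is not the code from the PR: pollTraining, the emit callback, the interval parameter and the "training" status name are all assumptions, and the web socket is reduced to a plain callback.

```typescript
// Hypothetical state shape; only "done" and "needs-training" are named in
// this PR, "training" is an assumed name for the in-progress status.
type TrainingState = { status: 'training' | 'done' | 'needs-training' };

// Poll syncAndGetState(), forward every state to emit (a stand-in for the
// web socket), and stop as soon as the training is no longer running.
async function pollTraining(
  syncAndGetState: () => Promise<TrainingState>,
  emit: (state: TrainingState) => void,
  intervalMs: number
): Promise<TrainingState> {
  while (true) {
    const state = await syncAndGetState();
    emit(state);
    if (state.status !== 'training') {
      return state;
    }
    // Wait before asking again.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Because the loop calls the same function as the HTTP route, the state pushed through the socket and the state returned to studio-ui cannot drift apart.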
get training/model state
When studio-ui gets the training/model state (syncAndGetState()), studio-be starts by checking whether it has a training entry in its local DB.
If there is a local training entry, studio-be queries NLU Server to get the actual state of the training. The following rule is then used to map the status before returning:
If NLU Server responds that training is "done", the training entry is deleted, a model entry is set/upserted and the bot config is updated with the model. This is why the function is called "syncAndGetState()" instead of only "getState()".
If there's no local training entry, there's no way to query NLU Server for the training state because the modelId is unknown. In this case, studio-be falls back on the model.
Studio-be returns "done" if there is a local model entry, the model exists on NLU Server and the model is not dirty.
Otherwise it returns "needs-training".
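The decision flow above can be summed up in a short sketch. Everything in the StateDeps interface is a hypothetical stand-in for the local DB, the bot config and the NLU Server client. Two further assumptions: "dirty" is taken to mean that the stored definition hash no longer matches the current one, and any status other than "done" is returned unchanged.

```typescript
type Status = 'training' | 'done' | 'needs-training'; // "training" is an assumed name

interface TrainEntry { modelId: string; definitionHash: string }
interface ModelEntry { modelId: string; definitionHash: string }

// Hypothetical dependencies standing in for the DB, bot config and NLU Server.
interface StateDeps {
  getTrainEntry(): TrainEntry | undefined;
  deleteTrainEntry(): void;
  getModelEntry(): ModelEntry | undefined;
  setModelEntry(entry: ModelEntry): void;
  updateBotConfig(modelId: string): void;
  fetchTrainingStatus(modelId: string): Promise<Status>;
  modelExists(modelId: string): Promise<boolean>;
  currentDefinitionHash(): string;
}

async function syncAndGetState(deps: StateDeps): Promise<Status> {
  const training = deps.getTrainEntry();
  if (training) {
    // A training entry exists, so the modelId is known and NLU Server can be asked.
    const status = await deps.fetchTrainingStatus(training.modelId);
    if (status === 'done') {
      // The "sync" part: promote the finished training to a model entry.
      deps.deleteTrainEntry();
      deps.setModelEntry({ modelId: training.modelId, definitionHash: training.definitionHash });
      deps.updateBotConfig(training.modelId);
    }
    return status;
  }

  // No training entry: fall back on the model.
  const model = deps.getModelEntry();
  if (!model) {
    return 'needs-training';
  }
  // Assumption: a model is dirty when the definitions changed since it was trained.
  const dirty = model.definitionHash !== deps.currentDefinitionHash();
  if (dirty) {
    return 'needs-training';
  }
  return (await deps.modelExists(model.modelId)) ? 'done' : 'needs-training';
}
```

Passing the dependencies in as an interface is only a convenience for the sketch; it makes the flow easy to exercise with in-memory fakes.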
Worth mentioning
Model entries could be kept in bot.config.json instead of in the database, but this means studio would write a dataset hash in the config (which might look weird).
If Studio ever becomes a desktop app that can't be used in a cluster configuration, train entries will be kept in memory instead of in the database. The only drawback is that if studio dies during a training, the training is lost (which is all right).