botpress / v12

Botpress OSS – v12
https://v12.botpress.com
GNU Affero General Public License v3.0

[CHORE] Serialize Utterances #1143

Closed franklevasseur closed 3 years ago

franklevasseur commented 3 years ago

Is your feature request related to a problem? Please describe.
Training can sometimes take very long on big bots. A big part of the training runs in a separate process to prevent freezing the UI, but every time we run a training we have to load the model, which requires tokenizing and vectorizing the utterances a second time, this time in the web process...

Describe the solution you'd like

1) We could add an optional parameter to the training that specifies to load the model before serializing it.
2) The model-serialization step should also serialize the Utterance class so it doesn't have to be recomputed on each load.

This means that Utterance class would look something like this:

```ts
class Utterance {
  public static serialize(utt: Utterance): UtteranceDto { ... }
  public static deserialize(utt: UtteranceDto): Utterance { ... }
  ...
}
```
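A minimal sketch of what such a DTO pair could look like. The `UtteranceDto` shape and its field names here are assumptions for illustration, not the actual Botpress NLU types:

```typescript
// Hypothetical plain-data mirror of Utterance (field names are assumptions).
// A DTO like this survives JSON encoding and worker message passing,
// whereas a class instance with methods does not.
interface UtteranceDto {
  tokens: string[]
  vectors: number[][] // one embedding per token
}

class Utterance {
  constructor(public tokens: string[], public vectors: number[][]) {}

  // Flatten the instance into plain data for the training output.
  public static serialize(utt: Utterance): UtteranceDto {
    return { tokens: utt.tokens, vectors: utt.vectors }
  }

  // Rebuild the instance without re-tokenizing or re-vectorizing.
  public static deserialize(dto: UtteranceDto): Utterance {
    return new Utterance(dto.tokens, dto.vectors)
  }
}
```

The point of the round trip is that `deserialize` only reattaches behavior to data that was already computed, so loading a model skips the expensive tokenize/vectorize pass.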
franklevasseur commented 3 years ago

Closing, as I experimented with a few things to make load time faster and realized it's a bit more complex than I expected:

  1. For our biggest known bot (let's call it Gerry), the added weight of processed intents in the training output was at least ~500 MB... (I say "at least" because I tried a few configurations.)
  2. Adding too much information to the training output ends up taking more time moving the data around than recreating it on load.
  3. There is a risk that too much data won't make it from the training worker to the web worker (I actually hit this bug on Gerry under particular circumstances).
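A rough back-of-envelope sketch of why point 1 bites. All counts and dimensions below are illustrative assumptions, not Gerry's real figures; the sketch only shows how quickly per-token embeddings add up once they are materialized in the training output:

```typescript
// Illustrative size estimate for serialized utterance embeddings.
// These numbers are assumptions chosen to show the order of magnitude.
const DIM = 300            // typical word-embedding dimensionality
const TOKENS_PER_UTT = 10  // average tokens per utterance
const UTTERANCES = 50_000  // a large bot

// Raw binary floats (e.g. Float32Array): 4 bytes per value.
const binaryBytes = UTTERANCES * TOKENS_PER_UTT * DIM * 4

// JSON-encoded floats easily take 2x or more (digits, commas, brackets).
const jsonBytesLowEstimate = binaryBytes * 2

const toMb = (b: number) => Math.round(b / (1024 * 1024))
console.log(`binary: ~${toMb(binaryBytes)} MB, JSON: >= ~${toMb(jsonBytesLowEstimate)} MB`)
```

Even the raw binary figure lands in the hundreds of megabytes, and a text encoding inflates it further, which is consistent with points 2 and 3: a payload that size costs more to move between workers than it saves, and risks not arriving at all.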