Various attempts at improving training resource consumption in 4 commits:
fix(nlu-engine): launch svm trainings one after the other
Currently, all iterations of the grid search run during the SVM training are executed concurrently. This results in multiple SVMs being loaded in memory at the same time, consuming lots of RAM. Using a Bluebird `mapSeries`, this PR makes sure only one SVM is loaded at any time during the grid search.
Unfortunately, because the `node-svm` binding uses a `Napi::AsyncWorker`, this fix also reduces training speed.
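What `mapSeries` buys us can be sketched with a minimal re-implementation of what Bluebird's `Promise.mapSeries` does: await each async task before starting the next, so at most one SVM is in memory at a time. The training callback here is a hypothetical stand-in, not the real grid-search code:

```typescript
// Minimal sketch of Bluebird's mapSeries: run one async task at a time
// instead of firing them all at once.
async function mapSeries<T, R>(items: T[], fn: (item: T) => Promise<R>): Promise<R[]> {
  const results: R[] = []
  for (const item of items) {
    // Await each task before starting the next one, so only a single
    // SVM instance is loaded at any point during the grid search.
    results.push(await fn(item))
  }
  return results
}
```

The trade-off is exactly the one described above: peak memory drops to a single SVM, but tasks no longer overlap, so wall-clock time goes up.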
fix(nlu-engine): launch intent trainings in parallel and log each ctx
We train one intent classifier per context. Currently, all those classifiers are trained sequentially. However, there is no harm in training them concurrently, as the number of concurrent trainings is limited by the `MLThreadPool` class located in the `ml-thread-pool/index.ts` file.
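The reason parallel intent training is safe can be sketched as a bounded worker pool: all context trainings are queued, but only a fixed number run at once. The names below (`runBounded`, `maxWorkers`) are illustrative, not the real `MLThreadPool` API:

```typescript
// Hypothetical sketch of a bounded pool: start trainings for all
// contexts, but cap how many run concurrently.
async function runBounded<T, R>(
  tasks: T[],
  run: (task: T) => Promise<R>,
  maxWorkers: number
): Promise<R[]> {
  const results: R[] = new Array(tasks.length)
  let next = 0
  const worker = async () => {
    // Each worker pulls the next unclaimed task until none remain.
    while (next < tasks.length) {
      const i = next++
      results[i] = await run(tasks[i])
    }
  }
  // Spawn at most maxWorkers concurrent workers.
  await Promise.all(Array.from({ length: Math.min(maxWorkers, tasks.length) }, worker))
  return results
}
```

With such a cap in place, launching one training per context concurrently cannot exhaust the machine, which is why the sequential loop was unnecessary.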
fix(nlu-engine): use a stratified kfold to limit the amount of grid search iterations
We currently use a random k-fold algorithm for the grid search that tries out hyperparameters of the SVM for better performance.
The random algorithm is usually fine, but it can result in critical errors when a train set is made of only one class. This scenario fully breaks libsvm, and no clean error is thrown. Since I knew this could happen, I added a minimum value for k that ensures no train set is ever made of only one class. This minimum value can however be quite big when a dataset is really imbalanced, which results in a really long and painful grid search.
In this PR I added a stratified version of the k-fold algorithm, which ensures class proportions are preserved in each fold as much as possible.
Check out `<root>/packages/nlu-engine/src/ml/svm/libsvm/kfold/readme.md` to learn more about this.
I also took the opportunity to clean up the kfold folder/library to make sure my future self will understand this part of the code. It took me a whole day just to get back into the problem and understand what I had previously understood one year prior. While I was at it, I decided to make it super clear with proper documentation. Even though it's simple, it can add a lot of seconds or GB of RAM to a training.
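The core idea of the stratified split can be sketched as follows. This is a simplified round-robin dealing of each class's samples across the folds, not necessarily the exact algorithm documented in the readme:

```typescript
// Sketch of stratified k-fold: group samples by label, then deal each
// class's samples round-robin across the k folds, so every fold keeps
// roughly the original class proportions.
function stratifiedKFold<T>(samples: T[], labelOf: (s: T) => string, k: number): T[][] {
  const byClass = new Map<string, T[]>()
  for (const s of samples) {
    const label = labelOf(s)
    if (!byClass.has(label)) byClass.set(label, [])
    byClass.get(label)!.push(s)
  }
  const folds: T[][] = Array.from({ length: k }, () => [])
  let i = 0
  for (const classSamples of byClass.values()) {
    for (const s of classSamples) {
      folds[i % k].push(s)
      i++
    }
  }
  return folds
}
```

Because every class is spread over all k folds, no train set can end up single-class (as long as k is not larger than the smallest class), which removes the need for the large minimum k on imbalanced datasets.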
chore(nlu-engine): go back to running svm grid search in parallel
There's no real need to run the grid search serially, as the number of folds is highly reduced by the previous commit.
Performance

On clinc150 using a local lang server with dimension 100:
| branch | memory used (MB) | time to train (s) |
| ------ | ---------------- | ----------------- |
| master | ~800             | 101               |
| this   | ~700             | 82                |
On John Doe* using remote lang server https://lang-01.botpress.io