This PR is part of a sequence of PRs named *svm training improvement $n*, each presenting an improvement or combination of improvements that attempts to make training faster and consume less memory.

⚠️⚠️ Do not merge this PR: we first need to compare it with the other attempts. ⚠️⚠️
## Description
This PR mixes #89 with the following improvement:
We currently use a random k-fold algorithm to run a grid search over the SVM's hyperparameters for better performance.

The random algorithm is usually fine, but it can result in critical errors when a train set contains only one class. This scenario fully breaks libsvm, and no clean error is thrown. Since I knew this could happen, I had added a minimum value for k that ensures no train set is ever made of only one class. This minimum value can however be quite big when a dataset is really imbalanced, which results in a really long and painful grid search.
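To see why the minimum k grows with imbalance: the training portion of a split is k-1 folds, so it can contain a single class only if the one held-out fold absorbs every sample of every other class. Assuming near-equal fold sizes (about n / k samples each), a fold strictly smaller than the number of samples outside the largest class can never do that. The sketch below illustrates this bound; the function name and shape are hypothetical and do not match the actual kfold module.

```typescript
// Hypothetical sketch: smallest k for which a random k-fold split cannot
// produce a single-class train set, assuming near-equal fold sizes.
function minSafeK(classCounts: number[]): number {
  const n = classCounts.reduce((a, b) => a + b, 0)
  // samples outside the largest class; the held-out fold must be smaller
  // than this so it can never swallow all of them
  const minority = n - Math.max(...classCounts)
  let k = 2
  while (Math.ceil(n / k) >= minority && k < n) {
    k++
  }
  return k
}

// A heavily imbalanced dataset forces a large k (hence a slow grid search):
minSafeK([95, 5]) // → 25: folds of 4 can never hold all 5 minority samples
minSafeK([50, 50]) // → 3: with k=2, one fold of 50 could be a whole class
```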
In this PR I added a stratified version of the k-fold algorithm, which ensures class proportions are preserved in each fold as much as possible.
Check out <root>/packages/nlu-engine/src/ml/svm/libsvm/kfold/readme.md to learn more about this.
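The stratified split described above can be sketched roughly as follows: group samples by label, then deal each group round-robin into the k folds so every fold gets its proportional share of every class. The names below are illustrative only and do not match the actual kfold module.

```typescript
// Hypothetical sketch of a stratified k-fold split. Samples are grouped by
// label, then dealt round-robin into k folds, so class proportions are
// preserved in each fold as much as possible.
type Sample<T> = { x: T; y: string }

function stratifiedKFold<T>(samples: Sample<T>[], k: number): Sample<T>[][] {
  // group samples by class label
  const byClass = new Map<string, Sample<T>[]>()
  for (const s of samples) {
    const group = byClass.get(s.y) ?? []
    group.push(s)
    byClass.set(s.y, group)
  }

  // deal each class's samples round-robin across the k folds
  const folds: Sample<T>[][] = Array.from({ length: k }, () => [])
  let i = 0
  for (const group of byClass.values()) {
    for (const s of group) {
      folds[i % k].push(s)
      i++
    }
  }
  return folds
}
```

With this property, any union of k-1 folds contains every class that has at least k samples, so the degenerate single-class train set cannot occur and k no longer needs to be inflated for imbalanced datasets.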
I also took the opportunity to clean up the kfold folder/library to make sure my future self will understand this part of the code. It took me a whole day just to get back into the problem and understand what I had previously understood a year prior. While I was at it, I decided to make it super clear with proper documentation. Even though this code is simple, it can add many seconds or gigabytes of RAM to a training.
## Performance
On clinc150 using a local lang server with dimension 100:
| branch | memory used (mb) | time to train (s) |
| ------ | ---------------- | ----------------- |
| master | ~800             | 101               |
| this   | ~700             | 205               |
On John Doe* using the remote lang server https://lang-01.botpress.io: