This PR is part of a sequence of PRs named *svm training improvement $n*, each presenting an improvement or combination of improvements that attempts to make training faster and consume less memory.

⚠️⚠️ Do not merge this PR: we first need to compare it with the other attempts. ⚠️⚠️
## Description
This PR mixes #89 with the following improvement:
We currently use a random k-fold algorithm to run a grid search over the SVM's hyperparameters for better performance.

The random algorithm is usually fine, but it can result in critical errors when a train set contains only one class. This scenario fully breaks libsvm, and no clean error is thrown. Since I knew this could happen, I had added a minimum value for k that ensures no train set is ever made of only one class. This minimum value can however be quite big when a dataset is really imbalanced, which results in a really long and painful grid search.
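To see why the minimum k grows with imbalance: the training portion of a split is k-1 folds, so it can contain a single class only if the one held-out fold absorbs every sample of every other class. Assuming near-equal fold sizes (about n / k samples each), a fold strictly smaller than the number of samples outside the largest class can never do that. The sketch below illustrates this bound; the function name and shape are hypothetical and do not match the actual kfold module.

```typescript
// Hypothetical sketch: smallest k for which a random k-fold split cannot
// produce a single-class train set, assuming near-equal fold sizes.
function minSafeK(classCounts: number[]): number {
  const n = classCounts.reduce((a, b) => a + b, 0)
  // samples outside the largest class; the held-out fold must be smaller
  // than this so it can never swallow all of them
  const minority = n - Math.max(...classCounts)
  let k = 2
  while (Math.ceil(n / k) >= minority && k < n) {
    k++
  }
  return k
}

// A heavily imbalanced dataset forces a large k (hence a slow grid search):
minSafeK([95, 5]) // → 25: folds of 4 can never hold all 5 minority samples
minSafeK([50, 50]) // → 3: with k=2, one fold of 50 could be a whole class
```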
In this PR I added a stratified version of the k-fold algorithm, which ensures class proportions are preserved in each fold as much as possible.
Check out <root>/packages/nlu-engine/src/ml/svm/libsvm/kfold/readme.md to learn more about this.
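The stratified split described above can be sketched roughly as follows: group samples by label, then deal each group round-robin into the k folds so every fold gets its proportional share of every class. The names below are illustrative only and do not match the actual kfold module.

```typescript
// Hypothetical sketch of a stratified k-fold split. Samples are grouped by
// label, then dealt round-robin into k folds, so class proportions are
// preserved in each fold as much as possible.
type Sample<T> = { x: T; y: string }

function stratifiedKFold<T>(samples: Sample<T>[], k: number): Sample<T>[][] {
  // group samples by class label
  const byClass = new Map<string, Sample<T>[]>()
  for (const s of samples) {
    const group = byClass.get(s.y) ?? []
    group.push(s)
    byClass.set(s.y, group)
  }

  // deal each class's samples round-robin across the k folds
  const folds: Sample<T>[][] = Array.from({ length: k }, () => [])
  let i = 0
  for (const group of byClass.values()) {
    for (const s of group) {
      folds[i % k].push(s)
      i++
    }
  }
  return folds
}
```

With this property, any union of k-1 folds contains every class that has at least k samples, so the degenerate single-class train set cannot occur and k no longer needs to be inflated for imbalanced datasets.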
I also took the opportunity to clean up the kfold folder/library to make sure my future self will understand this part of the code. It took me a whole day just to get back into the problem and understand what I had previously understood a year prior. While I was at it, I decided to make it super clear with proper documentation. Even though this code is simple, it can add many seconds or gigabytes of RAM to a training.
## Performance
On clinc150 using a local lang server with dimension 100:
| branch | memory used (mb) | time to train (s) |
| ------ | ---------------- | ----------------- |
| master | ~800             | 101               |
| this   | ~700             | 205               |
On John Doe* using the remote lang server https://lang-01.botpress.io: