botpress / nlu

This repo contains every ML/NLU related code written by Botpress in the NodeJS environment. This includes the Botpress Standalone NLU Server.

svm training improvement 3 #91

Closed franklevasseur closed 2 years ago

franklevasseur commented 2 years ago

About

This PR is part of a sequence of PRs named "svm training improvement $n". Each one presents an improvement, or a combination of improvements, as an attempt to make training faster and consume less memory.

⚠️⚠️ Do not merge this PR, as we first need to compare it with the other attempts. ⚠️⚠️

Description

This PR mixes #89 with the following improvement:

We currently use a random k-fold algorithm to run a grid search over the SVM hyperparameters and pick the ones that perform best.

The random algorithm is fine, but it can result in critical errors when a train set is made of only one class. This scenario fully breaks libsvm, and no clean error is thrown. Since I knew this could happen, I added a minimum value for k that ensures no train set is ever made of only one class. This minimum value can however be quite big when a dataset is really imbalanced, which results in a really long and painful grid search.
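To illustrate why the minimum k grows with class imbalance, here is a minimal sketch of one way such a bound could be computed. The function name `minKForRandomKfold` and the formula are assumptions made for illustration, not the code from this repository. The idea is that a test fold must be too small to hold every sample outside the largest class, so that the remaining train set always keeps at least two classes.

```typescript
// Hypothetical helper (not the repository's implementation): smallest k such
// that no random k-fold split can leave a train set with a single class.
function minKForRandomKfold(labels: string[]): number {
  const counts = new Map<string, number>()
  for (const label of labels) {
    counts.set(label, (counts.get(label) ?? 0) + 1)
  }
  if (counts.size < 2) {
    throw new Error('At least two classes are required.')
  }

  const n = labels.length
  const largest = Math.max(...Array.from(counts.values()))

  // The largest test fold holds ceil(n / k) samples. It must be strictly
  // smaller than the number of samples outside the largest class.
  const maxFoldSize = n - largest - 1
  if (maxFoldSize < 1) {
    throw new Error('Dataset is too imbalanced for any k to be safe.')
  }
  return Math.max(2, Math.ceil(n / maxFoldSize))
}
```

Under this assumed bound, a balanced dataset of 50/50 samples needs k of at least 3, while a 95/5 split already needs k of at least 25, which multiplies the number of trainings done for each point of the grid.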

In this PR I added a stratified version of the k-fold algorithm, which ensures class proportions are preserved in each fold as much as possible.
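The stratified idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation added in this PR: the name `stratifiedKFold` and its signature are hypothetical, and the sketch skips shuffling, which a real implementation would likely do within each class using a seeded random generator.

```typescript
// Hypothetical sketch of stratified k-fold: group samples by class, then deal
// them to the folds in round-robin order so each fold keeps roughly the same
// class proportions as the full dataset.
function stratifiedKFold<T>(samples: T[], getLabel: (s: T) => string, k: number): T[][] {
  if (k < 2) {
    throw new Error('k must be at least 2.')
  }

  const byClass = new Map<string, T[]>()
  for (const sample of samples) {
    const label = getLabel(sample)
    const group = byClass.get(label) ?? []
    group.push(sample)
    byClass.set(label, group)
  }

  const folds: T[][] = Array.from({ length: k }, () => [] as T[])
  // The counter keeps running across classes so fold sizes stay balanced.
  let next = 0
  for (const group of Array.from(byClass.values())) {
    for (const sample of group) {
      folds[next % k].push(sample)
      next++
    }
  }
  return folds
}
```

With this scheme the count of any class differs by at most one between folds, so a class with at least two samples can never end up entirely in a single test fold, and k no longer needs to be inflated for imbalanced datasets.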

Check out <root>/packages/nlu-engine/src/ml/svm/libsvm/kfold/readme.md to learn more about this.

I also took the opportunity to clean up the kfold folder/library so that my future self will understand this part of the code. It took me a whole day just to get back into the problem and understand what I had previously understood a year earlier. While I was at it, I decided to make it very clear with proper documentation. Even though this part is simple, it can add a lot of seconds or GB of RAM to a training.

Performance

On clinc150 using local lang server with dimension 100:

| branch | memory used (MB) | time to train (s) |
| ------ | ---------------- | ----------------- |
| master | ~800             | 101               |
| this   | ~700             | 205               |

On John Doe* using remote lang server https://lang-01.botpress.io

| branch | memory used (GB) | time to train (min) |
| ------ | ---------------- | ------------------- |
| master | ~40              | 20                  |
| this   | ~1.3             | 14                  |