botpress / nlu

This repo contains all ML/NLU-related code written by Botpress for the Node.js environment. This includes the Botpress Standalone NLU Server.

fix(nlu-engine): only few languages are space separated #114

Closed franklevasseur closed 2 years ago

franklevasseur commented 2 years ago

Alright, this PR fixes a bug where:

  1. The bot had very few spaces in its train set
  2. The train set was considered non-space-separated by the algorithm responsible for generating none utterances
  3. Only one utterance had more than 3 tokens
  4. The KFold algorithm threw an error because it could only make train sets with one class
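To illustrate the last step of that chain, here is a minimal sketch (not the nlu-engine code) of how a class with a single example leaves a k-fold train set with only one class. The `Sample` type, `kFoldTrainSets` and `countClasses` are hypothetical names made up for this example, and the round-robin split is an assumption; the real KFold implementation may split differently.

```typescript
// Hypothetical labelled sample; not a type from the Botpress code base.
interface Sample {
  text: string
  label: string
}

// Assumed round-robin split: sample i belongs to fold (i % k).
// Returns, for each fold, the train set made of every sample outside that fold.
function kFoldTrainSets(samples: Sample[], k: number): Sample[][] {
  const trainSets: Sample[][] = []
  for (let fold = 0; fold < k; fold++) {
    trainSets.push(samples.filter((_, i) => i % k !== fold))
  }
  return trainSets
}

// Counts the distinct labels present in a set of samples.
function countClasses(samples: Sample[]): number {
  const seen: { [label: string]: boolean } = {}
  for (const s of samples) {
    seen[s.label] = true
  }
  return Object.keys(seen).length
}

// Four in-scope utterances and a single "none" utterance (made-up data).
const samples: Sample[] = [
  { text: 'book a flight', label: 'in-scope' },
  { text: 'cancel my flight', label: 'in-scope' },
  { text: 'change my seat', label: 'in-scope' },
  { text: 'where is my bag', label: 'in-scope' },
  { text: 'lorem ipsum dolor sit', label: 'none' }
]

// The fold that holds out the only "none" sample trains on a single class.
const classesPerFold = kFoldTrainSets(samples, 5).map(countClasses)
console.log(classesPerFold) // → [2, 2, 2, 2, 1]
```

A classifier trained on the last fold would only ever see one class, which is the kind of train set the KFold step could not work with.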

Now, only the following languages are considered not space-separated:

This list comes from this Linguistics Stack Exchange thread.
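A rough sketch of what a language-based check could look like, as opposed to guessing from how many spaces the train set contains. `isSpaceSeparated` and `exampleList` are hypothetical names, not the nlu-engine API, and the codes in `exampleList` are illustrative placeholders only, not the list used by this PR.

```typescript
// Returns true unless the primary language subtag is in the given list.
// The list is passed in on purpose: this sketch does not assume which languages it contains.
function isSpaceSeparated(languageCode: string, nonSpaceSeparated: ReadonlyArray<string>): boolean {
  // Keep only the primary subtag, so that 'zh-CN' and 'zh_TW' are both treated as 'zh'.
  const primary = languageCode.toLowerCase().split(/[-_]/)[0] || ''
  return nonSpaceSeparated.indexOf(primary) === -1
}

// Illustrative placeholder list (assumption), NOT the list from this PR.
const exampleList: ReadonlyArray<string> = ['ja', 'th', 'zh']

console.log(isSpaceSeparated('en', exampleList)) // → true
console.log(isSpaceSeparated('zh-CN', exampleList)) // → false
```

With a check like this, every language outside the list is treated as space-separated, no matter how few spaces a particular train set happens to contain.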

This fix does not prevent the bug from ever happening again, but it makes it even less likely (it was already very unlucky).

Also, this fix preserves the exact same behavior for almost all current bots.

EFF commented 2 years ago

LGTM !