Utterance with small length triggered with high score even if its words never in the intent (false positive)

axa-group / nlp.js

An NLP library for building bots, with entity extraction, sentiment analysis, automatic language identification, and more

MIT License


Utterance with small length triggered with high score even if its words never in the intent (false positive) #286

Closed mmayla closed 4 years ago

mmayla commented 5 years ago

Describe the bug: I trained the model on a large set of intents (11,000+) in Arabic. Everything works well, except that the model produces false positives at a high rate, and with high confidence scores.

Utterances that contain no words from any intent in the list are matched with a high score, especially short utterances.

Note: I made sure that useNoneFeature is true (I left it at the default).

Desktop (please complete the following information):

OS: Manjaro Linux
Version 3.9.0

jesus-seijas-sp commented 5 years ago

Hello! By default, useNoneFeature is false for Arabic... https://github.com/axa-group/nlp.js/blob/master/lib/nlp/nlp-util.js#L428

You can activate it by putting

NlpUtil.useNoneFeature.ar = true

at the beginning of your code.

It is deactivated because I don't have a good, large dataset in Arabic to test with... so I didn't feel confident enough.
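One way to picture why this setting matters for short utterances made of unseen words is the toy sketch below. It assumes a simple bag-of-words model and is not how nlp.js is implemented internally; `NONE_FEATURE`, `buildVocabulary`, and `featurize` are hypothetical names used only for illustration.

```javascript
// Toy illustration only: NOT the nlp.js implementation.
// NONE_FEATURE, buildVocabulary and featurize are hypothetical names.
const NONE_FEATURE = 'nonefeature';

function tokenize(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

// Collect every token seen in the training utterances.
function buildVocabulary(utterances) {
  const vocabulary = new Set();
  utterances.forEach((u) => tokenize(u).forEach((t) => vocabulary.add(t)));
  return vocabulary;
}

// Known tokens become features. Unknown tokens are either dropped or,
// when useNoneFeature is on, counted under a single catch-all feature.
function featurize(text, vocabulary, useNoneFeature) {
  const features = {};
  for (const token of tokenize(text)) {
    if (vocabulary.has(token)) {
      features[token] = 1;
    } else if (useNoneFeature) {
      features[NONE_FEATURE] = 1;
    }
  }
  return features;
}

const vocabulary = buildVocabulary(['book a flight', 'cancel my flight']);
console.log(featurize('hello there', vocabulary, false)); // {}
console.log(featurize('hello there', vocabulary, true)); // { nonefeature: 1 }
```

With the flag off, an utterance of unknown words leaves the classifier with no evidence at all; with it on, the unknown words at least produce a signal that can be associated with the None intent.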

mmayla commented 5 years ago

@jesus-seijas-sp Thank you for this, you helped me a lot :smiley:

Do you have a format for the dataset needed for testing? I may be able to help you with that

jesus-seijas-sp commented 5 years ago

Hello! Here is an example: https://github.com/axa-group/nlp.js/tree/master/examples/benchmark

The JSON contains the intents and, for each intent, the utterances to train with and the utterances to test with. This corpus is an example in English; with nlp.js the accuracy is >98%. In other providers... well... better check ;)

The None intent is special: it has no data to train on, only data to test with, and it is the place to put sentences that could generate a false positive.
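As a rough sketch of how such a corpus could be used, the snippet below builds a tiny corpus and measures accuracy on the test utterances, listing None sentences that were wrongly matched. The field names (`data`, `intent`, `utterances`, `tests`) are an assumption about the file layout, so check the linked example for the real ones; `evaluate` and `keywordClassifier` are hypothetical helpers, with the latter standing in for a trained model.

```javascript
// Assumed corpus shape, for illustration only; verify against the linked example.
const corpus = {
  name: 'Toy corpus',
  locale: 'en-US',
  data: [
    { intent: 'greet', utterances: ['hello', 'good morning'], tests: ['hello there'] },
    { intent: 'bye', utterances: ['goodbye', 'see you'], tests: ['see you later'] },
    // None has nothing to train on, only sentences that should not match.
    { intent: 'None', utterances: [], tests: ['what is the weather'] },
  ],
};

// classify is a hypothetical stand-in for a trained model: text -> intent name.
function evaluate(corpusToTest, classify) {
  let total = 0;
  let correct = 0;
  const falsePositives = [];
  for (const { intent, tests } of corpusToTest.data) {
    for (const text of tests) {
      total += 1;
      const predicted = classify(text);
      if (predicted === intent) {
        correct += 1;
      } else if (intent === 'None') {
        // A None sentence that was matched to a real intent.
        falsePositives.push({ text, predicted });
      }
    }
  }
  return { accuracy: total === 0 ? 0 : correct / total, falsePositives };
}

// Stub classifier used only to exercise evaluate().
function keywordClassifier(text) {
  if (/hello|morning/.test(text)) return 'greet';
  if (/bye|see you/.test(text)) return 'bye';
  return 'None';
}

const report = evaluate(corpus, keywordClassifier);
console.log(report.accuracy, report.falsePositives);
```

Tracking the None failures separately makes the false-positive rate visible even when overall accuracy looks high.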

mmayla commented 5 years ago

@jesus-seijas-sp I have been using... other providers and nlp.js for a while now and we settled on nlp.js... I know how much nlp.js rocks, haha ;)

Most of my projects are in Arabic, and we now use nlp.js by default in every project, so I am interested in improving the Arabic support in nlp.js. I will create a corpus in Standard Arabic like the one you provided. Once I finish it, what should I do? Should I add it to examples/benchmark and create a pull request, or do you suggest another way?

Also, as you may know, people use more than 25 dialects of Arabic (Standard Arabic, Egyptian Arabic, Gulf Arabic, Tunisian Arabic, Levantine Arabic, ...). Standard Arabic is the least used in the real world, but it is mostly the standard language that everyone can understand. Egyptian Arabic and Gulf Arabic are the most used, especially digitally.

What do you suggest for adding such dialects to nlp.js?

Thanks for the help :smiley:

jesus-seijas-sp commented 5 years ago

Hello @MMayla, for each language three things should be implemented:

Tokenizer: since the writing rules are the same, the Arabic tokenizer should be enough.
Stemmer: how to calculate the stem (root) of words. This is the difficult part, but for inflected languages (those that work by adding morphemes, such as suffixes) it is not so difficult. One week ago I uploaded an auto-stemmer that is able to learn how languages inflect; tested with Polish, the improvement is huge (accuracy went from 0.53 up to 0.83). Still, it is better to write the stemming rules for each language.
Builtin entities: here we use Duckling, which already supports lots of languages, but if your language is not supported you can write the Haskell rules. As the Arabic rules already exist, you have a good starting point.
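
As a rough picture of the first two pieces, here is a minimal sketch assuming a punctuation-splitting tokenizer and a naive suffix-stripping stemmer. It is not the nlp.js Arabic tokenizer or stemmer, nor the auto-stemmer mentioned above; the suffix list is a hypothetical English-only example, and real stemming rules for a language are considerably more involved.

```javascript
// Hypothetical, English-only suffix list, for illustration only.
const SUFFIXES = ['ing', 'ed', 'es', 's'];
const MIN_STEM_LENGTH = 3;

// Split on anything that is not a letter, combining mark, or digit,
// so the same function also handles non-Latin scripts.
function tokenize(text) {
  return text.toLowerCase().split(/[^\p{L}\p{M}\p{N}]+/u).filter(Boolean);
}

// Strip the first matching suffix, as long as enough of the word remains.
function stem(word) {
  for (const suffix of SUFFIXES) {
    if (word.endsWith(suffix) && word.length - suffix.length >= MIN_STEM_LENGTH) {
      return word.slice(0, -suffix.length);
    }
  }
  return word;
}

console.log(tokenize('Booking flights, booked!').map(stem));
// [ 'book', 'flight', 'book' ]
```

Mapping inflected forms to one stem lets the classifier treat "booking" and "booked" as the same feature, which is why stemmer quality has such a visible effect on accuracy.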

jesus-seijas-sp commented 4 years ago

Closing as it was solved