axa-group / nlp.js

An NLP library for building bots, with entity extraction, sentiment analysis, automatic language identification, and more
MIT License
6.27k stars, 620 forks

Short utterances are matched with a high score even when none of their words appear in the intent (false positive) #286

Closed mmayla closed 4 years ago

mmayla commented 5 years ago

Describe the bug: I trained the model on a very large set of intents (11,000+) in Arabic. Everything works well, except that the model produces false positives at a high rate, and with high confidence scores.

Utterances that share no words with any intent in the entire list are matched with a high score, especially short utterances.

Note: I made sure that useNoneFeature is true (I left it at the default).

jesus-seijas-sp commented 5 years ago

Hello! By default, useNoneFeature is false for Arabic... https://github.com/axa-group/nlp.js/blob/master/lib/nlp/nlp-util.js#L428

You can activate it by putting

NlpUtil.useNoneFeature.ar = true

at the beginning of your code.

It is deactivated because I don't have a good, large dataset in Arabic to test with... so I didn't feel confident enough.
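For readers wondering what this flag changes, here is a toy sketch of the general idea behind a "none feature". It is an illustration written for this thread, not the nlp.js implementation: `tokenize`, `buildVocabulary`, `featurize`, and `NONE_FEATURE` are hypothetical names, and the sketch assumes that tokens outside the training vocabulary are mapped to one shared feature that a classifier could learn to associate with the None intent.

```javascript
// Toy illustration only: NOT the nlp.js implementation.
// All names below are hypothetical and exist only for this sketch.
const NONE_FEATURE = '__none__';

// Naive whitespace tokenizer (a real pipeline would also normalize and stem).
function tokenize(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

// Collect every token seen in the training utterances.
function buildVocabulary(utterances) {
  const vocab = new Set();
  for (const utterance of utterances) {
    for (const token of tokenize(utterance)) vocab.add(token);
  }
  return vocab;
}

// Turn an utterance into a feature map. When useNoneFeature is on, any token
// missing from the vocabulary switches on one shared "none" feature.
function featurize(text, vocab, useNoneFeature) {
  const features = {};
  for (const token of tokenize(text)) {
    if (vocab.has(token)) features[token] = 1;
    else if (useNoneFeature) features[NONE_FEATURE] = 1;
  }
  return features;
}

const vocab = buildVocabulary(['book a flight', 'cancel my flight']);
const withoutNone = featurize('hello there', vocab, false);
const withNone = featurize('hello there', vocab, true);

console.log(withoutNone); // {}
console.log(withNone); // { __none__: 1 }
```

In this sketch, an utterance made entirely of unseen words yields an empty feature map when the flag is off, so a classifier would have no evidence for preferring None over a trained intent. That is one plausible reason short, unrelated utterances can end up with a high score.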

mmayla commented 5 years ago

@jesus-seijas-sp Thank you for this, you helped me a lot :smiley:

Do you have a format for the dataset needed for testing? I may be able to help you with that.

jesus-seijas-sp commented 5 years ago

Hello! Here is an example: https://github.com/axa-group/nlp.js/tree/master/examples/benchmark

The JSON contains the intents and, for each intent, the utterances to train on and the utterances to test with. This corpus is an example in English. With nlp.js the accuracy is >98%; with other providers... well... better check ;)

The None intent is special: it has no training data, only test data, and it is the place to put sentences that could generate a false positive.
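As a rough sketch of how such a corpus could be scored, the snippet below counts overall accuracy and the number of None test sentences that were wrongly assigned to a trained intent. The field names (`data`, `intent`, `utterances`, `tests`), the tiny corpus, and the word-overlap `classify` function are all assumptions made for this illustration; check the linked benchmark example for the actual format, and replace `classify` with a call into a trained model.

```javascript
// Hypothetical miniature corpus; field names are assumptions for this sketch.
const corpus = {
  locale: 'en',
  data: [
    { intent: 'greet', utterances: ['hello', 'good morning'], tests: ['hello there'] },
    { intent: 'bye', utterances: ['goodbye', 'see you later'], tests: ['goodbye for now'] },
    // None has nothing to train on, only sentences that must NOT match an intent.
    { intent: 'None', utterances: [], tests: ['the weather is nice', 'good grief'] },
  ],
};

// Stand-in classifier based on word overlap, used only to make the sketch
// runnable. It returns None when no training utterance shares a word.
function classify(corpusData, text) {
  const words = new Set(text.toLowerCase().split(/\s+/));
  let best = { intent: 'None', score: 0 };
  for (const { intent, utterances } of corpusData.data) {
    for (const utterance of utterances) {
      const tokens = utterance.toLowerCase().split(/\s+/);
      const overlap = tokens.filter((t) => words.has(t)).length / tokens.length;
      if (overlap > best.score) best = { intent, score: overlap };
    }
  }
  return best;
}

// Run every test sentence and tally the results. A false positive here is a
// None test sentence that the classifier assigned to some trained intent.
function evaluate(corpusData, classifyFn) {
  let total = 0;
  let correct = 0;
  let falsePositives = 0;
  for (const { intent, tests } of corpusData.data) {
    for (const sentence of tests) {
      const predicted = classifyFn(corpusData, sentence).intent;
      total += 1;
      if (predicted === intent) correct += 1;
      else if (intent === 'None') falsePositives += 1;
    }
  }
  return { total, correct, falsePositives, accuracy: correct / total };
}

const report = evaluate(corpus, classify);
console.log(report);
```

With this toy data, "good grief" shares the word "good" with a greet utterance, so the stand-in classifier reports it as greet and it is counted as a false positive. That is the kind of sentence the None test set is meant to hold.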

mmayla commented 5 years ago

@jesus-seijas-sp I have been using... other providers and nlp.js for a while now, and we settled on nlp.js... I know how much nlp.js rocks haha ;)

Most of my projects are in Arabic, and we now use nlp.js by default in every project, so I am interested in improving nlp.js Arabic support. I will create a corpus like the one you provided, in Standard Arabic. After I finish it, what should I do? Should I add it to examples/benchmark and create a pull request, or do you suggest another way?

Also, as you may know, there are more than 25 dialects of Arabic in use (Standard Arabic, Egyptian Arabic, Gulf Arabic, Tunisian Arabic, Levantine Arabic, ...). Standard Arabic is the least used in the real world, but it is the standard language that almost everyone can understand. Egyptian Arabic and Gulf Arabic are the most used ones, especially digitally.

What do you suggest for adding such dialects to nlp.js?

Thanks for the help :smiley:

jesus-seijas-sp commented 5 years ago

Hello @MMayla, for each language three things should be implemented:

jesus-seijas-sp commented 4 years ago

Closing as it was solved