axa-group / nlp.js

An NLP library for building bots, with entity extraction, sentiment analysis, automatic language identification, and more
MIT License

Utterance Subset mega-bug #388

Closed qwertyuu closed 4 years ago

qwertyuu commented 4 years ago

Hello! We really like working with this library. Thanks for building it, it is very useful and nice to use!

Describe the bug

When you specify an entity whose text is an exact substring of another entity's text (see the example below), the "optionalUtterance" string contains both entities, which breaks intent detection.

To Reproduce

const { NlpManager } = require("node-nlp");

const manager = new NlpManager({ languages: ["en"] });
manager.addNamedEntityText(
    "hero",
    "spiderman",
    ["en"],
    ["Spiderman"],
);
manager.addNamedEntityText(
    "test",
    "spiderman",
    ["en"],
    ["I am Spiderman"],
);
manager.addDocument("en", "%test%", "greetings.hello");
// Train the model, then process the test utterance.
(async () => {
    await manager.train();
    const response = await manager.process("en", "I am Spiderman");
    console.log(response);
})();

Output: the intent is None with a score of 1, and optionalUtterance is '%test%%hero%'.

Expected behavior

Since "I am Spiderman" is, verbatim, the only way to trigger the test entity, I think greetings.hello should come out with a score of 1 (which is what happens if I delete the hero entity and run the file again).

Possible solutions: If an entity matches and its text contains another entity's text, do NOT also match the contained entity. The optionalUtterance should be '%test%' here.


jesus-seijas-sp commented 4 years ago

Hello! Bug confirmed. I will look into how to fix it. I see several ways:

I'm more comfortable with the first solution. Opinions?

qwertyuu commented 4 years ago

@jesus-seijas-sp Nice, thanks for looking into this issue so quickly.

Actually, we use the entities for a domain-specific goal. I will explain in more detail:

Let's say I want to distinguish between "good" and "very good" because I want to classify users' feedback as "mediocre", "bad", "good", or "excellent". I think "good" would classify as "good" and "very good" would classify as "excellent".

In this case, I think the first solution is logically the best. When I match "very good", I want to see only the "very good" entity in the output, not the "good" one. Maybe this is just a new type of entity (much like trim entities, enum entities, or regex entities): a fallback entity with multiple levels of specificity, where "very good" is 100% specific and "good" is 50% specific.

Or this could be an option, to turn on when you want overlapping entities to discard less specific ones in the output.

jesus-seijas-sp commented 4 years ago

I updated the edge-reduction code so that when two edges (potential entities) are both enum entities and the text of one is contained in the text of the other, the smaller one is discarded. I will publish a version tonight.
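
For readers following along, here is a minimal, self-contained sketch of that rule. It is not the library's actual code: the edge shape ({ entity, type, sourceText }) and the function name discardContainedEnumEdges are assumptions made up for illustration, and the containment check is a plain substring test.

```javascript
// Hypothetical edge shape (an assumption, not the library's real structure):
//   { entity: string, type: "enum" | "regex" | ..., sourceText: string }

// Drop every enum edge whose text is strictly contained in the text of
// another, longer enum edge. Non-enum edges are always kept.
function discardContainedEnumEdges(edges) {
  return edges.filter((edge) => {
    if (edge.type !== "enum") {
      return true;
    }
    const isContained = edges.some(
      (other) =>
        other !== edge &&
        other.type === "enum" &&
        other.sourceText.length > edge.sourceText.length &&
        other.sourceText.includes(edge.sourceText)
    );
    return !isContained;
  });
}

// The two candidate edges from the reproduction above.
const edges = [
  { entity: "test", type: "enum", sourceText: "I am Spiderman" },
  { entity: "hero", type: "enum", sourceText: "Spiderman" },
];

const kept = discardContainedEnumEdges(edges);
console.log(kept.map((e) => e.entity)); // [ 'test' ]
```

With only the "test" edge left, the utterance from the report would reduce to '%test%' rather than '%test%%hero%', which is the behavior requested above. Edges of equal length are both kept in this sketch, since neither is strictly smaller than the other.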

jesus-seijas-sp commented 4 years ago

Hello. Published as version 4.1.0

qwertyuu commented 4 years ago

@jesus-seijas-sp Thanks a whole lot. Just updated