Utterance Subset mega-bug

qwertyuu commented 4 years ago

Hello! We really like working with this library. Thanks for building it, it is very useful and nice to use!

Describe the bug

When you define an entity whose text is an exact substring of another entity's text (see the example below), the "optionalUtterance" string contains both entities, which breaks intent detection.

To Reproduce

const { NlpManager } = require("node-nlp");

const manager = new NlpManager({ languages: ["en"] });
manager.addNamedEntityText(
    "hero",
    "spiderman",
    ["en"],
    ["Spiderman"],
);
manager.addNamedEntityText(
    "test",
    "spiderman",
    ["en"],
    ["I am Spiderman"],
);
manager.addDocument("en", "%test%", "greetings.hello");
// Train the model, then process a test utterance.
(async () => {
    await manager.train();
    const response = await manager.process("en", "I am Spiderman");
    console.log(response);
})();

The output intent is None with a score of 1, and optionalUtterance is '%test%%hero%'.

Expected behavior

Since "I am Spiderman" is verbatim the only way to trigger the test entity, I would expect greetings.hello to come out with a score of 1 (which is what happens if I delete the hero entity and run the file again).

Possible solutions

If an entity that contains another one matches, do not match the contained substrings. The optionalUtterance should be '%test%' here.

Desktop (please complete the following information):

OS: Mac
Browser: NodeJS
Version: 4

jesus-seijas-sp commented 4 years ago

Hello! Bug confirmed. I will take a look at how to fix it. I see several ways:

1. This happens because the entities overlap, so when entities overlap, keep the largest one among those with an equal match. That way the generated optional utterance will be %test%. Unfortunately, this means that "I am Spiderman" will not return a %hero% entity. I don't know if this is the expected behaviour; to be honest, I have never had a set of overlapping entities.
2. Instead of generating only one optional utterance, generate all the possible ones. The problem is that if a sentence contains a lot of entities, that can produce too many optional utterances...

I'm more comfortable with the first solution. Opinions?
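To make the cost of the second option concrete, here is a minimal sketch that enumerates every optional utterance for a set of candidate entity matches. It is not the nlp.js implementation: the function name `allOptionalUtterances` and the match shape `{ entity, start, end }` (with an inclusive `end` index) are assumptions made for illustration only.

```javascript
// Illustrative sketch only; not taken from nlp.js.
// Hypothetical match shape: { entity, start, end }, where end is inclusive.
function allOptionalUtterances(utterance, matches) {
  // Process matches left to right; for equal starts, try the longer one first.
  const sorted = [...matches].sort(
    (a, b) => a.start - b.start || b.end - a.end
  );
  const results = new Set();

  function walk(index, cursor, built) {
    if (index === sorted.length) {
      // Append whatever plain text remains after the last replacement.
      results.add(built + utterance.slice(cursor));
      return;
    }
    const match = sorted[index];
    // Choice 1: leave this match as plain text.
    walk(index + 1, cursor, built);
    // Choice 2: replace it, unless it overlaps text that was already replaced.
    if (match.start >= cursor) {
      const prefix = utterance.slice(cursor, match.start);
      walk(index + 1, match.end + 1, `${built}${prefix}%${match.entity}%`);
    }
  }

  walk(0, 0, "");
  return [...results];
}

const variants = allOptionalUtterances("I am Spiderman", [
  { entity: "hero", start: 5, end: 13 },
  { entity: "test", start: 0, end: 13 },
]);
console.log(variants); // [ 'I am Spiderman', 'I am %hero%', '%test%' ]
```

With n non-overlapping matches, each one can be either replaced or left alone, so this produces 2^n variants, which is the growth the second option runs into.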

qwertyuu commented 4 years ago

@jesus-seijas-sp Nice, thanks for looking into this issue so quickly.

Actually, we use the entities for a specific domain goal. I will explain in more detail:

Let's say I want to distinguish between "good" and "very good" because I want to classify users' feedback as "mediocre", "bad", "good", or "excellent". I think "good" would classify as "good" and "very good" would classify as "excellent".

In this case, I think the first solution is logically the best. When I match "very good", I want to see only the "very good" entity in the output, not the "good" one. Maybe this is just a new type of entity (much like trim entities, enum entities, or regex entities): a fallback entity with multiple levels of specificity (100% specific = "very good", 50% specific = "good").

Or this could be an option to turn on when you want overlapping entities to discard the less specific ones in the output.

jesus-seijas-sp commented 4 years ago

I updated the code that reduces edges so that when two edges (potential entities) are both enum entities and the text of one edge is contained in the text of the other, the smaller one is discarded. I will publish a version tonight.
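The rule described above can be sketched roughly as follows. This is an illustration, not the code that shipped in nlp.js: the function name `discardContainedEnumEdges` and the edge shape `{ entity, type, start, end, sourceText }` are assumptions, and containment is checked on character spans here, whereas the comment only says that one edge's text is contained in the other's.

```javascript
// Illustrative sketch only; not the actual nlp.js edge-reduction code.
// Hypothetical edge shape: { entity, type, start, end, sourceText },
// where start and end are inclusive character indices in the utterance.
function discardContainedEnumEdges(edges) {
  return edges.filter((edge, i) => {
    // Only enum edges are candidates for being discarded.
    if (edge.type !== "enum") return true;
    // Drop this edge if a longer enum edge fully covers its span.
    const isContained = edges.some(
      (other, j) =>
        j !== i &&
        other.type === "enum" &&
        other.sourceText.length > edge.sourceText.length &&
        other.start <= edge.start &&
        other.end >= edge.end
    );
    return !isContained;
  });
}

const edges = [
  { entity: "hero", type: "enum", start: 5, end: 13, sourceText: "Spiderman" },
  { entity: "test", type: "enum", start: 0, end: 13, sourceText: "I am Spiderman" },
];
const kept = discardContainedEnumEdges(edges);
console.log(kept.map((e) => e.entity)); // [ 'test' ]
```

With only the `test` edge left, the optional utterance for "I am Spiderman" would be '%test%', which matches the behaviour requested in the original report.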

jesus-seijas-sp commented 4 years ago

Hello. Published as version 4.1.0

qwertyuu commented 4 years ago

@jesus-seijas-sp Thanks a whole lot. Just updated

axa-group / nlp.js

Utterance Subset mega-bug #388