axa-group / nlp.js

An NLP library for building bots, with entity extraction, sentiment analysis, automatic language identify, and so more
MIT License
6.22k stars 616 forks source link

Entity name skews intent detection results #370

Closed qwertyuu closed 4 years ago

qwertyuu commented 4 years ago

Thanks for the nice library Jesús, very nice work again.

Describe the bug

If you write the name of an entity in the utterance, you actually score almost perfect score in an intent that uses it in its source utterances.

To Reproduce Example file to reproduce. I have commented near each scenario what is expected when the result is bad.

import { NlpManager } from "node-nlp";

const locale = "fr";
const entityText = "des bizarre de problemes";
const entityName = "event";
const nlpManager = new NlpManager({ languages: [locale] });

nlpManager.addNamedEntityText(
    entityName,
    "problems",
    [locale],
    [entityText],
);
nlpManager.addDocument(locale, `J'ai ${entityText}`, "problems");

nlpManager.train();

nlpManager.process(locale, entityName)
    .then((result) => console.log(result));
// { label: 'problems', value: 0.8002170705300266 }
// VERY BAD. This should absolutely give us a 'None' with score 1.

nlpManager.process(locale, `J'ai ${entityName}`)
    .then((result) => console.log(result));
// { label: 'problems', value: 1 }
// VERY BAD. This should absolutely give us a 'None' with score near to 1.

nlpManager.process(locale, `J'ai ${entityText}`)
    .then((result) => console.log(result));
// { label: 'problems', value: 1 }
// Exactly what is expected. Perfect.

nlpManager.process(locale, entityText)
    .then((result) => console.log(result));
// { label: 'problems', value: 0.8002170705300266 },
// Exactly what is expected. Perfect.

nlpManager.process(locale, "J'ai hello")
    .then((result) => console.log(result));
// { label: 'problems', value: 0.9083365028763193 },
// Not sure about this one. It might be biased by the fact there is only one intent in the whole nlpmanager.

We can also see, in the resulting model.nlp, that the stemDict is not sane at all:

        "stemDict": {
          "ai,event,j": {
            "domain": "default",
            "intent": "problems"
          },
          "ai,ai,bizarr,de,de,event,j,j,problem": {
            "domain": "default",
            "intent": "problems"
          }
        }

the event stem should not really be there at all. In no way I want 'event' to trigger any stemming. 'event' also comes around in many other places in the resulting model.nlp where it does not belong (outside of the namedEntities zone)

Expected behavior

I think one part of the expected behavior is that the stemDict should contain something like:

          "ai,bizarr,de,de,j,problem": {
            "domain": "default",
            "intent": "problems"
          }

event should appear nowhere but in the "namedEntities" zone of the model. Only examples with actual source utterance entity text should be used to generate features and stems.

Desktop (please complete the following information):

Additional context This causes many high level problems, because in the process mechanism, nlp.js calls generateEntityUtterance which adds placeholder text for entity names in the optionalUtterance const. This then triggers some unwanted intents since the generated utterance now contains the entity name, which triggers this bug above and wrecks the outcome totally in some cases.

etiennellipse commented 4 years ago

@qwertyuu I ran your case against 4.0.0-rc19. The results are better.

const nlp = require('node-nlp');

const locale = "fr";
const entityText = "des bizarre de problemes";
const entityName = "event";
const nlpManager = new nlp.NlpManager({ languages: [locale] });

nlpManager.addNamedEntityText(
    entityName,
    "problems",
    [locale],
    [entityText],
);
nlpManager.addDocument(locale, `J'ai ${entityText}`, "problems");

nlpManager.train().then(() => {
    nlpManager.save('reproduce-model.nlp', false);

    nlpManager.process(locale, entityName)
        .then((result) => console.log(result));
    // { intent: 'None', score: 1 }
    // Good 👍 

    nlpManager.process(locale, `J'ai ${entityName}`)
        .then((result) => console.log(result));
    //    [ { intent: 'problems', score: 0.6354737923397036 },{ intent: 'None', score: 0.3645262076602964 } ]
    // This looks okay, since the corpus is very small

    nlpManager.process(locale, `J'ai ${entityText}`)
        .then((result) => console.log(result));
    // { intent: 'problems', score: 1 }
    // Exactly what is expected. Perfect. 👍 

    nlpManager.process(locale, entityText)
        .then((result) => console.log(result));
    // { intent: 'None', score: 1 }, 
    // Did not match the intent at all, but matched the entity correctly. This seems acceptable, although maybe we would want to have the intents containing this entity get a certain weight.

    nlpManager.process(locale, "J'ai hello")
        .then((result) => console.log(result));
    // [ { intent: 'problems', score: 0.6354737923397036 }, { intent: 'None', score: 0.3645262076602964 } ]
    // Looks okay
});

Also, the stemDict found in the model is sane:

"stemDict": {
          "ai,bizarr,de,j,problem": {
            "intent": "problems",
            "domain": "default"
          }

@jesus-seijas-sp Was there a fix in between 3.10 and 4.0 that explains this?

jesus-seijas-sp commented 4 years ago

Hello, is not that between 3.10 and 4.0 is there a fix, is more that 4.0 was totally rebuilt, so many things changes. Right now seems to be stable and released as 4.0.1, right now I'm working on the documentation and examples, you can find here a QuickStart guide of the 4.x: https://github.com/axa-group/nlp.js/blob/master/docs/v4/quickstart.md The QuickStart is more a "tutorial" of the different features, and includes links to the code of every example. Tomorrow I will write a guide but about how to compile it for web and react native (the bundle size was reduced from 3MB at version 3.x down to 125KB at version 4.x). After this guide I will revise the issues, this one included.

qwertyuu commented 4 years ago

I confirm that in v4 this problem is pretty much inexistant. I think (I have not tested it) there might still be a vulnerability if a user purposely types the full entity name in the utterance as-is (ex.: %event%) but this is a little far-fetched and can be either ignored or sanitized in some way.