Performance Issue while extracting entities

Devanshu15 commented 3 years ago

Summary

We started using node-nlp with version 3, later realized that because of huge amount of data in production its taking time in extracting the entities. Given a string of 7 different entities it returned result in almost 20 seconds, this becomes a blocker for us as it is also blocking the other requests on the server. So we have to turn off the NLP feature on PROD. And we started looking into the problem. It turns out that length of the utterance is directly proportional to time taken by NlpManager to process and time taken is also directly proportional to the data we have in NLP model file.

Below are the some numbers we get while debugging: nlp> 01211 80003 00001 90419 90797 99746 50162 info: [nlp] language 0 info: [nlp] extractEntities 23679 info: [nlp] nonOverlapping 23679

Above utterance is nothing but a list of user id's.

After going through most of the performance related issues and debugging we decided to move to the latest version. So now we are using the latest 4.16.0 version and it is fast but we are still not very sure of deploying it on prod. The latest test results are as below:

nlp> 01211 80003 00001 90419 90797 99746 50162 info: [nlp] language 0 info: [nlp] extractEntities 1733 info: [nlp] nonOverlapping 1734

Our .nlp model file has almost 113K different entities. we are adding 50 intents, mostly having a length of 5 -6 words with entities substitution and we are using single language as of now.

Below is the code we are using to add entities/intents.

async trainModel(namedEntitiesFactories) {
    this.logger.info('[nlp] training model');
    for (let entitiesFactory of namedEntitiesFactories) {
      const entities = await entitiesFactory();
      for (let entity of entities) {
        this.logger.info(
          `[nlp] adding entity: ${entity.type}, ${entity.option}, ${entity.texts}`,
        );
        this.nlpManager.addNamedEntityText(
          entity.type,
          entity.option,
          entity.languages,
          entity.texts,
        );
      }
    }
    this.nlpManager.addRegexEntity('freeText', 'en', /"(.*?)"/);

    const domains = new Set();
    for (let intent of this.intents) {
      if (!domains.has(intent.domain)) {
        domains.add(intent.domain);
        this.logger.info(`[nlp] assigning domain: ${intent.name}, '${intent.domain}'`);
        this.nlpManager.assignDomain(intent.language, intent.name, intent.domain);
      }
      this.logger.info(`[nlp] loading document: ${intent.name}, '${intent.utterance}'`);
      this.nlpManager.addDocument(intent.language, intent.utterance, intent.name);
    }
    return this.nlpManager.train();
  }

and here how we are extracting entities/intents.

    const language = this.nlpManager.guessLanguage(utterance);
    const allEntities = await this.nlpManager.extractEntities(language, utterance);
    let entities = this.nonOverlappingEntities([...allEntities.entities].reverse());
    entities = this.normalizeEntities(entities);
    let fullText = this.extractFullText(language, utterance, entities);

    let allIntents = [];
    let intents = [];
    if (fullText) {
      allIntents = await this.nlpManager.classify(language, utterance);
      intents = this.likelyIntents([...allIntents.classifications]);
      if (intents.length) {
        fullText = '';
      }
    }

Request you to please help us validate our implementation, as in, is our approach to add entities/intents and process the utterances correct?
What could be the other ways to improve performance?

Please let me know if you need more information.

Environment

Software	Version
`nlp.js`	4.16.0
`node`	14.4.0
`npm`	6.14.5

jesus-seijas-sp commented 3 years ago

Hello,

There is a Huge NER example with all the airports of the world, resolving the entities in milliseconds. https://github.com/axa-group/nlp.js/blob/master/examples/06-huge-ner/index.js

This was done based on this issue: https://github.com/axa-group/nlp.js/issues/337

The secret: https://github.com/axa-group/nlp.js/blob/master/examples/06-huge-ner/conf.json#L18

NER accepts a threshold, by default is 0.8, to allow users to make mistakes when writting, so "Bracelona" will be understood as "Barcelona". But this implementation uses levenshtein distance, and this makes the time exponential based on the amount of entities and length of the sentence.

If the threshold is 1, then it uses a dictionary and not levenshtein, and even 1 million of entity values is resolved in milliseconds.

On the other hand, by default the golden entities are resolved using Microsoft Recognizers, that uses a lot of slow regular expressions.

So, as I understand from your issue that you use the old way (NlpManager):

  const manager = new NlpManager({ ner: { builtins: [], threshold: 1 });

With the empty array on builtins, it will not use the golden entities. With the threshold 1, it will not use levenshtein distance.

And if you want to extract numbers... there is a builtin-default and builtin-compromise options, that are faster than the builtin-microsoft. Or if you have a duckling server, you have the builtin-duckling.

Devanshu15 commented 3 years ago

Thanks Jesus for your prompt response. I understand what you are saying but we don't want to make the threshold 1 as we have other use cases as well, where we expect users to make mistakes in typing names etc. We also don't have a duckling server. I should have included the config, in first place itself, please find it below-

const nlpManager = new NlpManager({
    languages,
    nlu: {
      useNoneFeature: true,
      spellCheck: true,
    },
    ner: { threshold: 0.9, builtins: ['Email'] },
    autoSave: false
  });

What do you suggest now?

jesus-seijas-sp commented 3 years ago

Hello @Devanshu15 ,

As I said, to find similar substrings in an string from a list of potential substrings, will never be a O(1) or O(log n) problem, so the problem will always growth time given the number of substrings that you're checking (your entities). I think that this can be the reason why LUIS, Dialogflow, RASA or other Conversational AI don't tackle the problem of similar entitites.

The algorithm can be improved, but it will take time, and I'm on holidays right now.

So my suggestion is to be pragmatical: what percentage of people can make mistakes when writing? a 5%? Then your choice right now is to target the 95% of people that writes correctly the entities, and improve the algorithm in the future, or directly don't target anybody.

In your code the threshold is 0.9. When the threshold is not 1, it will take exactly the same time if you put any number.

Again, the algorithm can be improved, but will not happen soon, but you can try to optimize it: https://github.com/axa-group/nlp.js/blob/master/packages/ner/src/extractor-enum.js#L90 Any PR improving the times is welcome.

Devanshu15 commented 3 years ago

Yes makes sense, and i agree to what you are saying. Thanks for your help and feel free to close the issue as well.

similonpascal commented 3 years ago

Hi, i found a problem to use :

const manager = new NlpManager({ ner: { builtins: [], threshold: 1 });

If the corpus have regex, we receive this error :

/Users/simpas/node_modules/@nlpjs/ner/src/extractor-enum.js:201
      for (let j = 0; j < current.texts.length; j += 1) {
                                        ^

TypeError: Cannot read property 'length' of undefined
    at ExtractorEnum.buildRuleDict (/Users/simpas/node_modules/@nlpjs/ner/src/extractor-enum.js:201:41)
    at ExtractorEnum.extractFromRule (/Users/simpas/node_modules/@nlpjs/ner/src/extractor-enum.js:255:14)
    at ExtractorEnum.extract (/Users/simpas/node_modules/@nlpjs/ner/src/extractor-enum.js:298:29)
    at ExtractorEnum.run (/Users/simpas/node_modules/@nlpjs/ner/src/extractor-enum.js:317:22)
    at Ner.defaultPipelineProcess (/Users/simpas/node_modules/@nlpjs/ner/src/ner.js:341:45)
    at async Ner.process (/Users/simpas/node_modules/@nlpjs/ner/src/ner.js:370:16)
    at async Ner.generateEntityUtterance (/Users/simpas/node_modules/@nlpjs/ner/src/ner.js:442:13)
    at async Nlp.process (/Users/simpas/node_modules/@nlpjs/nlp/src/nlp.js:595:33)
    at async NlpManager.process (/Users/simpas/node_modules/node-nlp/src/nlp/nlp-manager.js:211:20)
    at async main (/Users/simpas/Desktop/nlp_js/index.js:50:18)

This work with regex if i test the "text" property of rules :

buildRuleDict(rule) {
    const dict = {};
    const inverse = {};
    for (let i = 0; i < rule.rules.length; i += 1) {
      const current = rule.rules[i];
      if ( current.texts ) {
        for (let j = 0; j < current.texts.length; j += 1) {
          const source = current.texts[j];
          const key = this.normalize(current.texts[j]);
          if (!dict[key]) {
            dict[key] = [];
          }
          dict[key].push(current);
          inverse[key] = source;
        }
      }
    }
    rule.dict = dict;
    rule.inverseDict = inverse;
  }

jesus-seijas-sp commented 3 years ago

Fixed and published new version with fix.

Apollon77 commented 2 years ago

@jesus-seijas-sp Added threshold infos to new v4 docs in my PR #1171

aigloss commented 1 year ago

Closing due to inactivity. Please, re-open if you think the topic is still alive.

axa-group / nlp.js

Performance Issue while extracting entities #778

Summary

Environment