axa-group / nlp.js

An NLP library for building bots, with entity extraction, sentiment analysis, automatic language identification, and more
MIT License
6.28k stars 621 forks

Overlapping entities in ner output #426

Closed bbgvalayev closed 1 year ago

bbgvalayev commented 4 years ago

Is your feature request related to a problem? Please describe.

Hi, I am using NER to highlight words in text, and I found a case where the output entities overlap. Here is my code:

const { containerBootstrap } = require("@nlpjs/core");
const { BuiltinMicrosoft } = require('@nlpjs/builtin-microsoft');
const { Ner, ExtractorEnum, ExtractorRegex, ExtractorTrim, ExtractorBuiltin } = require("@nlpjs/ner");

async function main() {
    const container = await containerBootstrap();
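    const builtin = new BuiltinMicrosoft({ threshold: 0.8 });

    // Register the Microsoft builtin extractor under a wildcard name
    // ('??' stands for any two-character locale), as a singleton.
    container.register('extract-builtin-??', builtin, true);
    // Enable the extractors that Ner will run.
    container.use(ExtractorEnum);
    container.use(ExtractorRegex);
    container.use(ExtractorTrim);
    container.use(ExtractorBuiltin);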

    const ner = new Ner({ container });
    const actual = await ner.process({
        text: "Canada’s stock market surged for a second day as investors saw hope that government spending plans will bolster a global economy hit by the coronavirus pandemic.",
        locale: "en"
    });

    console.log(actual.entities);
}

main();

And here is the output when you run it:

[ { start: 35,
    end: 40,
    len: 6,
    accuracy: 0.95,
    sourceText: 'second',
    utteranceText: 'second',
    entity: 'ordinal',
    resolution: { strValue: '2', value: 2, subtype: 'integer' } },
  { start: 33,
    end: 40,
    len: 8,
    accuracy: 0.95,
    sourceText: 'a second',
    utteranceText: 'a second',
    entity: 'duration',
    resolution: { values: [Array] } } ]

Describe the solution you'd like

I would like a way to avoid these overlaps, since they make it difficult to highlight the entities in the original text. Maybe there is a smarter way of doing this in the first place?

Describe alternatives you've considered

NA

Additional context

NA

Apollon77 commented 2 years ago

The question is what to do in such a case.

v3 had logic to detect overlaps between matched builtin entities and trim entities. I reintroduced this feature in my PR https://github.com/axa-group/nlp.js/pull/1171/commits/264cb4bc5f94d92d1c9dc33fd76adfd41733f602 ... but in this case it is two builtin entities that overlap.

Which one should win? ;-)
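
One possible answer, as a rough sketch rather than anything nlp.js does itself: post-process `actual.entities` in user code and keep a single winner per overlapping group. The helper name `resolveOverlaps` and its tie-break order (higher `accuracy` first, then the longer span, then the earlier start) are assumptions made for illustration. It also assumes `start` and `end` are inclusive offsets, which matches the output above (`start: 35, end: 40, len: 6`).

```javascript
// Sketch only: resolveOverlaps is a hypothetical helper, not part of nlp.js.
// Assumes each entity carries inclusive `start`/`end` offsets and an
// `accuracy` score, as in the output shown in this issue.
function resolveOverlaps(entities) {
  // Rank candidates: higher accuracy, then longer span, then earlier start.
  const ranked = [...entities].sort(
    (a, b) =>
      b.accuracy - a.accuracy ||
      (b.end - b.start) - (a.end - a.start) ||
      a.start - b.start
  );

  const kept = [];
  for (const candidate of ranked) {
    // Two inclusive ranges overlap when each starts no later than the other ends.
    const clashes = kept.some(
      (e) => candidate.start <= e.end && e.start <= candidate.end
    );
    if (!clashes) kept.push(candidate);
  }

  // Return the survivors in text order so highlighting can walk left to right.
  return kept.sort((a, b) => a.start - b.start);
}

// The two entities from the output above, reduced to the relevant fields.
const entities = [
  { start: 35, end: 40, accuracy: 0.95, sourceText: 'second', entity: 'ordinal' },
  { start: 33, end: 40, accuracy: 0.95, sourceText: 'a second', entity: 'duration' },
];

console.log(resolveOverlaps(entities).map((e) => e.entity));
// prints [ 'duration' ]
```

With equal accuracy the longer span wins, so `duration` ("a second") is kept and `ordinal` ("second") is dropped. Whether that is the right winner depends on the use case; a different ranking function can be swapped in without touching the overlap check.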

aigloss commented 1 year ago

Closing due to inactivity. Please re-open if you think the topic is still alive.