axa-group / nlp.js

An NLP library for building bots, with entity extraction, sentiment analysis, automatic language identification, and more
MIT License

Compound words in NER #156

Open torava opened 5 years ago

torava commented 5 years ago

Compound words don't work well with NER in nlp.js. For example, according to nlp.js, Spiderman is a spider but not a man, while Spider-Man or Spider Man is both a spider and a man. Whitespace seems to carry excessive significance, and I don't see the point of this distinction. In Finnish especially this is a critical issue, because we have a hell of a lot of compound words. It would be nice if nlp.js could be configured to behave differently in this case.

const { NerManager } = require('node-nlp');

// A low threshold lets weak partial matches through.
const manager = new NerManager({ threshold: 0.1 });
manager.addNamedEntityText('species', 'spider', 'en', ['spider']);
manager.addNamedEntityText('species', 'man', 'en', ['man']);

// The same compound written as one word, with a space, and with a hyphen.
manager.findEntities('spiderman', 'fi').then(entities => console.log(entities));
manager.findEntities('spider man', 'fi').then(entities => console.log(entities));
manager.findEntities('spider-man', 'fi').then(entities => console.log(entities));

spiderman returns

[ { start: 0,
    end: 8,
    len: 9,
    levenshtein: 3,
    accuracy: 0.5,
    option: 'spider',
    sourceText: 'spider',
    entity: 'species',
    utteranceText: 'spiderman' } ]

spider man and spider-man both return

[ { start: 0,
    end: 5,
    len: 6,
    levenshtein: 0,
    accuracy: 1,
    option: 'spider',
    sourceText: 'spider',
    entity: 'species',
    utteranceText: 'spider' },
  { start: 7,
    end: 9,
    len: 3,
    levenshtein: 0,
    accuracy: 1,
    option: 'man',
    sourceText: 'man',
    entity: 'species',
    utteranceText: 'man' } ]

jesus-seijas-sp commented 5 years ago

Hello, in this case it is because of this function: https://github.com/axa-group/nlp.js/blob/master/lib/util/similar-search.js#L135

This function extracts the word positions based on whether each character is alphanumeric or not. So perhaps a per-language strategy could be implemented, as is already done in the tokenizers for each language.
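To illustrate the behaviour described here, the following is a minimal sketch of alphanumeric-boundary word detection. It is not the code in similar-search.js; `getWordPositions` is a hypothetical helper written only for this illustration, and the inclusive `end` offset is chosen to mirror the output shown earlier in the thread.

```javascript
// Hypothetical helper, NOT the actual implementation in similar-search.js.
// Returns the offsets of every run of letters or digits in the text.
function getWordPositions(text) {
  const positions = [];
  // \p{L} matches any letter and \p{N} any digit; the "u" flag enables them.
  const wordPattern = /[\p{L}\p{N}]+/gu;
  for (const match of text.matchAll(wordPattern)) {
    positions.push({
      start: match.index,
      end: match.index + match[0].length - 1, // inclusive end offset
      text: match[0],
    });
  }
  return positions;
}

console.log(getWordPositions('spiderman'));
console.log(getWordPositions('spider-man'));
```

Under this approach 'spiderman' is a single word spanning offsets 0 to 8, while 'spider-man' and 'spider man' both split into 'spider' (0 to 5) and 'man' (7 to 9), which is why only the separated forms can match both entities exactly.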

One question: can you explain what result you expect in each case, that is, for "spider man", "spiderman" and "spider-man"?

Also, "spiderman" returns "spider" because the threshold provided is very low; with a threshold above 0.5 it will return an empty array.
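The numbers in the thread are consistent with an accuracy of (option length - Levenshtein distance) / option length. That formula is an inference from the reported outputs, not something confirmed from the library source, and `assumedAccuracy` below is a hypothetical name. A small sketch under that assumption:

```javascript
// Classic dynamic-programming Levenshtein distance
// (insertions, deletions and substitutions each cost 1).
function levenshtein(a, b) {
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i += 1) {
    const current = [i];
    for (let j = 1; j <= b.length; j += 1) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      current[j] = Math.min(
        previous[j] + 1,       // deletion
        current[j - 1] + 1,    // insertion
        previous[j - 1] + cost // substitution or match
      );
    }
    previous = current;
  }
  return previous[b.length];
}

// ASSUMPTION: formula inferred from the outputs posted in this issue.
function assumedAccuracy(option, word) {
  return (option.length - levenshtein(option, word)) / option.length;
}

// Prints: 3 0.5
console.log(levenshtein('spider', 'spiderman'), assumedAccuracy('spider', 'spiderman'));
```

With a distance of 3 against the 6-letter option 'spider', the score is 0.5, so the match is kept at a threshold of 0.1 and would be dropped at any threshold above 0.5.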

torava commented 5 years ago

Hi, I would expect it to return both spider and man. Then you would have a better chance of detecting whatever new supermen get invented. However, that is not as practical an example as the ones in Finnish:

kana-caesarsalaatti = chicken caesar salad
tonnikalapastasalaatti = tuna pasta salad
pasta-kinkkusalaatti = pasta salad with ham
savukalasalaatti = smoked fish salad
savuporosalaatti = smoked reindeer salad
lohisalaatti = salmon salad
savulohi-vihannessalaatti = salad with smoked salmon and vegetables
savustettu lohi ja vihannessalaatti = salad with smoked salmon and vegetables
kylmäsavulohisalaatti = cold smoked salmon salad
sipuli-perunasalaatti = onion potato salad
tomaatti-mozzarellasalaatti = tomato mozzarella salad
peruna-broileri-juustosalaatti = potato broiler cheese salad
grillikasvis-couscoussalaatti = salad with grilled vegetables and couscous
savuhärkä-pastasalaatti = pasta salad with smoked beef
lohi-avokadosalaatti = salmon avocado salad
kinkku-nuudelisalaatti = ham noodle salad
seesamiahvensalaatti = sesame perch salad
riisinuudelisalaatti = rice noodle salad
valkosipulisalaattikastike = garlic salad dressing
yrtti-balsamicosalaattikastike = herb balsamico salad dressing
tomaatti-chilisalaattikastike = tomato chili salad dressing

I could try to list all salads and salad dressings, or I could just agree that if a compound word ends with salad then it's a salad, and if it ends with salad dressing then it's a salad dressing. Then, if I also have a list of different ingredients, I could tell what kind of salad or salad dressing it is. However, that's not possible now.

const { NerManager } = require('node-nlp');

const manager = new NerManager({ threshold: 0.1 });
manager.addNamedEntityText('animal', 'poro', 'fi', ['poro']);         // reindeer
manager.addNamedEntityText('dish', 'salaatti', 'fi', ['salaatti']);   // salad
manager.addNamedEntityText('food', 'savuporo', 'fi', ['savuporo']);   // smoked reindeer
manager.addNamedEntityText('process', 'savu', 'fi', ['savu']);        // smoke
manager.findEntities('savuporosalaatti', 'fi').then(entities => console.log(entities));

returns []

manager.findEntities('porosalaatti', 'fi').then(entities => console.log(entities));

returns

 [{ start: 0,
    end: 11,
    len: 12,
    levenshtein: 4,
    accuracy: 0.5,
    option: 'salaatti',
    sourceText: 'salaatti',
    entity: 'dish',
    utteranceText: 'porosalaatti' } ]

But I still don't know what kind of salad it is. I could add savuporosalaatti as a named entity, but that's an endless path to take. Consider that the same work would have to be done for every kind of dish: bread, porridge, soup, stew, omelette, sushi, burrito... It would be easier to prepare for any kind of dish than to tell people what kind of dish they can have.

This gets even trickier because savuporo is smoked reindeer, but savu alone is smoke and savustettu is smoked. And almost any meat can be smoked. savustettu lohi ja vihannessalaatti would be a salad with smoked salmon and vegetables. savustettu lohi-vihannessalaatti would mean that the whole salmon vegetable salad is smoked, which is unusual, but I won't stop you from doing that either.
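One way to approach the savu / savuporo ambiguity is dictionary-based segmentation that prefers the longest known part and backtracks when that choice leaves an unmatchable remainder. The sketch below is independent of nlp.js; `lexicon` and `segment` are hypothetical names, the input is assumed to be lowercase, and inflected forms such as savustettu are not handled.

```javascript
// Hypothetical lexicon; entity names follow the example above.
const lexicon = new Map([
  ['poro', 'animal'],
  ['salaatti', 'dish'],
  ['savuporo', 'food'],
  ['savu', 'process'],
]);

// Split a compound into known parts. Tries the longest candidate first and
// backtracks if the rest of the word cannot be segmented.
// Returns null when no complete segmentation exists.
function segment(word, start = 0) {
  // Hyphens and whitespace between parts are skipped.
  while (start < word.length && /[\s-]/.test(word[start])) start += 1;
  if (start === word.length) return [];
  for (let end = word.length; end > start; end -= 1) {
    const part = word.slice(start, end);
    if (lexicon.has(part)) {
      const rest = segment(word, end);
      if (rest !== null) {
        return [{ part, entity: lexicon.get(part), start, end: end - 1 }, ...rest];
      }
    }
  }
  return null;
}

console.log(segment('savuporosalaatti'));
console.log(segment('porosalaatti'));
```

Here savuporosalaatti splits into savuporo (food) and salaatti (dish) rather than savu + poro + salaatti, and the last part can be read as the head of the compound. The backtracking is exponential in the worst case, which is acceptable for single words but would need memoization for long inputs.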

Although the grammar rules say that the correct form is valkosipulisalaattikastike, you can sometimes see it written as valkosipuli salaattikastike. Therefore, it would be best to be able to find entities of the form /(food[\s|-]?)*dish/.
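A pattern of that shape can be generated from plain word lists. The sketch below is only an illustration outside nlp.js; `buildCompoundPattern` and the word lists are hypothetical. Note that inside a character class `|` is a literal character, so the sketch uses `[\s-]` for "whitespace or hyphen".

```javascript
// Escape characters that have a special meaning in a regular expression.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Hypothetical helper: builds ^(?:(?:food)[\s-]?)*(?:dish)$ from two lists.
function buildCompoundPattern(foods, dishes) {
  // Longer alternatives go first so that "savulohi" is tried before "lohi".
  const alt = (words) =>
    [...words].sort((a, b) => b.length - a.length).map(escapeRegExp).join('|');
  return new RegExp(`^(?:(?:${alt(foods)})[\\s-]?)*(?:${alt(dishes)})$`, 'i');
}

const pattern = buildCompoundPattern(
  ['valkosipuli', 'lohi', 'savulohi', 'vihannes'],
  ['salaatti', 'salaattikastike']
);

console.log(pattern.test('valkosipulisalaattikastike'));  // true
console.log(pattern.test('valkosipuli salaattikastike')); // true
console.log(pattern.test('savulohi-vihannessalaatti'));   // true
console.log(pattern.test('lohikeitto'));                  // false
```

A single regular expression like this only answers whether a word fits the food-then-dish shape; recovering which foods were matched would need a segmentation step or named groups on top of it.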