axa-group / nlp.js

An NLP library for building bots, with entity extraction, sentiment analysis, automatic language identification, and more
MIT License
6.22k stars 616 forks

Example of how to build and train custom NER models #338

Open claytongulick opened 4 years ago

claytongulick commented 4 years ago

Is your feature request related to a problem? Please describe.

I'm working on identifying medical terms and phrases that are very specific to the medical industry. I need to be able to create NER models.

Describe the solution you'd like

nlp.js has some great built-in entity recognizers (date, email, etc.), but it's not clear (to me, as a beginner) how to build your own so that it works with the framework. I'd like to see some clear examples of how to create these models, how to label documents to train the recognition engine, and how to use and save the trained models.

jesus-seijas-sp commented 4 years ago

Hello! In version 4, everything is split into plugins and pipelines. The entity extraction pipeline is defined here, and it can be modified through a configuration file: https://github.com/axa-group/nlp.js/blob/master/packages/ner/src/ner.js#L54

      [
        '.decideRules',
        'extract-enum',
        'extract-regex',
        'extract-trim',
        'extract-builtin',
      ],

As you can see, it executes those plugins in this order: extract-enum, extract-regex, extract-trim, and extract-builtin. For extract-builtin, you can currently choose to register either the plugin for Microsoft Recognizers or the one for Duckling, which are located in these packages:

https://github.com/axa-group/nlp.js/tree/master/packages/builtin-duckling
https://github.com/axa-group/nlp.js/tree/master/packages/builtin-microsoft

What's more, you can decide per language which plugins or pipelines to use. For a clean example of how to build an extractor plugin, take a look at the regex extractor: https://github.com/axa-group/nlp.js/blob/master/packages/ner/src/extractor-regex.js

Registering a plugin with the same name as an existing one replaces it, so you can completely replace how the NER, regex, trim, and builtin extraction is done. It also means that you can modify the pipelines, removing the steps you don't need and adding new ones (put the name of the plugin to execute in the pipeline, and register the plugin).
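To make the "same name replaces, new name extends" idea concrete, here is a minimal standalone sketch. It does not use the nlp.js container API: `PluginRegistry`, `register`, `runPipeline`, and `regexStep` are hypothetical names invented for illustration, and the real extractor plugins (such as the regex extractor linked above) have their own shape.

```javascript
// Standalone sketch: this is NOT the nlp.js container API.
// PluginRegistry, register, runPipeline and regexStep are hypothetical names.
class PluginRegistry {
  constructor() {
    this.plugins = new Map();
  }

  // Registering under an existing name overwrites the earlier plugin.
  register(name, plugin) {
    this.plugins.set(name, plugin);
  }

  // Run the named steps in order, feeding each step's output to the next.
  runPipeline(steps, input) {
    return steps.reduce((current, step) => {
      const plugin = this.plugins.get(step);
      if (!plugin) {
        throw new Error(`No plugin registered as "${step}"`);
      }
      return plugin(current);
    }, input);
  }
}

// Helper: build a step that appends one entity per regex match.
function regexStep(entity, pattern) {
  return (input) => ({
    ...input,
    entities: [
      ...input.entities,
      ...(input.utterance.match(pattern) || []).map((text) => ({ entity, text })),
    ],
  });
}

const registry = new PluginRegistry();

// A stand-in "default" step, then a replacement under the same name.
registry.register('extract-regex', regexStep('email', /\S+@\S+\.\S+/g));
registry.register('extract-regex', regexStep('dosage', /\b\d+\s?mg\b/gi));

// A brand-new step, added to the pipeline by name.
registry.register('extract-drug', regexStep('drug', /\b(atorvastatin|ibuprofen)\b/gi));

const result = registry.runPipeline(['extract-regex', 'extract-drug'], {
  utterance: 'Take 20 mg of atorvastatin, then mail nurse@example.com',
  entities: [],
});

console.log(result.entities);
```

In this sketch the e-mail address is not extracted, because the second `register('extract-regex', ...)` call replaced the first one, while the dosage and the drug name are both picked up.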

As for how to do that: we want to stay backward compatible with version 3.x, which used the Microsoft or Duckling builtin depending on a configuration passed to the NlpManager class. You can see how we did it in version 4 here: https://github.com/axa-group/nlp.js/blob/master/packages/node-nlp/src/nlp/nlp-manager.js#L49

keyvez commented 4 years ago

Is it possible to train a custom NER? For example, say I want this question answered:

"Tell me about %attribute% of Tesla Model S."

%attribute% could be any of a long list of things such as [color, seats, weight, ...], but the list is not fixed in advance.

How do I create an entity extractor specifically for that and pass it down to the NlpManager?

Apollon77 commented 2 years ago

In fact, you could use a trim rule where you define the words before/after your word, and the word is then trimmed out of the string. You can also try to use an enum entity AND a trim rule. The first should give better matching for "known words", and the other one would still allow "unknown" words.
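As a rough illustration of that combination, the following standalone sketch tries a list of known values first and falls back to a "between two words" rule. It is plain JavaScript and does not call nlp.js: `extractEnum`, `extractBetween`, and `extractAttribute` are hypothetical helpers, and the boundary words `about` and `of` are assumptions taken from the example question above.

```javascript
// Standalone sketch, not nlp.js code. The helpers below are hypothetical and
// only mimic the idea of an enum entity combined with a trim ("between") rule.
const KNOWN_ATTRIBUTES = ['color', 'seats', 'weight'];

// Enum-style lookup: return the first known option found as a whole word.
function extractEnum(utterance, options) {
  const found = options.find((option) =>
    new RegExp(`\\b${option}\\b`, 'i').test(utterance)
  );
  return found ? { entity: 'attribute', value: found, source: 'enum' } : null;
}

// Trim-style rule: capture the text between a left word and a right word.
// Assumes `left` and `right` are plain words without regex metacharacters.
function extractBetween(utterance, left, right) {
  const pattern = new RegExp(`\\b${left}\\s+(.+?)\\s+${right}\\b`, 'i');
  const match = utterance.match(pattern);
  return match ? { entity: 'attribute', value: match[1], source: 'trim' } : null;
}

// Prefer the known list; fall back to the trim rule for unknown words.
function extractAttribute(utterance) {
  return (
    extractEnum(utterance, KNOWN_ATTRIBUTES) ||
    extractBetween(utterance, 'about', 'of')
  );
}

const known = extractAttribute('Tell me about color of Tesla Model S.');
const unknown = extractAttribute('Tell me about towing capacity of Tesla Model S.');

console.log(known, unknown);
```

Here the first question is resolved from the known list and the second through the fallback rule, so new attributes still produce an entity even before they are added to the list.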

Apollon77 commented 2 years ago

I will add tests and check that once my PRs are merged