curiosity-ai / catalyst

🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
MIT License
715 stars, 73 forks

How to create our own model? #59

Closed ADD-eNavarro closed 2 years ago

ADD-eNavarro commented 3 years ago

Is your feature request related to a problem? Please describe. My company is considering using your great library to analyze texts in a care home environment. I can't find anywhere how we could create new types of tags, for instance "meds", or extend existing ones, for example by adding "room", "toilet", and so on to locations.

Describe the solution you'd like: An explanation of how to create and expand the tagging dictionaries.

After carefully reading issue 45, closely related, I get a few points:

So what I am actually asking for is a general guide to training a model: which method to use, where to get datasets, and how to store them locally or create a NuGet package (OK, that last part is probably out of scope for Catalyst).

decay29 commented 2 years ago

We are doing something similar, so I would recommend using Spotters. A Spotter is a dictionary of the specific words and phrases you want to detect, all tagged with one entity type. Fill each Spotter with the terms you need.

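// Arguments: language, model version, model tag, entity type assigned to matches.
Spotter spot = new Spotter(Mosaik.Core.Language.English, 0, "items", "Items");
// Register each term that should be recognized as an "Items" entity.
spot.AddEntry("toilet");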

Then add the spotter to your instance of the Pipeline.

var pipeline = await Pipeline.ForAsync(Language.English);
// The method name is PascalCase: Add, not add.
pipeline.Add(spot);

The project refers to this as a gazetteer-like model. Look in the samples under entity recognition. This works well for identifying specific words and phrases, since it matches only the entries you have added.
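
Putting the two snippets together, here is a rough end-to-end sketch for the care home case. It assumes the Catalyst and Catalyst.Models.English NuGet packages are installed and a C# version with top-level statements. The entity types "Meds" and "Room", the example terms, the sample sentence, and the "catalyst-models" folder name are made up for illustration; check the calls against the entity recognition sample for your Catalyst version.

```csharp
// Sketch only: assumes the Catalyst and Catalyst.Models.English NuGet packages.
using System;
using System.Linq;
using Catalyst;
using Catalyst.Models;
using Mosaik.Core;

// Each language has to be registered before a pipeline is created for it.
Catalyst.Models.English.Register();

// Folder where downloaded models are cached (name chosen for this example).
Storage.Current = new DiskStorage("catalyst-models");

// One Spotter per custom entity type; "Meds" and "Room" are hypothetical types.
var meds = new Spotter(Language.English, 0, "meds", "Meds");
meds.Data.IgnoreCase = true; // match "Paracetamol" as well as "paracetamol"
meds.AddEntry("paracetamol");
meds.AddEntry("ibuprofen");

var rooms = new Spotter(Language.English, 0, "rooms", "Room");
rooms.Data.IgnoreCase = true;
rooms.AddEntry("toilet");
rooms.AddEntry("dining room");

// Build the English pipeline and append both Spotters to it.
var pipeline = await Pipeline.ForAsync(Language.English);
pipeline.Add(meds);
pipeline.Add(rooms);

// Process a made-up sentence and list every entity that was found.
var doc = new Document("The resident took paracetamol in the dining room.", Language.English);
pipeline.ProcessSingle(doc);

foreach (var entity in doc.SelectMany(span => span.GetEntities()))
{
    Console.WriteLine($"{entity.Value} [{entity.EntityType.Type}]");
}
```

Keeping one Spotter per entity type means each list of terms can be maintained separately, and the entity type printed for each match tells you which list it came from.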