curiosity-ai / catalyst

🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
MIT License

Is it possible to implement a skills extractor with catalyst using Named Entity Recognition? #33

Closed by atresnjo 4 years ago

atresnjo commented 4 years ago

Is your feature request related to a problem? Please describe.
Basically, I am looking to build a service that can extract skills from a job ad.

Describe the solution you'd like
A sample text would be: "We are looking for backend developers who are proficient in C#", and I'd like to extract "backend" and "C#" from it.

Describe alternatives you've considered
I got parts of it working with spaCy in Python, but I'd like a .NET implementation.

t3an commented 4 years ago

Waiting for a tutorial on custom NER with Catalyst.

theolivenbaum commented 4 years ago

Hi @atresnjo & @ThienTran8

We support three types of NER models: PatternSpotter, Spotter, and AveragePerceptronEntityRecognizer.

There is a sample for the PatternSpotter here: https://github.com/curiosity-ai/catalyst/blob/master/samples/EntityRecognition/Program.cs

For the Spotter model, the usage is fairly similar, and you can train it from your own list of entities like this:

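using System;
using System.Linq;
using Catalyst;
using Catalyst.Models; // Spotter
using Mosaik.Core;     // Language

var spotter = new Spotter(Language.Any, 0, "programming", "ProgrammingLanguage");
spotter.Data.IgnoreCase = true; // In some cases, it might be better to set it to false, and only add upper/lower-case exceptions as required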

spotter.AddEntry("C#");
spotter.AddEntry("Python");
spotter.AddEntry("Python 3"); //entries can have more than one word, and will be automatically tokenized on whitespace
spotter.AddEntry("C++");
spotter.AddEntry("Rust");
spotter.AddEntry("Java");

var nlp = Pipeline.TokenizerFor(Language.English);
nlp.Add(spotter); //When adding a spotter model, the model propagates any exceptions on tokenization to the pipeline's tokenizer

var doc = new Document("Being the descendant of C and with its code compiled, C++ excels such languages as Python, C#, or any interpreted language. In terms of Rust vs. C++, Rust is frequently proclaimed to be faster than C++ due to its unique components.", Language.English);

nlp.ProcessSingle(doc);

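// Format each recognized entity as "value [type]", one per line
var entities = doc.SelectMany(span => span.GetEntities()).Select(e => $"\t{e.Value} [{e.EntityType.Type}]");

Console.WriteLine($"Input text:\n\t'{doc.Value}'");
Console.WriteLine($"\nTokenized Value:\n\t'{doc.TokenizedValue}'");
Console.WriteLine($"\nEntities: \n{string.Join("\n", entities)}");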
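
For the skills use case from the original question, the same Spotter calls could be applied to the job-ad sentence roughly as follows. This is only a minimal sketch: the tag names "skills" and "Skill" and the two-entry skill list are hypothetical, the using directives are assumed to match the linked sample, and depending on the Catalyst version you may need additional setup (such as registering the English language package) before creating the pipeline.

```csharp
using System;
using System.Linq;
using Catalyst;
using Catalyst.Models;
using Mosaik.Core;

class Program
{
    static void Main()
    {
        // "skills" and "Skill" are hypothetical tag names chosen for this sketch
        var spotter = new Spotter(Language.Any, 0, "skills", "Skill");
        spotter.Data.IgnoreCase = true;

        // Hypothetical skill list; in practice this would be loaded from your own data
        spotter.AddEntry("backend");
        spotter.AddEntry("C#");

        var nlp = Pipeline.TokenizerFor(Language.English);
        nlp.Add(spotter);

        // Sample sentence taken from the original question
        var doc = new Document("We are looking for backend developers who are proficient in C#", Language.English);
        nlp.ProcessSingle(doc);

        // Print every entity the spotter captured, with its entity type
        foreach (var entity in doc.SelectMany(span => span.GetEntities()))
        {
            Console.WriteLine($"{entity.Value} [{entity.EntityType.Type}]");
        }
    }
}
```

A list-based spotter like this only finds the entries it was given, so how well it covers real job ads depends on how complete the skill list is.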

For the AveragePerceptronEntityRecognizer, you can follow the WikiNER training code here

theolivenbaum commented 4 years ago

I've updated the sample to include the Spotter training example: https://github.com/curiosity-ai/catalyst/blob/master/samples/EntityRecognition/Program.cs

t3an commented 4 years ago

@theolivenbaum Thank you for the update. It is great.

The Neuralizer part, with Add and Forget, is quite useful. However, my problem is more ambiguous.

For my Entity Recognition use case, I have two types of entities: Material (leather, brick, ...) and Object (house, brick, ...).

For example: Make a brick (Object) and make a brick (Material) house.

From your point of view, which technique would help resolve this ambiguity?

Thank you very much