curiosity-ai / catalyst

🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
MIT License

Repo/Project for NER Models? #45

Closed (dgerding closed 3 years ago)

dgerding commented 3 years ago

Can you please post a repo that shows how you are training NER? It seems like you are using WikiNER data... but are you using anything else? Will you please share a repo or add a project that duplicates how the "included" models are built?

The single biggest holdup for me in committing to using Catalyst moving ahead is the ability to see how you are doing NER training for the models that are included.

theolivenbaum commented 3 years ago

The NER model that we use in the example is trained with the code here: https://github.com/curiosity-ai/catalyst/blob/master/Catalyst.Training/src/TrainWikiNER.cs

Some models (like for date recognition) are from ML.NET and are "deterministic" i.e. not machine learning.

As for the others (Spotter/PatternSpotter), you're the one providing data or rules to train.

Did I miss any other you wanted to know?

dgerding commented 3 years ago

Hi, Thanks for the reply :) Guess I need to be more specific.

I wondered if you are willing/able to reveal what corpus you use to train the perceptron(?) model for Person, Organization and Location? I have access to the typical Reuters and CoNLL raw training data (via NIST), but wondered if the pre-trained model that is pulled down from Azure is updateable/retrainable on device? I also wondered if NER model quality is a focus of the project for you down the road, or if I should just focus on training my own model?

Thanks. And I hope to have some value to contribute back to the project down the road.

theolivenbaum commented 3 years ago

For the model included, we used only WikiNER if I'm not mistaken - if you have access to any other public NER dataset I'd be happy to add it to the training code! Regarding re-training, the easiest is to train it from zero with the added data - just modify the Catalyst.Training project to suit your data source.
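As a rough illustration of "training from zero with the added data", one preparatory step could be converting extra annotated data into the same line format as the WikiNER files, so that both can be fed to one training run. The sketch below is only that: it assumes the extra data is in the CoNLL-2003 four-column layout (word, POS, chunk, NER tag, blank line between sentences) and that WikiNER stores one sentence per line as `word|POS|tag` tokens. The function name `conll_to_wikiner_lines` is hypothetical and not part of Catalyst, and whether `TrainWikiNER.cs` accepts such a file unchanged is not confirmed in this thread.

```python
from typing import Iterable, Iterator, List


def conll_to_wikiner_lines(conll_lines: Iterable[str]) -> Iterator[str]:
    """Yield one 'word|POS|TAG ...' line per sentence (hypothetical helper).

    Assumes CoNLL-2003-style input: whitespace-separated columns with the
    word first, the POS tag second and the NER tag last.
    """
    sentence: List[str] = []
    for raw in conll_lines:
        line = raw.strip()
        # A blank line or a -DOCSTART- marker ends the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield " ".join(sentence)
                sentence = []
            continue
        columns = line.split()
        if len(columns) < 3:
            # Skip malformed rows rather than guessing at missing columns.
            continue
        word, pos, ner = columns[0], columns[1], columns[-1]
        sentence.append(f"{word}|{pos}|{ner}")
    # Flush the last sentence if the input does not end with a blank line.
    if sentence:
        yield " ".join(sentence)


# Tiny made-up sample in CoNLL-2003 layout, for illustration only.
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
""".splitlines()

converted = list(conll_to_wikiner_lines(sample))
for sentence_line in converted:
    print(sentence_line)
```

The converted lines could then be appended to a copy of the WikiNER training file before re-running the training project. Two caveats: a token that itself contains `|` would break this naive format, and the tag scheme (IOB1 vs. IOB2) of the added data should match the one in the WikiNER files.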