curiosity-ai / catalyst

🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
MIT License

Catalyst.Training Details Request: OntoText & UD Version #47

Closed by dgerding 3 years ago

dgerding commented 3 years ago

Hi, I'm trying to add the closest-matching UD and OntoNotes resources so that I can run WikiNERTraining.

Can you point me to which US English UD files you are using? Is it UD_English-EWT?

And which OntoNotes data? Is it CoNLL-formatted and/or 5.0 (like https://github.com/ontonotes/conll-formatted-ontonotes-5.0/tree/master/conll-formatted-ontonotes-5.0/data)?

Thanks!

dgerding commented 3 years ago

I'm going to assume you're using OntoNotes 5 from LDC like everyone else.

Still wondering about the EWT version of UD.

theolivenbaum commented 3 years ago

Hi @dgerding, yesterday I updated our models to use the data from UD 2.7. I also switched to a new distribution model over NuGet and fixed a couple of issues with the training data, where some English files had their text removed.

For WikiNER, we use the data provided here. And for English we obviously use OntoNotes. That dataset is unfortunately not available for direct download, but you can request access here.