curiosity-ai / catalyst

🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
MIT License
715 stars 73 forks

Corpus? #22

Closed dgerding closed 4 years ago

dgerding commented 4 years ago

Can you point to the Universal Dependencies data you used? Or could it be included, I'm guessing, in the Corpus project? I'm really excited to be able to try training.

Thanks,
Dave G

theolivenbaum commented 4 years ago

Hi Dave,

The training data used for the Catalyst.Training project can be found below:

You can also use the pre-trained models available in the online repository, for example:

// Configures the model storage to use the online repository, backed by the local folder ./catalyst-models/
Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models"));
var nlp = await Pipeline.ForAsync(Language.English);
nlp.Add(await AveragePerceptronEntityRecognizer.FromStoreAsync(language: Language.English, version: Version.Latest, tag: "WikiNER"));

If you want, I can also provide you with a direct download link for all the data. It's about 3.4 GB without the OntoNotes dataset.

dgerding commented 4 years ago

Thanks!

ADD-eNavarro commented 3 years ago

Hi! I know this issue is long closed, but I would be grateful if that download link were published :^)