curiosity-ai / catalyst

🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
MIT License

Exception for some language in EntityRecognition sample #53

Closed. StanislavPrusac closed this issue 3 years ago

StanislavPrusac commented 3 years ago

Bug description: When I change the language model in samples/EntityRecognition/Program.cs to Croatian, Danish, Serbian, Swedish, Arabic, or Indonesian, I get an exception:

"HResult=0x80131500
Message=Error occurred while reading from the stream.
...
...
at System.Runtime.CompilerServices.TaskAwaiter.GetResult()
at Catalyst.Models.AveragePerceptronEntityRecognizer.<FromStoreAsync>d__24.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Catalyst.Samples.EntityRecognition.Program.<AveragePerceptronEntityRecognizerAndPatternSpotterSample>d__1.MoveNext() in d:\dev\catalyst-master-2021-05-03\samples\EntityRecognition\Program.cs:line 49
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.GetResult()
at Catalyst.Samples.EntityRecognition.Program.<Main>d__0.MoveNext() in d:\dev\catalyst-master-2021-05-03\samples\EntityRecognition\Program.cs:line 35
Inner Exception 1:
NullReferenceException: Object reference not set to an instance of an object."

There is no such error for the following languages: English, German, Spanish, Portuguese, Polish, Italian, French.

Steps to reproduce the behavior:

  1. Go to this file: https://github.com/curiosity-ai/catalyst/blob/master/samples/EntityRecognition/Program.cs
  2. Change these lines from "English" to "Croatian":
//Initialize the Croatian built-in models
Catalyst.Models.Croatian.Register();
...
//Create a new pipeline for the Croatian language, and add the WikiNER model to it
Console.WriteLine("Loading models... This might take a bit longer the first time you run this sample, as the models have to be downloaded from the online repository");
var nlp = await Pipeline.ForAsync(Language.Croatian);
nlp.Add(await AveragePerceptronEntityRecognizer.FromStoreAsync(language: Language.Croatian, version: Version.Latest, tag: "WikiNER"));
...

I used these PackageReference entries in the EntityRecognition sample project:

<PackageReference Include="Catalyst" Version="1.0.16767" />  
<PackageReference Include="Catalyst.Models.English" Version="1.0.17127" />  <!--True-->       
<PackageReference Include="Catalyst.Models.Croatian" Version="1.0.17127" />   <!--False-->

P.S. Thank you for a very useful and wonderful C# NLP library (Catalyst).

theolivenbaum commented 3 years ago

Hi @StanislavPrusac!

Thanks for the report - we should probably handle this case better.

The WikiNER model is unfortunately not available for the languages you mentioned; you can see the ones that are trained here. This is because the original data we use for training is only available for those languages...

If you want to send a PR with an update to the readme, happy to add the info on the repo!

Cheers,

Rafael
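
Until the library handles this case more gracefully, a caller-side guard is one possible workaround. The sketch below is only an illustration built from the calls shown in the report: `WikiNerLoader`, `TryAddWikiNerAsync`, and `ReportedWorking` are hypothetical names, the language set is just the list the reporter found to load without error (not an official list of trained models), and the `using` lines are assumed to match the sample project.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Catalyst;
using Catalyst.Models;
using Mosaik.Core;
using Version = Mosaik.Core.Version; // assumed alias, avoids a clash with System.Version

// Hypothetical helper, not part of Catalyst.
static class WikiNerLoader
{
    // Languages reported in this thread as loading the WikiNER model without error.
    // Assumption: taken from the bug report, not from an official list.
    static readonly HashSet<Language> ReportedWorking = new HashSet<Language>
    {
        Language.English, Language.German, Language.Spanish, Language.Portuguese,
        Language.Polish, Language.Italian, Language.French
    };

    // Returns true if the WikiNER model was added to the pipeline, false otherwise.
    public static async Task<bool> TryAddWikiNerAsync(Pipeline nlp, Language language)
    {
        if (!ReportedWorking.Contains(language))
        {
            Console.WriteLine($"No WikiNER model is known to be available for {language}; skipping.");
            return false;
        }

        try
        {
            // Same call as in the sample, with the language passed in.
            nlp.Add(await AveragePerceptronEntityRecognizer.FromStoreAsync(
                language: language, version: Version.Latest, tag: "WikiNER"));
            return true;
        }
        catch (Exception ex)
        {
            // The report shows a generic exception wrapping a NullReferenceException,
            // so this catches broadly and surfaces both messages.
            Console.WriteLine($"Could not load the WikiNER model for {language}: {ex.Message}");
            if (ex.InnerException != null)
            {
                Console.WriteLine($"Inner exception: {ex.InnerException.Message}");
            }
            return false;
        }
    }
}
```

The caller would still register the language models and create the pipeline as in the sample, then call `await WikiNerLoader.TryAddWikiNerAsync(nlp, Language.Croatian)` and continue without entity recognition if it returns false.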