curiosity-ai / catalyst

🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
MIT License

Error creating Japanese NLP Pipeline #80

Open gilliganc opened 1 year ago

gilliganc commented 1 year ago

Describe the bug

Trying to load the pipeline for the Japanese model/language results in a MessagePackSerializationException. This is on .NET 6 on Windows 10.

To Reproduce

  1. Add the Japanese model NuGet package.
  2. Run the following code:
    Catalyst.Models.Japanese.Register();
    var nlp = await Pipeline.ForAsync(Language.Japanese);

The second line throws the exception shown under Additional context.

Expected behavior

The pipeline is created without error and can perform NLP on Japanese text.

Additional context

MessagePack.MessagePackSerializationException : Error occurred while reading from the stream.
---- System.NullReferenceException : Object reference not set to an instance of an object.

  Stack Trace: 
MessagePackSerializer.DeserializeAsync[T](Stream stream, MessagePackSerializerOptions options, CancellationToken cancellationToken)
StorableObjectV2`2.LoadAsync(Stream stream)
AveragePerceptronTagger.LoadAsync(Stream stream)
<<Register>b__0_7>d.MoveNext()
--- End of stack trace from previous location ---
ResourceLoader.LoadAsync[T](Assembly assembly, String resourceFile, Func`2 loader)
<<Register>b__0_0>d.MoveNext()
--- End of stack trace from previous location ---
StorableObject`2.LoadDataAsync()
AveragePerceptronTagger.FromStoreAsync(Language language, Int32 version, String tag)
Pipeline.ForAsync(Language language, Boolean sentenceDetector, Boolean tagger)

theolivenbaum commented 1 year ago

Hi @gilliganc, thanks for reporting it. This is probably because we don't have an AveragePerceptronTagger model for Japanese. I'll investigate how to improve this.

Meanwhile, you can create a tokenizer-only pipeline.
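
A tokenizer-only pipeline might look like the sketch below. The `sentenceDetector` and `tagger` parameter names come from the `Pipeline.ForAsync` signature visible in the stack trace above; that passing `false` for both skips loading the failing AveragePerceptronTagger is an assumption and is not confirmed in this thread. The sample sentence is illustrative only, and `Document`, `ProcessSingle`, `ToJson`, and the `Mosaik.Core` namespace are used as in the Catalyst README.

```csharp
using System;
using System.Threading.Tasks;
using Catalyst;
using Mosaik.Core; // assumed home of the Language enum, as in the Catalyst README

public static class Program
{
    public static async Task Main()
    {
        // Register the Japanese model package, as in the original report.
        Catalyst.Models.Japanese.Register();

        // Assumption: disabling the sentence detector and the tagger leaves
        // only the tokenizer, so the AveragePerceptronTagger is never loaded.
        var nlp = await Pipeline.ForAsync(
            Language.Japanese,
            sentenceDetector: false,
            tagger: false);

        // Illustrative input; any Japanese text would do.
        var doc = new Document("これはテストです。", Language.Japanese);
        nlp.ProcessSingle(doc);

        // Dump the processed document to inspect the resulting tokens.
        Console.WriteLine(doc.ToJson());
    }
}
```

A pipeline built this way would produce tokens only, without part-of-speech tags, so it would not be enough on its own for tasks such as noun detection.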

gilliganc commented 1 year ago

Thanks. I think I need more than the tokenizer: I am trying to port some existing Python code built around spaCy to .NET, to see whether I can improve performance and integrate it more easily. According to the person who wrote the original code, I need more than the tokenizer. We are trying to detect the keywords and the nouns in the Japanese text, and I don't think the tokenizer alone would help, right?

CodeRabbit957 commented 4 months ago

Is this being worked on? I still have this error. It's definitely the AveragePerceptronTagger (I'm getting NullReferenceException).

Does the tokenizer even work properly?

Is there a reason this spaCy model has been ported without it? The Japanese model is pretty much useless right now if I can't get anything to work. How soon can this be fixed?

It looks like spaCy hasn't used averaged perceptron taggers since before version 2.0. It now uses neural networks (matrix multiplication). Are all the Catalyst models based on APTs?

theolivenbaum commented 4 months ago

@CodeRabbit957 we haven't updated the tagger because we no longer use it in our own app. In any case, Catalyst would need to incorporate a proper CJK tokenizer such as https://github.com/leungwensen/cjk-tokenizer to handle Japanese correctly. If you're up for the challenge, PRs are welcome!