Open gilliganc opened 2 years ago
Hi @gilliganc, thanks for reporting it. This is probably because we don't have an AveragePerceptronTagger model for Japanese. I'll investigate how to improve this.
Meanwhile, you can create a "Tokenizer"-only pipeline.
Thanks. I think I need more than the tokenizer: I was trying to port some existing Python code built around spaCy to .NET, to see if I could improve the performance and integrate it more easily. Based on what the person who wrote the original code told me, the tokenizer alone is not enough. We are trying to detect the keywords and the nouns in the Japanese text, and I don't think just the tokenizer would help, right?
Is this being worked on? I still have this error. It's definitely the AveragePerceptronTagger (I'm getting NullReferenceException).
Does the tokenizer even work properly?
Is there a reason this spaCy model has been ported without it? The Japanese model is pretty much useless right now if I can't get anything to work. How soon can this be fixed?
It looks like spaCy hasn't used averaged perceptron taggers since before version 2.0. It now uses neural networks (matrix multiplication). Are all the Catalyst models based on averaged perceptron taggers?
@CodeRabbit957 we haven't updated the tagger because we no longer use it in our own app. In any case, Catalyst would need to incorporate a proper CJK tokenizer such as https://github.com/leungwensen/cjk-tokenizer to handle Japanese correctly. If you're up for the challenge, PRs are welcome!
Describe the bug
Trying to load the Pipeline for the Japanese model/language results in a MessagePackSerializationException. This is on .NET 6 on Windows 10.
To Reproduce
The second line fails with the exception shown under Additional context.
Expected behavior
Create the Pipeline without error and be able to perform NLP on Japanese text.
Additional context