curiosity-ai / catalyst

🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the box support for training word and document embeddings, and flexible entity recognition models.

Cannot process Chinese correctly #99

Open TomoakiChenSinica opened 1 year ago

TomoakiChenSinica commented 1 year ago

Language: Chinese

Describe the bug: Catalyst does not process Chinese sentences correctly. The whole sentence comes back as a single token tagged PROPN instead of being split into words.

To Reproduce:

  1. Run code like the block shown under Screenshots.
  2. The output looks like this:
    {"Language":"zh","Length":5,"Value":"往前走五步","TokensData":[[{"Bounds":[0,4],"Tag":"PROPN"}]]}

Expected behavior: The sentence is tokenized and tagged correctly.

Screenshots:

Here is my code:

using Catalyst;
using Mosaik.Core;

// Each language must be registered first (and its NuGet package installed).
Catalyst.Models.Chinese.Register();

// Models are cached on disk in this folder.
Storage.Current = new DiskStorage("catalyst-models");
var nlp = await Pipeline.ForAsync(Language.Chinese);
var doc = new Document("諸葛亮是三國時代著名軍師", Language.Chinese);
nlp.ProcessSingle(doc);
Console.WriteLine(doc.ToJson());

Additional context: Thank you for your help!