curiosity-ai / catalyst

🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
MIT License

Spacy.Initialize() Throws Exception #66

Closed lightel closed 2 years ago

lightel commented 2 years ago

Describe the bug

I have a command-line .NET 5.0 application that uses the Catalyst and Catalyst.Spacy libraries (both at version 1.0.23862). I was following this guide to build a minimal application that analyzes text with spaCy.

When I run the sample I get the following exception message:

Unhandled exception. System.Collections.Generic.KeyNotFoundException: The given key '3.2.0' was not present in the dictionary.
   at System.Collections.Generic.Dictionary`2.get_Item(TKey key)
   at Catalyst.Spacy.LoadModelsData(ModelSize modelSize, Language[] languages)
   at Catalyst.Spacy.Initialize(ModelSize modelSize, Language[] languages)
   at catalyst_test.Program.RunSpacy() in C:\Users\andru\source\repos\catalyst_test\catalyst_test\Program.cs:line 51
   at catalyst_test.Program.Main(String[] args) in C:\Users\andru\source\repos\catalyst_test\catalyst_test\Program.cs:line 19
   at catalyst_test.Program.<Main>(String[] args)

To Reproduce

The code below reproduces the issue:

    // Initialize the spaCy bridge; the exception is thrown inside this call.
    using (await Spacy.Initialize(Spacy.ModelSize.Small, Language.Any, Language.English))
    {
        var nlp = Spacy.For(Spacy.ModelSize.Small, Language.English);
        var doc = new Document("Bill Gates is the founder of Microsoft", Language.English);
        nlp.ProcessSingle(doc);
        Console.WriteLine(doc.ToJson());
    }

lightel commented 2 years ago

It turns out that the compatibility.json file, which is used for downloading the spaCy model, no longer has a 3.2.0 key. Instead, it has a 3.2 key:

{
  "spacy": {
    "3.2": {
      "ca_core_news_lg": [
        "3.2.0"
      ],
...
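
The mismatch above (a lookup for '3.2.0' against a dictionary keyed by '3.2') can be avoided with a tolerant lookup. The following is only a minimal sketch of one possible approach, not the fix that was later published: `CompatibilityLookup` and `TryGetModels` are hypothetical names that are not part of Catalyst, and the dictionary shape is assumed from the JSON excerpt above. It tries the exact version first and then falls back to the "major.minor" prefix, using `TryGetValue` so that a missing key does not throw `KeyNotFoundException`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Catalyst or Catalyst.Spacy.
public static class CompatibilityLookup
{
    // Looks up the entry for the installed spaCy version.
    // Tries the exact version first (e.g. "3.2.0"), then "major.minor" (e.g. "3.2").
    public static bool TryGetModels<T>(
        IReadOnlyDictionary<string, T> byVersion,
        string installedVersion,
        out T models)
    {
        // Exact match; TryGetValue returns false instead of throwing.
        if (byVersion.TryGetValue(installedVersion, out models))
        {
            return true;
        }

        // Fallback: keep only the first two components of the version.
        string[] parts = installedVersion.Split('.');
        if (parts.Length >= 2)
        {
            string majorMinor = parts[0] + "." + parts[1];
            return byVersion.TryGetValue(majorMinor, out models);
        }

        models = default;
        return false;
    }
}

public static class Program
{
    public static void Main()
    {
        // Assumed shape, mirroring the "spacy" section of the excerpt above.
        var spacy = new Dictionary<string, Dictionary<string, string[]>>
        {
            ["3.2"] = new Dictionary<string, string[]>
            {
                ["ca_core_news_lg"] = new[] { "3.2.0" }
            }
        };

        if (CompatibilityLookup.TryGetModels(spacy, "3.2.0", out var models))
        {
            Console.WriteLine(string.Join(", ", models.Keys));
        }
        else
        {
            Console.WriteLine("No compatible models found.");
        }
    }
}
```

Returning a `bool` rather than indexing the dictionary directly lets the caller report an unsupported spaCy version with a clear message instead of an unhandled exception.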
Billyish commented 2 years ago

@lightel did you resolve this particular issue? I am having the same problem.

Rafael says in his blog post:

After a bit of fiddling with how spaCy downloads and installs models (and how they handle model compatibility across versions), I ended up reverse engineering the download logic and reimplementing it in C# to invoke directly the Installer.PipInstallModule method with the correct URL created for the installed spaCy version, language and model sizes requested by the user (similar to how the spaCy CLI invokes pip in this line)

Did you resolve this by unpacking Rafael's code to redo that download logic? If so, can you share it?

theolivenbaum commented 2 years ago

Let me check here how to fix this, probably an easy fix on my side!

theolivenbaum commented 2 years ago

@lightel @Billyish I just pushed a fix that should handle this case. Could you test it once it's published to NuGet? It should take about an hour to be online (package version 1.0.24611).

Billyish commented 2 years ago

Thanks @theolivenbaum. I was about to post to say that I have worked out the issue myself. Good to have the package update though! I will check out your update.