curiosity-ai / catalyst

🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
MIT License
715 stars, 73 forks

Models and data not loading #57

Closed joerglang closed 2 years ago

joerglang commented 3 years ago

Describe the bug: I have a WinForms .NET 5 application that uses Catalyst according to the documentation. However, when I use the code that is supposed to load the data automatically, nothing happens. The download seems to start (it creates some directories) but never downloads the data. I can wait for an hour in the debugger and the call still doesn't return.

When running the samples from the repository (with the same code), the data is downloaded as expected. I copied the "catalyst-models" folder to my solution and have it copied to the debug output; with that in place, loading via FastTextLanguageDetector.FromStoreAsync(Language.Any, Version.Latest, ""); works. However, pipeline = Pipeline.For(Language.English); never returns.

To Reproduce: This is the code that produces the problem:

        public void Init()
        {
            Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models"));
            var t = FastTextLanguageDetector.FromStoreAsync(Language.Any, Version.Latest, "");
            languageDetector = t.WaitResult();

            pipeline = Pipeline.For(Language.English);
            initCalled = true;
        }

As this code is practically the same as in the samples, I really don't see the problem. My setup is:

  1. A WinForms .NET 5.0 application
  2. It references a .NET 5.0 library project
  3. The library project has the Catalyst NuGet packages installed (1.0.16767)
  4. The Init function above is called in the constructor of the "detector" class.

The output window shows the following log information from Catalyst:

[14:56:16 INF] [LOAD] [FastTextLanguageDetectorData-"Any"-v0] (1 B) from '..\\Models\--\FastTextLanguageDetectorData\v000000\model-FastTextLanguageDetector-v000000.bin'
[14:56:16 INF] [LOAD] [FastTextData-Version-"Any"-v-1] (1 B) from '..\\Models\--\FastTextData-Version\v-000001\model-language-detector-v-000001.bin'
[14:56:16 INF] [LOAD] [FastTextData-Version-"Any"-v-1] (1 B) from '..\\Models\--\FastTextData-Version\v-000001\model-language-detector-v-000001.bin'
"GarbageDetection.exe" (CoreCLR: clrhost): Loaded "C:\Program Files\dotnet\shared\Microsoft.NETCore.App\5.0.6\System.Runtime.CompilerServices.Unsafe.dll". Skipped loading symbols. Module is optimized and the debugger option "Just My Code" is enabled.
[14:56:17 INF] [LOAD] [FastTextData-"Any"-v0] (15.4 MB) from '..\\Models\--\FastTextData\v000000\model-language-detector-v000000.bin'
"GarbageDetection.exe" (CoreCLR: clrhost): Loaded "C:\Program Files\dotnet\shared\Microsoft.NETCore.App\5.0.6\System.Resources.Writer.dll". Skipped loading symbols. Module is optimized and the debugger option "Just My Code" is enabled.
"GarbageDetection.exe" (CoreCLR: clrhost): Loaded "C:\Program Files\dotnet\shared\Microsoft.NETCore.App\5.0.6\System.Collections.NonGeneric.dll". Skipped loading symbols. Module is optimized and the debugger option "Just My Code" is enabled.
"GarbageDetection.exe" (CoreCLR: clrhost): Loaded "C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App\5.0.6\System.Configuration.ConfigurationManager.dll". Skipped loading symbols. Module is optimized and the debugger option "Just My Code" is enabled.
[14:56:19 INF] [B] Initializing Entries
[14:56:22 INF] [E] Initializing Entries in 2.8300 seconds at 413,653 oper/s, total of 1,170,682 operations
[14:56:22 INF] [LOAD] [SentenceDetectorModel-Version-"English"-v-1] (1 B) from '..\\Models\en\SentenceDetectorModel-Version\v-000001\model-v-000001.bin'
[14:56:22 INF] [LOAD] [SentenceDetectorModel-Version-"English"-v-1] (1 B) from '..\\Models\en\SentenceDetectorModel-Version\v-000001\model-v-000001.bin'
"GarbageDetection.exe" (CoreCLR: clrhost): Loaded "C:\Program Files\dotnet\shared\Microsoft.NETCore.App\5.0.6\System.Security.Cryptography.Csp.dll". Skipped loading symbols. Module is optimized and the debugger option "Just My Code" is enabled.

theolivenbaum commented 2 years ago

Dear @joerglang, the online model repository has now been deprecated. Could you try the per-language NuGet packages instead?

You can find them all on NuGet, for example for English: https://www.nuget.org/packages/catalyst.models.english

You need to register the models before using any pipeline or model, by calling this early in your code:

Catalyst.Models.English.Register();

Also, just so you know, the FastTextLanguageDetector model has not been published to NuGet yet (see #63), but you can use the CLD2 model just fine:

var cld2LanguageDetector = await LanguageDetector.FromStoreAsync(Language.Any, Version.Latest, "");
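
Putting the registration step together with loading a pipeline, a minimal console sketch might look like the following. It assumes the Catalyst and Catalyst.Models.English NuGet packages are installed and follows the pattern shown in the Catalyst README; the storage folder name and the sample sentence are placeholders, and this has not been checked against the setup reported in this issue.

```csharp
using System;
using Catalyst;
using Mosaik.Core;

// Register the English models shipped in the Catalyst.Models.English package.
// This has to happen before any pipeline or model for that language is requested.
Catalyst.Models.English.Register();

// Local folder Catalyst uses for storage (the folder name is only an example).
Storage.Current = new DiskStorage("catalyst-models");

// Await the pipeline instead of blocking on the task.
var nlp = await Pipeline.ForAsync(Language.English);

// Process a single document and dump the result as JSON.
var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);
nlp.ProcessSingle(doc);

Console.WriteLine(doc.ToJson());
```

In a WinForms application, the same calls could be placed in an async event handler rather than a constructor, so that the UI thread is not blocked while the pipeline loads.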

KoalaBear84 commented 2 years ago

I really cannot get it to work. Could you please check if this works? I have it running in .NET 6, which might be a problem.

I have installed Catalyst and Catalyst.Models.English. It would be nice if there could be a Catalyst.Models.All which depends on all available languages so it's only a single package.

ConsoleApp_20220110_1501_DetectLanguage.zip

using Catalyst;
using Catalyst.Models;
using Mosaik.Core;
using Version = Mosaik.Core.Version;

string text = "What is this language?";

Console.WriteLine("Downloading/reading language detection models..");
const string modelFolderName = "catalyst-models";

if (!new DirectoryInfo(modelFolderName).Exists)
{
    Console.WriteLine("- Downloading for the first time, so this may take a little while");
}

Storage.Current = new DiskStorage(modelFolderName);

// You need to pre-register each language (and install the respective NuGet Packages)
English.Register();

LanguageDetector? cld2LanguageDetector = await LanguageDetector.FromStoreAsync(Language.Any, Version.Latest, "");

Document? document = new Document(text);
cld2LanguageDetector.Process(document);

Console.WriteLine(text);
Console.WriteLine($"Detected language: {document.Language}");

NKuzichkin commented 2 years ago

I also can't get language detection to work. I'm using the code from the example (dated October 17, 2021) and I'm getting the same error: "Unable to find the specified file."

The following error is displayed in the console:

fail: Mosaik.Core.ObjectStore[0]
      [LOAD-ERR] LanguageDetectorModel-Any-v0 from '..\\Models\--\LanguageDetectorModel\v000000\model-v000000.binz'
      System.IO.FileNotFoundException: Unable to find the specified file.
         at Mosaik.Core.DiskStorage.OpenLockedStreamAsync(String path, FileAccess access)
         at Mosaik.Core.ObjectStore.LoadAsync[T](IStorageTarget storeTarget, Language language, String modelType, Int32 version, String tag, Boolean compress)

Could you please tell me what to fix to get language detection working?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.