curiosity-ai / catalyst

🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
MIT License

How to store models locally? #56

Closed: joslat closed this issue 2 years ago

joslat commented 3 years ago

The README says: "When using the new model packages, you can usually remove this line from your code: Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models"));, or replace it with Storage.Current = new DiskStorage("catalyst-models") if you are storing your own models locally."

But it is not really clear how to download and locate the models. Is it enough to add references to the NuGet packages, or does something more need to be done?

Describe the solution you'd like: A clear and concise description of how to do this is needed if we want a solution that works without internet connectivity.

Describe alternatives you've considered: Could this be clarified? If so, I volunteer to write a sample for the samples section ;) - you know, it's good to give back! ;)

joslat commented 3 years ago

Hi, I tried doing this, but I am unsure whether it is correct...

I added some of the new NuGet packages, Catalyst.Models.{language here}, to the sample app.

Then, instead of: Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models")); I put: Storage.Current = new DiskStorage("catalyst-models");

And that's it, voilà! It seems to work, but that could be because the models were already cached...

The code is exactly the same as the sample at https://github.com/curiosity-ai/catalyst/blob/master/samples/LanguageDetection/Program.cs except for the change on line 27...

Is this correct? :)

decay29 commented 3 years ago

I did the same thing with the NuGet packages, except I did not use Storage.Current. I just called

Catalyst.Models.English.Register();

as I am only using English, and that seemed to do it.

To test it out, I copied my debug folder to the Windows Sandbox, turned off networking in the sandbox, and it worked fine.

joslat commented 3 years ago

Thanks! - I will try it out tomorrow!!

joslat commented 3 years ago

Hi @decay29, I tried that: the Catalyst.Models.{put language here}.Register(); call worked, but the commands that follow it did not. Looking at them, I am using var cld2LanguageDetector = await LanguageDetector.FromStoreAsync(Language.Any, Version.Latest, ""); which says "FromStore...", so I guess the store still needs to be set up with Storage.Current = new DiskStorage("catalyst-models");

It would be great if this were a bit clearer... but it also seems that the operational model is in the process of being switched, so all of this might change (that is what my intuition says, which could be completely wrong... 😅)

theolivenbaum commented 2 years ago

Hi @joslat,

I've now deprecated the online model storage, so everything loads from the NuGet packages. The method still needs to exist because of how the pipeline loading is currently hooked up, but if you look at its source code, you'll see that it now loads from the assembly resource:

public new static async Task<LanguageDetector> FromStoreAsync(Language language, int version, string tag)
{
    var a = new LanguageDetector(version, tag);

    try
    {
        // The model ships as a Deflate-compressed resource embedded in the Catalyst assembly.
        using var sr1 = typeof(LanguageDetector).Assembly.GetManifestResourceStream("Catalyst.Resources.LanguageDetector.binz");
        using var decompressed = new MemoryStream();
        using (var ds = new DeflateStream(sr1, CompressionMode.Decompress, leaveOpen: true))
        {
            await ds.CopyToAsync(decompressed);
            decompressed.Seek(0, SeekOrigin.Begin); // rewind before deserializing
            a.Data = MessagePack.MessagePackSerializer.Deserialize<LanguageDetectorModel>(decompressed, Pipeline.LZ4Standard);
            a.Version = 0;
        }
    }
    catch
    {
        // The embedded resource is missing or could not be read: fall back to LoadDataAsync.
        await a.LoadDataAsync();
    }

    return a;
}

I still have to publish the WikiNER and FastTextLanguageDetector packages (see #63), but everything else has been migrated.
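The decompression step in the method above can be tried in isolation. The following is a minimal sketch using only the .NET standard library; the class and method names (DeflateRoundTrip, Compress, DecompressAsync) are hypothetical and are not part of Catalyst, and the MessagePack deserialization step is left out.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

// Hypothetical helper, not part of Catalyst: illustrates the
// "DeflateStream -> MemoryStream -> rewind" pattern on its own.
public static class DeflateRoundTrip
{
    // Compresses raw bytes with Deflate, standing in for a packaged resource.
    public static byte[] Compress(byte[] raw)
    {
        using var output = new MemoryStream();
        using (var ds = new DeflateStream(output, CompressionMode.Compress, leaveOpen: true))
        {
            ds.Write(raw, 0, raw.Length);
        }
        // The DeflateStream is disposed here, so all compressed data has been flushed.
        return output.ToArray();
    }

    // Decompresses the source into a MemoryStream and rewinds it,
    // so that a deserializer could read it from the start.
    public static async Task<MemoryStream> DecompressAsync(Stream source)
    {
        var decompressed = new MemoryStream();
        using (var ds = new DeflateStream(source, CompressionMode.Decompress, leaveOpen: true))
        {
            await ds.CopyToAsync(decompressed);
        }
        decompressed.Seek(0, SeekOrigin.Begin);
        return decompressed;
    }
}
```

Passing the output of Compress (wrapped in a MemoryStream) to DecompressAsync should return the original bytes. In the library method, the source stream would instead be the embedded resource returned by GetManifestResourceStream.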

wenbin97 commented 2 years ago

Hello @curiosity-ai.

I've tried using Catalyst.Models.English.Register(); but my project, which targets netcoreapp2.2, does not seem to recognize it. Should I migrate to the latest .NET Core, or is something else wrong?

StanislavPrusac commented 2 years ago

Hi @theolivenbaum, I found that "Catalyst.Resources.LanguageDetector.binz" does not exist in the assembly resources.

In "catalyst.dll" in "catalyst.1.0.25056.nupkg" (and a few versions before it), the "LanguageDetector.binz" file should contain binary data, but instead it contains this text:

version https://git-lfs.github.com/spec/v1
oid sha256:0ce9d008f20ca4f099ba4eee75eddffbb5dfeb2a967467ee5a640fb5bc31c5c0
size 1166200 

There seems to be an issue with Git Large File Storage (LFS).

As a result, language detection via the CLD2 module does not work.

This line of code: var cld2LanguageDetector = await LanguageDetector.FromStoreAsync(Language.Any, Version.Latest, ""); throws an exception: System.IO.FileNotFoundException: 'Unable to find the specified file.'

This can be checked with the "./samples/LanguageDetection" project, which throws this exception with every version from the last few months.

I built a Catalyst DLL with the correct LanguageDetector.binz file, and language detection via the CLD2 module works fine.
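One way to check a package for this problem is to test whether a resource starts with the Git LFS pointer header shown above, rather than with binary data. The following is a minimal sketch using only the .NET standard library; the names LfsPointerCheck, LooksLikeLfsPointer and ResourceIsLfsPointer are hypothetical and not part of Catalyst.

```csharp
using System;
using System.IO;
using System.Reflection;
using System.Text;

// Hypothetical helper, not part of Catalyst.
public static class LfsPointerCheck
{
    // First line of a Git LFS pointer file, as seen in the packaged resource.
    private const string PointerPrefix = "version https://git-lfs.github.com/spec/v1";

    // Returns true when the stream starts with the LFS pointer header,
    // which means the real binary payload was never checked out.
    public static bool LooksLikeLfsPointer(Stream stream)
    {
        var expected = Encoding.ASCII.GetBytes(PointerPrefix);
        var buffer = new byte[expected.Length];
        int read = 0;

        // Stream.Read may return fewer bytes than requested, so loop until
        // the buffer is full or the stream ends.
        while (read < buffer.Length)
        {
            int n = stream.Read(buffer, read, buffer.Length - read);
            if (n == 0) break;
            read += n;
        }

        return read == expected.Length && buffer.AsSpan().SequenceEqual(expected);
    }

    // Looks up an embedded resource by name. Returns null when the
    // resource does not exist in the assembly at all.
    public static bool? ResourceIsLfsPointer(Assembly assembly, string resourceName)
    {
        using var stream = assembly.GetManifestResourceStream(resourceName);
        return stream is null ? (bool?)null : LooksLikeLfsPointer(stream);
    }
}
```

For the case described here, the call would be LfsPointerCheck.ResourceIsLfsPointer(typeof(LanguageDetector).Assembly, "Catalyst.Resources.LanguageDetector.binz"). A result of true would indicate that the package was built from an LFS pointer file instead of the real model.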

theolivenbaum commented 2 years ago

Hi @StanislavPrusac, thanks for the investigation! I'll get back to it asap!

theolivenbaum commented 2 years ago

@StanislavPrusac closed with https://github.com/curiosity-ai/catalyst/commit/cd575bfa2ce3e114a6ea03770ae440e439b283cd. The next release on NuGet should include the fixed version.