curiosity-ai / catalyst

🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
MIT License

"Collection was modified; enumeration operation may not execute" thrown by await FastTextLanguageDetector.FromStoreAsync in .NET Core 3.1 #48

Closed ProductiveRage closed 3 years ago

ProductiveRage commented 3 years ago

The following code throws an InvalidOperationException with the message "Collection was modified; enumeration operation may not execute." on the line that calls await FastTextLanguageDetector.FromStoreAsync when the application targets .NET Core 3.1.

However, it works fine when targeting .NET 5!

using System;
using System.IO;
using System.Threading.Tasks;
using Catalyst;
using Catalyst.Models;
using Mosaik.Core;
using Version = Mosaik.Core.Version;

namespace CatalystSimilarityExample
{
    class Program
    {
        static async Task Main()
        {
            const string modelFolderName = "catalyst-models";
            Storage.Current = new OnlineRepositoryStorage(new DiskStorage(modelFolderName));
            var languageDetector = await FastTextLanguageDetector.FromStoreAsync(
                Language.Any,
                Version.Latest,
                ""
            );
        }
    }
}

The stack trace shows this:

at System.ThrowHelper.ThrowInvalidOperationException_InvalidOperation_EnumFailedVersion()
at System.Collections.Generic.Dictionary`2.KeyCollection.Enumerator.MoveNext()
at Catalyst.Models.FastText.CompactSupervisedModel()
at Catalyst.Models.FastTextLanguageDetector.<FromStoreAsync>d__5.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at CatalystSimilarityExample.Program.<Main>d__0.MoveNext() in C:\\Users\\Dan\\source\\repos\\ParallelLinqExample\\CatalystSimilarityExample\\Program.cs:line 17
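For context, this exception is what `Dictionary<TKey,TValue>` throws whenever the collection is mutated while an enumerator over it is still active, which is what the top two frames of the trace point at inside `CompactSupervisedModel`. A minimal, standalone repro of the same failure mode (unrelated to Catalyst itself) looks like this:

```csharp
using System;
using System.Collections.Generic;

class EnumFailedVersionDemo
{
    static void Main()
    {
        var counts = new Dictionary<string, int> { ["en"] = 1, ["de"] = 2 };

        try
        {
            // Adding entries while iterating the Keys collection invalidates
            // the enumerator's internal version check and throws.
            foreach (var key in counts.Keys)
            {
                counts[key + "-copy"] = counts[key];
            }
        }
        catch (InvalidOperationException ex)
        {
            // "Collection was modified; enumeration operation may not execute."
            Console.WriteLine(ex.Message);
        }
    }
}
```

The usual fix is to snapshot the keys first (e.g. iterate over `counts.Keys.ToArray()`), which is presumably the kind of change the maintainer made below.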
gillonba commented 3 years ago

I am seeing this as well when trying to run the code from the Language Detection sample. I wonder if we should consider this project to be "Early Access" or .NET 5 only? It looks like a very cool project though!

gillonba commented 3 years ago

Also, in my testing of the same code, the LanguageDetector was only 56.8% accurate. For example code, not good! I run the samples to decide whether I want to use the library for real, and that is somewhat less than inspiring. Maybe I should just bite the bullet and call spaCy scripts from my application? It would be much better to use pure .NET, but only if it works. Maybe I need to look into upgrading to .NET 5?

theolivenbaum commented 3 years ago

@gillonba I've seen similar issues recently with the fasttext language detector, need to investigate if something weird is going on on net50.

Can you try meanwhile the other model for language detection?

var langDetect = await LanguageDetector.FromStoreAsync(Language.Any, Version.Latest, "");
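For anyone following along, a minimal sketch of how the suggested workaround slots into the original repro (same `Storage` setup as in the report above; the `Process`/`doc.Language` pattern follows the Catalyst samples, and the German test sentence is just an illustrative input):

```csharp
using System;
using System.Threading.Tasks;
using Catalyst;
using Catalyst.Models;
using Mosaik.Core;
using Version = Mosaik.Core.Version;

class Program
{
    static async Task Main()
    {
        Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models"));

        // Load the non-FastText detector as a workaround for the loading bug.
        var langDetect = await LanguageDetector.FromStoreAsync(Language.Any, Version.Latest, "");

        // Detection is done by processing a document; the result lands on doc.Language.
        var doc = new Document("Dies ist ein kurzer deutscher Satz.");
        langDetect.Process(doc);
        Console.WriteLine(doc.Language);
    }
}
```

Note this requires the Catalyst NuGet package, so it is a sketch rather than something the thread verified end to end.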
theolivenbaum commented 3 years ago

@ProductiveRage fixed the bug with loading the FT model - it was a recent memory optimization added to the FT model that broke loading classifier models from disk. @gillonba building a new version of Catalyst now, you should be able to test again. Regarding accuracy, if you could provide some samples of the data you're testing with, I could check what the issue is.

gillonba commented 3 years ago

Good deal, I'll have another look. I just used the dataset and code from the LanguageDetection sample and added a counter to track the number correct vs. the total. I don't have it in front of me, and I don't recall if it was the long or short dataset (long, I think), and of course I was only able to run the LanguageDetector. If you are seeing better results, maybe I am doing something wrong? I look forward to trying FastText!

theolivenbaum commented 3 years ago

was the 56% for all the languages in the set, or for only one language? I think the model won't perform too well on rare languages - probably needs some fine-tuning of how we tokenize input text...

gillonba commented 3 years ago

All languages in the Data file provided with the example. I just count the number of times the predicted language matches the language of the sample. Specifically the Long sample, I think. Is there any guidance at this point on how long the sample should be to provide accurate results?
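For reference, the counting approach described above amounts to something like the following (a sketch, assuming the `(text, expected language)` pairs have already been loaded from the sample's data file - the loading itself is elided, and `Measure` is a hypothetical helper name):

```csharp
using System.Collections.Generic;
using Catalyst;
using Catalyst.Models;
using Mosaik.Core;

static class AccuracyCheck
{
    // samples: pairs of sample text and the language each is labelled with.
    public static double Measure(LanguageDetector detector,
                                 IEnumerable<(string Text, Language Expected)> samples)
    {
        int total = 0, correct = 0;
        foreach (var (text, expected) in samples)
        {
            var doc = new Document(text);
            detector.Process(doc);
            total++;
            if (doc.Language == expected) correct++;
        }
        return total == 0 ? 0.0 : (double)correct / total;
    }
}
```

With this kind of whole-set average, a handful of rare or closely related languages mispredicting can drag the overall number down even when common languages detect well, which may explain the 56.8% figure.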

ProductiveRage commented 3 years ago

@ProductiveRage fixed the bug with loading the FT model - was a recent memory optimization added to the FT model that broke loading classifier models from disk.

Sorry, @theolivenbaum, I missed this update somehow - I can confirm that I've tested with 3.1 and it works fine now! I'm closing this issue even though there seems to be an open question from @gillonba about detection accuracy. I suspect that should be a separate issue?