curiosity-ai / catalyst

🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
MIT License

Benchmarking Information #28

Closed nimasTT closed 2 years ago

nimasTT commented 4 years ago

Can you please add some more information comparing Catalyst with spaCy? For example, is the accuracy the same as spaCy v2?

joslat commented 3 years ago

Why not benchmark it yourself? I can do that, but... do you have a NuGet package or GitHub repo for spaCy/spaCy v2?

joslat commented 3 years ago

Hi @nimasTT, I did this for myself, but I'm happy to share the code snippet ;)

It is essentially the same code as the sample at https://github.com/curiosity-ai/catalyst/blob/master/samples/LanguageDetection/Program.cs, with some changes to time each step.

```csharp
// Usings follow the linked LanguageDetection sample; adjust them to your Catalyst version.
using System;
using System.Diagnostics;
using System.Text;
using System.Threading.Tasks;
using Catalyst;
using Catalyst.Models;
using Microsoft.Extensions.Logging;
using Mosaik.Core;
using Version = Mosaik.Core.Version;

public static class Program
{
    private const string LongText_es = "Hay una creciente necesidad de lidiar con documentos multilingües hoy. Si pudiéramos segmentar documentos multilingües en términos lingüísticos, sería muy útil tanto para la exploración de fenómenos lingüísticos, como el cambio de código y la mezcla de código, como para el procesamiento computacional de cada segmento, según corresponda. La identificación del lenguaje a partir de un pequeño texto dado es, por lo tanto, un problema importante. Este documento trata sobre la identificación del idioma a partir de pequeñas muestras de texto.";

    public static async Task Main(string[] args)
    {
        Console.WriteLine("Trying Catalyst!!");
        Stopwatch stopWatch = Stopwatch.StartNew();
        TimeSpan tsPrevious = stopWatch.Elapsed;
        Console.OutputEncoding = Encoding.UTF8;
        ApplicationLogging.SetLoggerFactory(LoggerFactory.Create(lb => lb.AddConsole()));

        // Configures the model storage to use the online repository backed by the local folder ./catalyst-models/
        Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models"));

        // Load the CLD2-based detector (downloads the model on first run).
        var cld2LanguageDetector = await LanguageDetector.FromStoreAsync(Language.Any, Version.Latest, "");
        Console.WriteLine("Time to load the cld2LanguageDetector and models...");
        TimeSpan ts = stopWatch.Elapsed;
        PrintElapsed(ts, tsPrevious);

        // Load the FastText-based detector.
        var fastTextLanguageDetector = await FastTextLanguageDetector.FromStoreAsync(Language.Any, Version.Latest, "");
        Console.WriteLine("Time to load the FastTextLanguageDetector and models...");
        tsPrevious = ts;
        ts = stopWatch.Elapsed;
        PrintElapsed(ts, tsPrevious);

        // Time a single detection with FastText.
        var doc = new Document(LongText_es);
        fastTextLanguageDetector.Process(doc);
        Console.WriteLine("Time to process a text with the FastTextLanguageDetector and models...");
        tsPrevious = ts;
        ts = stopWatch.Elapsed;
        PrintElapsed(ts, tsPrevious);

        // Time a single detection with CLD2 on a fresh document.
        var doc2 = new Document(LongText_es);
        cld2LanguageDetector.Process(doc2);
        Console.WriteLine("Time to process a text with the cld2LanguageDetector and models...");
        tsPrevious = ts;
        ts = stopWatch.Elapsed;
        PrintElapsed(ts, tsPrevious);

        Console.WriteLine($"FT:\t{doc.Language}\nCLD2:\t{doc2.Language}");

        Console.ReadLine();
    }

    // Prints the time elapsed between two stopwatch readings.
    private static void PrintElapsed(TimeSpan tsNew, TimeSpan tsPrevious)
    {
        var ts = tsNew - tsPrevious;

        // Format and display the TimeSpan value as hh:mm:ss.cc (hundredths of a second).
        string elapsedTime = string.Format("{0:00}:{1:00}:{2:00}.{3:00}",
            ts.Hours, ts.Minutes, ts.Seconds,
            ts.Milliseconds / 10);
        Console.WriteLine("RunTime " + elapsedTime);
    }
}
```
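
One caveat with the snippet above: each step is timed exactly once, so the numbers include JIT compilation and other first-call costs. A minimal sketch of a warm-up-and-repeat helper, using only `System.Diagnostics.Stopwatch`, might look like the following. The `Timing` class and `MedianTime` method are hypothetical names made up for this illustration; they are not part of Catalyst.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Hypothetical helper (not part of Catalyst): runs `action` a few times
    // untimed, then returns the median wall-clock time over `iterations` runs.
    public static TimeSpan MedianTime(Action action, int warmup = 3, int iterations = 10)
    {
        if (action == null) throw new ArgumentNullException(nameof(action));
        if (warmup < 0) throw new ArgumentOutOfRangeException(nameof(warmup));
        if (iterations < 1) throw new ArgumentOutOfRangeException(nameof(iterations));

        // Warm-up runs let the JIT compile the code path before we measure.
        for (int i = 0; i < warmup; i++) action();

        var samples = new long[iterations];
        for (int i = 0; i < iterations; i++)
        {
            var sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            samples[i] = sw.ElapsedTicks;
        }

        // The median is less sensitive to GC pauses than the mean
        // (for an even count this picks the upper of the two middle samples).
        Array.Sort(samples);
        long median = samples[iterations / 2];

        // Stopwatch ticks are not TimeSpan ticks; convert via Stopwatch.Frequency.
        return TimeSpan.FromTicks(median * TimeSpan.TicksPerSecond / Stopwatch.Frequency);
    }
}
```

With the detectors from the snippet already loaded, a call such as `Timing.MedianTime(() => fastTextLanguageDetector.Process(new Document(LongText_es)))` would give a steadier per-call figure than a single reading.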

You can apply the same principle to anything else you want to measure. Of course, you should also watch memory usage with a tool such as dotMemory, or use BenchmarkDotNet (https://benchmarkdotnet.org/), which can report both timings and allocations.
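
As a rough sketch of the BenchmarkDotNet route, the two detectors could be wrapped in a benchmark class like the one below. The Catalyst calls are copied from the snippet above, and the usings are assumed to match the linked sample. The class name `LanguageDetectionBenchmarks`, the field names, and the shortened `Text` constant are made up for this illustration, and the sketch has not been run against a specific Catalyst version.

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using Catalyst;
using Catalyst.Models;
using Mosaik.Core;
using Version = Mosaik.Core.Version;

[MemoryDiagnoser] // also report allocated bytes per operation
public class LanguageDetectionBenchmarks
{
    // Shortened placeholder; substitute LongText_es from the snippet above.
    private const string Text = "Hay una creciente necesidad de lidiar con documentos multilingües hoy.";

    private LanguageDetector _cld2;
    private FastTextLanguageDetector _fastText;

    // Model loading happens once, outside the measured methods.
    [GlobalSetup]
    public void Setup()
    {
        Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models"));
        _cld2 = LanguageDetector.FromStoreAsync(Language.Any, Version.Latest, "").GetAwaiter().GetResult();
        _fastText = FastTextLanguageDetector.FromStoreAsync(Language.Any, Version.Latest, "").GetAwaiter().GetResult();
    }

    [Benchmark(Baseline = true)]
    public Document Cld2()
    {
        var doc = new Document(Text);
        _cld2.Process(doc);
        return doc; // returning the result keeps the work from being optimized away
    }

    [Benchmark]
    public Document FastText()
    {
        var doc = new Document(Text);
        _fastText.Process(doc);
        return doc;
    }
}

public static class Program
{
    public static void Main(string[] args) => BenchmarkRunner.Run<LanguageDetectionBenchmarks>();
}
```

BenchmarkDotNet expects a Release build, so run it with `dotnet run -c Release`. Note that this compares speed and allocations only; it says nothing about accuracy, which would need a labeled test set.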

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.