curiosity-ai / catalyst

🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
MIT License

Language detection is non-deterministic #71

Open diegosasw opened 2 years ago

diegosasw commented 2 years ago

Language detection is not deterministic. The same text is sometimes correctly detected as Spanish and other times detected as Portuguese.

Is this expected?

Sample:

public class CatalystLanguageDetector
    : ILanguageDetector
{
    private readonly IDictionary<string, string> _supportedLanguages =
        new Dictionary<string, string>
        {
            { "Bulgarian", "bg" },
            { "Czech", "cs" },
            { "Danish", "da" },
            { "German", "de" },
            { "Greek_Modern", "el" },
            { "English", "en" },
            { "Spanish", "es" },
            { "Estonian", "et" },
            { "Finnish", "fi" },
            { "French", "fr" },
            { "Hungarian", "hu" },
            { "Italian", "it" },
            { "Japanese", "ja" },
            { "Lithuanian", "lt" },
            { "Latvian", "lv" },
            { "Dutch", "nl" },
            { "Polish", "pl" },
            { "Portuguese", "pt" },
            { "Romanian", "ro" },
            { "Russian", "ru" },
            { "Slovak", "sk" },
            { "Slovenian", "sl" },
            { "Swedish", "sv" },
            { "Chinese", "zh" }
        };

    public async Task<LanguageDetectorResult> Detect(string text, CancellationToken cancellationToken = default)
    {
        Bulgarian.Register();
        Czech.Register();
        Danish.Register();
        German.Register();
        Greek_Modern.Register();
        English.Register();
        Spanish.Register();
        Estonian.Register();
        Finnish.Register();
        French.Register();
        Hungarian.Register();
        Italian.Register();
        Japanese.Register();
        Lithuanian.Register();
        Latvian.Register();
        Dutch.Register();
        Polish.Register();
        Portuguese.Register();
        Romanian.Register();
        Russian.Register();
        Slovak.Register();
        Slovenian.Register();
        Swedish.Register();
        Chinese.Register();

        Storage.Current = new DiskStorage("catalyst-models");

        //var fastTextLanguageDetector = await FastTextLanguageDetector.FromStoreAsync(Language.Any, Version.Latest, "");
        var cld2LanguageDetector     = await Catalyst.Models.LanguageDetector.FromStoreAsync(Language.Any, Version.Latest, "");

        var doc = new Document(text);
        cld2LanguageDetector.Process(doc);

        var isSupported = _supportedLanguages.TryGetValue(doc.Language.ToString()!, out var languageCode);
        if (!isSupported || languageCode is null)
        {
            languageCode = string.Empty;
        }
        var result =
            new LanguageDetectorResult
            {
                Text = text,
                TextLanguageCode = languageCode
            };

        return result;
    }
}

Tests

[Theory]
[InlineData("Hay una creciente necesidad de lidiar con documentos multilingües hoy. Si pudiéramos segmentar documentos multilingües en términos lingüísticos, sería muy útil tanto para la exploración de fenómenos lingüísticos, como el cambio de código y la mezcla de código, como para el procesamiento computacional de cada segmento, según corresponda. La identificación del lenguaje a partir de un pequeño texto dado es, por lo tanto, un problema importante. Este documento trata sobre la identificación del idioma a partir de pequeñas muestras de texto.", "es")]
public async Task Then_It_Should_Detect_Expected_Language_Code(string text, string expectedLanguageCode)
{
    // Given
    var serviceProvider =
        new ServiceCollection()
            .AddCatalystLanguageDetector()
            .BuildServiceProvider(
                new ServiceProviderOptions
                {
                    ValidateScopes = true,
                    ValidateOnBuild = true
                });

    var sut = serviceProvider.GetRequiredService<ILanguageDetector>();

    // When
    var result = await sut.Detect(text);
    var expectedResult =
        new LanguageDetectorResult
        {
            Text = text,
            TextLanguageCode = expectedLanguageCode
        };

    // Then
    result.Should().BeEquivalentTo(expectedResult);
}

With no changes to the code, the very same text is sometimes correctly detected as Spanish (es), and sometimes the test fails because it is detected as Portuguese (pt):

Expected property result.TextLanguageCode to be "es", but "pt" differs near "pt" (index 0).

Occasionally it is even detected as English.
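To put numbers on the flakiness, the same text can be run through the detector repeatedly and the returned codes tallied. The following is a minimal sketch, not part of Catalyst: `DetectionTally` and `CountAsync` are hypothetical names introduced for illustration, and the helper only assumes the detector can be wrapped as a function from text to a language code.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper (not a Catalyst API): counts how often each
// language code is returned when the same text is detected repeatedly.
public static class DetectionTally
{
    public static async Task<IReadOnlyDictionary<string, int>> CountAsync(
        Func<string, Task<string>> detect, string text, int runs)
    {
        var counts = new Dictionary<string, int>();
        for (var i = 0; i < runs; i++)
        {
            // Call the detector on the unchanged input each time.
            var code = await detect(text);
            counts[code] = counts.TryGetValue(code, out var n) ? n + 1 : 1;
        }
        return counts;
    }
}
```

With the detector from the sample, this could be called as `await DetectionTally.CountAsync(async t => (await sut.Detect(t)).TextLanguageCode, text, 50)`. A deterministic detector would yield a single key; more than one key confirms the behaviour described here. If the result only varies between process launches rather than within one run, the tally would have to be collected across separate test runs instead.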

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

theolivenbaum commented 2 years ago

Still relevant