🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
[Theory]
[InlineData("Hay una creciente necesidad de lidiar con documentos multilingües hoy. Si pudiéramos segmentar documentos multilingües en términos lingüísticos, sería muy útil tanto para la exploración de fenómenos lingüísticos, como el cambio de código y la mezcla de código, como para el procesamiento computacional de cada segmento, según corresponda. La identificación del lenguaje a partir de un pequeño texto dado es, por lo tanto, un problema importante. Este documento trata sobre la identificación del idioma a partir de pequeñas muestras de texto.", "es")]
public async Task Then_It_Should_Detect_Expected_Language_Code(string text, string expectedLanguageCode)
{
    // Given
    var serviceProvider =
        new ServiceCollection()
            .AddCatalystLanguageDetector()
            .BuildServiceProvider(
                new ServiceProviderOptions
                {
                    ValidateScopes = true,
                    ValidateOnBuild = true
                });
    var sut = serviceProvider.GetRequiredService<ILanguageDetector>();

    // When
    var result = await sut.Detect(text);
    var expectedResult =
        new LanguageDetectorResult
        {
            Text = text,
            TextLanguageCode = expectedLanguageCode
        };

    // Then
    result.Should().BeEquivalentTo(expectedResult);
}
Sometimes the very same text is correctly detected as Spanish ("es"), but sometimes the test fails because the text is detected as Portuguese ("pt"), without any change to the code.
Expected property result.TextLanguageCode to be "es", but "pt" differs near "pt" (index 0).
The language detection is not deterministic. The same text is sometimes correctly identified as Spanish and at other times identified as Portuguese.
Is this expected?
In some runs the same text is even detected as English.
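To put a number on how often each result comes back, the detection can be repeated on the same text and the returned codes tallied. The following is only a minimal sketch: `DetectionTally`, `CountAsync`, and `fakeDetect` are hypothetical names introduced here, and the stand-in detector simply alternates between "es" and "pt" so the example runs without Catalyst. In a real run the delegate would wrap the `sut.Detect` call from the test above.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class DetectionTally
{
    // Calls the detector `runs` times on the same text and counts how often
    // each language code is returned.
    public static async Task<Dictionary<string, int>> CountAsync(
        Func<string, Task<string>> detect, string text, int runs)
    {
        var counts = new Dictionary<string, int>();
        for (var i = 0; i < runs; i++)
        {
            var code = await detect(text);
            counts[code] = counts.TryGetValue(code, out var seen) ? seen + 1 : 1;
        }
        return counts;
    }
}

public static class Program
{
    public static async Task Main()
    {
        // Stand-in detector (an assumption, not Catalyst): alternates between
        // "es" and "pt" to imitate the flaky behaviour described above.
        // Replace it with something like
        //   async t => (await sut.Detect(t)).TextLanguageCode
        var call = 0;
        Func<string, Task<string>> fakeDetect =
            _ => Task.FromResult(call++ % 2 == 0 ? "es" : "pt");

        var counts = await DetectionTally.CountAsync(
            fakeDetect, "Hay una creciente necesidad", 10);

        // Print one line per language code with its number of occurrences.
        foreach (var pair in counts)
        {
            Console.WriteLine($"{pair.Key}: {pair.Value}");
        }
    }
}
```

If the real detector were deterministic, the tally would contain a single key; more than one key for identical input confirms the behaviour reported here and shows how frequent each outcome is.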