System Information (please complete the following information):
OS & Version: [e.g. Windows 11]
ML.NET Version: [e.g. ML.NET v3.0.0]
.NET Version: [e.g. .NET 8.0]
Describe the bug
I am trying to embed a list of French sentences. The goal is to generate an embedded dataset so that I can then apply cosine similarity to it. The default FastTextWikipedia300D model is the English wiki, so I downloaded the French one from https://fasttext.cc/docs/en/pretrained-vectors.html (wiki.fr.vec is in the build output directory and is set to always copy).
I have tried a lot of code but I can't figure out why it is not working. I also tried issue 5532. The generated output is always the same.
After some work I noticed that if wiki.en.vec is manually placed in the folder "AppData\Local\mlnet-resources\WordVectors", it works when I use FastTextWikipedia300D.
So there is an issue when the full path to the file is set manually in ApplyWordEmbedding.
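To help narrow this down, a small check that does not involve ML.NET can confirm that the file at the given path is readable, report the dimension from its header, and show whether a few tokens from the input actually occur in it. This is only a minimal sketch, assuming the standard fastText text format (first line `<word count> <dimension>`, then one word per line followed by its vector). `VecFileCheck`, `ParseHeader`, `FindTokens` and `Report` are hypothetical helper names, not part of ML.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

public static class VecFileCheck
{
    // Parses the fastText .vec header line: "<word count> <dimension>".
    public static (int WordCount, int Dimension) ParseHeader(string headerLine)
    {
        var parts = headerLine.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length != 2)
            throw new FormatException($"Unexpected header: '{headerLine}'");
        return (int.Parse(parts[0], CultureInfo.InvariantCulture),
                int.Parse(parts[1], CultureInfo.InvariantCulture));
    }

    // Returns the subset of tokens that appear as the first field of a vector line.
    public static HashSet<string> FindTokens(IEnumerable<string> vectorLines, IEnumerable<string> tokens)
    {
        var wanted = new HashSet<string>(tokens, StringComparer.Ordinal);
        var found = new HashSet<string>(StringComparer.Ordinal);
        foreach (var line in vectorLines)
        {
            int space = line.IndexOf(' ');
            if (space <= 0) continue;               // skip malformed or empty lines
            var word = line.Substring(0, space);
            if (wanted.Contains(word)) found.Add(word);
            if (found.Count == wanted.Count) break; // every token located, stop early
        }
        return found;
    }

    // Prints whether the file exists, its header values, and which tokens it contains.
    public static void Report(string path, params string[] tokens)
    {
        if (!File.Exists(path))
        {
            Console.WriteLine($"File not found: {path}");
            return;
        }
        var header = ParseHeader(File.ReadLines(path).First());
        Console.WriteLine($"Words: {header.WordCount}, dimension: {header.Dimension}");
        var found = FindTokens(File.ReadLines(path).Skip(1), tokens);
        foreach (var t in tokens)
            Console.WriteLine($"{t}: {(found.Contains(t) ? "present" : "missing")}");
    }
}
```

Calling `VecFileCheck.Report(@"c:/wiki.fr.vec", "bonjour", "été")` with a couple of tokens taken from the cleaned text would show whether the path and the vocabulary lookup behave as expected outside the pipeline. The exact-match comparison is a deliberate simplification: it does not reproduce whatever normalization the pipeline applies before the lookup.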
To Reproduce
Steps to reproduce the behavior:
var cast = allDataEnumerable.Select(x => new TextData() { Text = x.TextCleaned }).ToList();
var dataView = mlContext.Data.LoadFromEnumerable(cast);

var pipeline = mlContext.Transforms.Text.NormalizeText("Text")
    .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Text"))
    .Append(mlContext.Transforms.Text.ApplyWordEmbedding("Features", @"c:/wiki.fr.vec", "Tokens"));

var transformer = pipeline.Fit(dataView);
var transformedData = transformer.Transform(dataView);
var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData, TextFeatures>(transformer);

foreach (var item in allDataEnumerable)
{
    var prediction = predictionEngine.Predict(new TextData() { Text = item.TextCleaned });
    Console.WriteLine($"Number of Features: {prediction.Features.Length}");

    // Print the embedding vector.
    Console.Write("Features: ");
    foreach (var f in prediction.Features)
        Console.Write($"{f:F4} ");
    Console.WriteLine();
}

public class TextFeatures
{
    [VectorType(300)]
    public float[] Features { get; set; }
}
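For context, the cosine similarity step that these embeddings are meant to feed could look like the sketch below. It is only an illustration of the intended use, not code from my project: `VectorMath` and `CosineSimilarity` are hypothetical names, and returning 0 for an all-zero vector is an assumption made here to avoid dividing by zero.

```csharp
using System;

public static class VectorMath
{
    // Cosine similarity: dot(a, b) / (|a| * |b|).
    // Returns 0 when either vector has zero norm (a convention chosen for this sketch).
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }

        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

With the behavior described in this report, where every sentence produces the same Features vector, any pair of sentences would get the same score, which is why the embeddings are unusable for the comparison.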