dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Bugs / ApplyWordEmbedding with custom path not working. #6919

Open ErwanL08 opened 6 months ago

ErwanL08 commented 6 months ago

System Information (please complete the following information):

Describe the bug

I am trying to embed a list of French sentences; the main goal is to generate an embedded dataset and then apply cosine similarity to it. The default FastTextWikipedia300D model is built from the English Wikipedia, so I downloaded the French one from https://fasttext.cc/docs/en/pretrained-vectors.html (wiki.fr.vec is in the build output directory and is set to always copy). I have tried many variations of the code but cannot figure out why it is not working; I also tried the suggestion in issue 5532. The generated output is always the same:

image

After some investigation, I noticed that if wiki.en.vec is manually placed in the folder "AppData\Local\mlnet-resources\WordVectors", it works when I use FastTextWikipedia300D.

So the issue appears when the full path to a custom model file is set manually in ApplyWordEmbedding.
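Before digging into ApplyWordEmbedding itself, it may help to rule out the file: that the path resolves, and that the file has the layout a fastText `.vec` text file normally has (a first line with the vocabulary size and the dimension, then one word per line followed by its vector components). The helper below is only a diagnostic sketch; `VecFileCheck`, `PeekLines` and `FieldCount` are hypothetical names and are not part of ML.NET.

```csharp
using System;
using System.IO;
using System.Linq;

public static class VecFileCheck
{
    // Returns the first `count` lines of the file,
    // or an empty array if the file does not exist at that path.
    public static string[] PeekLines(string path, int count = 2)
    {
        if (!File.Exists(path))
            return Array.Empty<string>();

        // File.ReadLines streams the file, so a multi-gigabyte .vec file is not loaded into memory.
        return File.ReadLines(path).Take(count).ToArray();
    }

    // Number of whitespace-separated fields on a line.
    public static int FieldCount(string line) =>
        line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries).Length;
}
```

Calling `VecFileCheck.PeekLines(@"c:/wiki.fr.vec")` and printing `FieldCount` for each returned line should, for a 300-dimensional fastText file, give 2 for the header line and 301 for a typical word line. An empty result means the path did not resolve.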

To Reproduce

Steps to reproduce the behavior:

var cast = allDataEnumerable.Select(x => new TextData() { Text = x.TextCleaned }).ToList();
var dataView = mlContext.Data.LoadFromEnumerable(cast);

var pipeline = mlContext.Transforms.Text.NormalizeText("Text")
    .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Text"))
    .Append(mlContext.Transforms.Text.ApplyWordEmbedding("Features", @"c:/wiki.fr.vec", "Tokens"));

var transformer = pipeline.Fit(dataView);
var transformedData = transformer.Transform(dataView);

var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData, TextFeatures>(transformer);

foreach (var item in allDataEnumerable)
{
    var prediction = predictionEngine.Predict(new TextData() { Text = item.TextCleaned});

    Console.WriteLine($"Number of Features: {prediction.Features.Length}");

    // Print the embedding vector.
    Console.Write("Features: ");
    foreach (var f in prediction.Features)
        Console.Write($"{f:F4} ");

    Console.WriteLine(); 
}
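
// Input row type (requires "using Microsoft.ML.Data;" for VectorType below).
public class TextData
{
    public string Text { get; set; }
}

// Output row type holding the embedding produced by the pipeline.
public class TextFeatures
{
    [VectorType(300)]
    public float[] Features { get; set; }
}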
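
For the cosine-similarity step mentioned in the description, a minimal sketch of comparing two `Features` vectors could look like the following. `EmbeddingMath` and `CosineSimilarity` are hypothetical names, not ML.NET APIs, and returning 0 when either vector is all zeros is an assumption made here to avoid a division by zero.

```csharp
using System;

public static class EmbeddingMath
{
    // Cosine similarity: dot(a, b) / (|a| * |b|).
    public static float CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        // Accumulate in double to limit rounding error over long vectors.
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        // Assumption: an all-zero vector is treated as having similarity 0.
        if (normA == 0 || normB == 0)
            return 0f;

        return (float)(dot / (Math.Sqrt(normA) * Math.Sqrt(normB)));
    }
}
```

With two predictions `p1` and `p2` from the prediction engine above, the call would be `EmbeddingMath.CosineSimilarity(p1.Features, p2.Features)`. If every sentence yields the same vector, as reported here, this value is 1 for every pair (or 0 if the vectors are all zeros), which is a quick way to confirm the symptom.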