dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Verify word embedding model downloader #5532

Open justinormont opened 3 years ago

justinormont commented 3 years ago

An internal user reported a stall during the .Fit() call of the word embedding transform.

On first use of the word embedding transform, it downloads the word embedding model from the CDN.

To test:

  1. Clear any copies of the fastText300D word embedding file from the local machine
    Check the local folder and ~/.local/share/mlnet-resources/WordVectors/ for a file named wiki.en.vec
  2. Create example code using the FastTextWikipedia300D (6.6GB) in the word embedding transform
  3. Time how long it takes to download (or fail)
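
Steps 1 and 3 can be scripted. The following is a minimal sketch using only the .NET base class library. The helper names (CandidatePaths, ClearCachedCopies, Time) are hypothetical, and the two locations are simply the ones listed in step 1; the resource folder may differ on other platforms. Note that it deletes files.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper: the two locations named in step 1
// (the working directory and the per-user resource folder).
string[] CandidatePaths(string fileName)
{
    string home = Environment.GetFolderPath(Environment.SpecialFolder.UserProfile);
    return new[]
    {
        Path.Combine(Directory.GetCurrentDirectory(), fileName),
        Path.Combine(home, ".local", "share", "mlnet-resources", "WordVectors", fileName),
    };
}

// Hypothetical helper: deletes any cached copies and returns how many were removed.
int ClearCachedCopies(string fileName)
{
    int removed = 0;
    foreach (string path in CandidatePaths(fileName))
    {
        if (File.Exists(path))
        {
            File.Delete(path);
            removed++;
        }
    }
    return removed;
}

// Hypothetical helper: measures the wall-clock time of an arbitrary action.
TimeSpan Time(Action action)
{
    var stopwatch = Stopwatch.StartNew();
    action();
    stopwatch.Stop();
    return stopwatch.Elapsed;
}

int removedCount = ClearCachedCopies("wiki.en.vec");
Console.WriteLine($"Removed {removedCount} cached file(s).");

// In the real test, the action would be the Fit() call on the pipeline below,
// which triggers the download on first use.
TimeSpan elapsed = Time(() => { /* trainingPipeline.Fit(trainingData) goes here */ });
Console.WriteLine($"Elapsed: {elapsed}");
```

Wrapping the Fit() call in Time gives a single number to compare across machines and networks, and makes a stall visible as a run that never prints the elapsed time.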

Example code:

var featurizeTextOptions = new TextFeaturizingEstimator.Options()
{
    // Produce cleaned tokens for input to the word embedding transform
    OutputTokensColumnName = "OutputTokens", 

    // Text cleaning (not shown is stop word removal)
    KeepDiacritics = true, // Non-default
    KeepPunctuations = false,
    KeepNumbers = false, // Non-default
    CaseMode = TextNormalizingEstimator.CaseMode.Lower,

    // Row-wise normalization (see: NormalizeLpNorm)
    Norm = TextFeaturizingEstimator.NormFunction.L2,

    // Use ML.NET's built-in stop word remover (non-default)
    StopWordsRemoverOptions = new StopWordsRemovingEstimator.Options() { Language = TextFeaturizingEstimator.Language.English },

    // ngram options
    WordFeatureExtractor = new WordBagEstimator.Options()
    {
        NgramLength = 2,
        UseAllLengths = true, // Produce both unigrams and bigrams
        Weighting = NgramExtractingEstimator.WeightingCriteria.Tf, // Can also use TF-IDF or IDF
    },

    // chargram options
    CharFeatureExtractor = new WordBagEstimator.Options()
    {
        NgramLength = 3,
        UseAllLengths = false, // Produce only tri-chargrams and not single/double characters
        Weighting = NgramExtractingEstimator.WeightingCriteria.Tf, // Can also use TF-IDF or IDF
    },
};

// Featurization pipeline
var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label", "Label") // Needed for multi-class to convert string labels to the Key type

    // Create ngrams, and cleaned tokens for the word embedding
    .Append(mlContext.Transforms.Text.FeaturizeText("FeaturesText", featurizeTextOptions, new[] { "InputText" })) // Use above options object

    // Word embedding transform reads in the cleaned tokens from the text featurizer
    .Append(mlContext.Transforms.Text.ApplyWordEmbedding("FeaturesWordEmbedding", 
        "OutputTokens", WordEmbeddingEstimator.PretrainedModelKind.FastTextWikipedia300D))

    // Feature vector is the concatenation of the ngrams from the text transform, and the word embeddings
    .Append(mlContext.Transforms.Concatenate("Features", new[] { "FeaturesText", "FeaturesWordEmbedding" }))

    // Enable if numeric features are also included. Normalization is generally unneeded if only using the output from FeaturizeText as it's row-wise normalized w/ a L2-norm; word embeddings are also well behaved.
    //.Append(mlContext.Transforms.NormalizeMinMax("Features", "Features"))

    // Cache the featurized dataset in memory for added speed
    .AppendCacheCheckpoint(mlContext);

// Trainer 
var trainer = mlContext.MulticlassClassification.Trainers.OneVersusAll(mlContext.BinaryClassification.Trainers.AveragedPerceptron(labelColumnName: "Label", numberOfIterations: 10, featureColumnName: "Features"), labelColumnName: "Label")
    .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel", "PredictedLabel"));

var trainingPipeline = pipeline.Append(trainer);

The code above is a full example of using FeaturizeText together with ApplyWordEmbedding. Specifically, it creates the tokens for ApplyWordEmbedding by removing numbers, keeping diacritics, and lowercasing, to match how the fastText model was created. This text cleaning reduces the out-of-vocabulary (OOV) issue in the word embedding. The best settings depend on the dataset, so these options are worth testing on each one.
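
One way to test the cleaning options on a given dataset is to measure the OOV rate directly against the embedding file's vocabulary. The sketch below assumes the standard fastText .vec text layout (an optional "count dimension" header line, then one word followed by its vector per line). The helper names are hypothetical and the tiny in-memory vocabulary is made up for illustration; in practice the reader would be opened on wiki.en.vec.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: true if the line looks like the "count dimension" header.
bool IsHeader(string line)
{
    var parts = line.Split(' ', StringSplitOptions.RemoveEmptyEntries);
    return parts.Length == 2
        && int.TryParse(parts[0], out _)
        && int.TryParse(parts[1], out _);
}

// Hypothetical helper: collects the first token of each line as the vocabulary.
HashSet<string> LoadVocabulary(TextReader reader)
{
    var vocab = new HashSet<string>(StringComparer.Ordinal);
    var line = reader.ReadLine();
    if (line != null && IsHeader(line))
    {
        line = reader.ReadLine(); // skip the header
    }
    while (line != null)
    {
        int space = line.IndexOf(' ');
        if (space > 0)
        {
            vocab.Add(line.Substring(0, space));
        }
        line = reader.ReadLine();
    }
    return vocab;
}

// Hypothetical helper: fraction of tokens that have no entry in the vocabulary.
double OovRate(IEnumerable<string> tokens, ISet<string> vocab)
{
    int total = 0, missing = 0;
    foreach (var token in tokens)
    {
        total++;
        if (!vocab.Contains(token)) missing++;
    }
    return total == 0 ? 0.0 : (double)missing / total;
}

// Made-up three-word vocabulary standing in for the real .vec file.
var sample = "3 2\nthe 0.1 0.2\ncafé 0.3 0.4\nmodel 0.5 0.6";
var vocabulary = LoadVocabulary(new StringReader(sample));

var rawTokens = new[] { "The", "café", "model", "2021" };   // no cleaning
var cleanedTokens = new[] { "the", "café", "model" };       // lowercased, numbers removed

Console.WriteLine($"Raw OOV rate:     {OovRate(rawTokens, vocabulary)}");
Console.WriteLine($"Cleaned OOV rate: {OovRate(cleanedTokens, vocabulary)}");
```

Comparing the two rates for tokens produced under different TextFeaturizingEstimator.Options settings shows which cleaning choices match the embedding's vocabulary best.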

Side note: We should make a sample of FeaturizeText with ApplyWordEmbedding. I wrote the above since I couldn't locate one to link to in this issue.

Additional user report: https://github.com/dotnet/machinelearning/issues/5450#issuecomment-714930905

pree-T commented 2 years ago

I want to work on this. Can anyone help me?

ChristianRazvan commented 7 months ago

Hello! How can we go about using embeddings for other languages with FastTextWikipedia300D? If I use wiki.LangPrefix.vec with a language that isn't in the ML.NET enums, the .Fit() method just never finishes.