dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Request : Apply Lemma / stemming in FeaturizeText options #5281

Open ErwanL08 opened 4 years ago

ErwanL08 commented 4 years ago

Hi. First, thank you for all the work done. I know that FeaturizeText applies NLP preprocessing such as stop-word removal for a specific language: (screenshot)

But is there a way to apply lemmatization / stemming in this function?

antoniovs1029 commented 4 years ago

Hi, @ErwanL08. Unfortunately, there's no option for doing lemmatization or stemming in ML.NET, so I will mark this issue as a feature request so that we can take it into account when planning future features.

In the meantime, there are a couple of options you can explore:

  1. Apply lemmatization/stemming before creating the input DataView. I notice in your screenshot that you're using LoadFromEnumerable<>() to get your data into a DataView. If possible, you can try to lemmatize/stem the strings in your input "Utterance" string field before creating the DataView. I'm not able to recommend any C# library for this, but a quick Google search points to some NLP-related NuGet packages that may have this functionality. I've also found some open-source implementations of basic English stemming in C#, which you might be able to add to your project without installing any NuGet package.

  2. Apply lemmatization/stemming inside a CustomMappingTransformer. A CustomMappingTransformer lets the user define a method that is applied to every row of the input, in a streaming fashion. You can create a function that does lemmatization/stemming (either using your own implementation or another library), and use it inside a CustomMappingTransformer. See more about this transformer in the docs.
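As a rough illustration of option 1, here is a minimal sketch of stemming the utterance strings before they are loaded into a DataView. `NaiveStem` and `StemText` are hypothetical helpers written for this example, and the suffix-stripping rule is deliberately crude; it is a stand-in for whatever stemmer or lemmatizer library you end up choosing, not a real stemming algorithm.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, deliberately naive English suffix stripper.
// A real project would call a proper stemmer/lemmatizer here instead.
static string NaiveStem(string token)
{
    foreach (var suffix in new[] { "ing", "ed", "es", "s" })
    {
        // Only strip when at least three characters of the word remain.
        if (token.Length > suffix.Length + 2 &&
            token.EndsWith(suffix, StringComparison.Ordinal))
        {
            return token.Substring(0, token.Length - suffix.Length);
        }
    }
    return token;
}

// Lower-cases the text, splits it on whitespace and stems each token.
static string StemText(string text) =>
    string.Join(" ", text
        .ToLowerInvariant()
        .Split(new[] { ' ', '\t', '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries)
        .Select(NaiveStem));

var utterances = new List<string> { "She was walking home", "He walked and talks" };

// Pre-process every utterance; the resulting strings are what would be
// placed in the "Utterance" field before calling LoadFromEnumerable.
var stemmedUtterances = utterances.Select(StemText).ToList();

foreach (var line in stemmedUtterances)
{
    Console.WriteLine(line);
}
```

For option 2, the same `StemText` function could presumably be called from the mapping method given to the CustomMappingTransformer, so that stemming happens row by row inside the pipeline rather than up front.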

AniaBerthelot commented 3 years ago

This feature is very important, and I'm impatient to see it in the awesome ML.NET. NLP is essential today, so I hope it will get serious attention.

justinormont commented 3 years ago

I agree, there should be a direct lemmatizer/stemmer.

The default in the FeaturizeText transform uses unigrams (one word) + bigrams (two words) + tricharactergrams (three letter ngram).

The default tricharactergrams give a good part of the gains of a full stemmer.

For example, it will extract the same tricharactergram r|u|n from runner/running/runs. This allows the model to learn the common concept of "run" from all of these, and with the ngrams it maintains the original unstemmed words, allowing the model to also learn running (unigram) and i|n|g (tricharactergram).
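To make the runner/running/runs example concrete, here is a small sketch that extracts character trigrams per word and intersects them. `CharTrigrams` is a hypothetical helper for illustration only; it is a simplification and not how ML.NET's own n-gram extractor is implemented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Returns the distinct three-character substrings of a single word.
static HashSet<string> CharTrigrams(string word)
{
    var grams = new HashSet<string>();
    for (int i = 0; i + 3 <= word.Length; i++)
    {
        grams.Add(word.Substring(i, 3));
    }
    return grams;
}

var relatedWords = new[] { "runner", "running", "runs" };

// Start from the trigrams of the first word and keep only those
// that also occur in every other word.
var sharedTrigrams = new HashSet<string>(CharTrigrams(relatedWords[0]));
foreach (var word in relatedWords.Skip(1))
{
    sharedTrigrams.IntersectWith(CharTrigrams(word));
}

Console.WriteLine(string.Join(", ", sharedTrigrams)); // prints "run"
```

The only trigram common to all three words is `run`, while `running` still contributes its own `ing` trigram, which matches the behaviour described above.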

The word embedding transform can also help. The fastTextWikipedia300D model in particular has a large vocabulary and already has word vectors for runner/running/runs, which will be in similar positions in the embedding space.

All this said, the world is moving towards transformer networks like BERT. There's an external BERT implementation for ML.NET -- https://github.com/GerjanVlot/BERT-ML.NET by @GerjanVlot.

ErwanL08 commented 3 years ago

I totally agree with @AniaBerthelot. If ML.NET had a .NET version of an up-to-date stemmer / lemmatizer, the framework would be so awesome 👍

WhitWaldo commented 2 years ago

I would also like to see lemmatization support built into ML.NET.

michaelgsharp commented 2 years ago

@luisquintanilla for prioritization.

AlbelTec commented 1 year ago

Hi, @luisquintanilla. Is there any chance of getting this feature in the near future? Currently I rely on spaCy (Python) for text data preprocessing, and for my current C# project I really need to stick with ML.NET to avoid dependencies on libraries like python.net. So, for now, my project is on hold until I figure out the best solution. Any insight?

luisquintanilla commented 1 year ago

Hi @AlbelTec

Thanks for your question. Our current NLP solutions are focused on deep learning, Text Classification and Sentence Similarity being a few examples. As a result, there are no immediate plans to work on lemmatization / stemming at this time.

That being said, would you be willing to share your use case and scenario? As we get more feedback on the topic we can think about where this fits in our future roadmap.

In the meantime, I would take a look at Antonio's comment above as a potential workaround.

AlbelTec commented 1 year ago

Hi @luisquintanilla, I followed Antonio's comment and I'm getting fair results regarding lemmatization. Basically, I imported the NuGet package (LemmaSharp) into my project and created a little function that returns the lemmatized text.

// Lemmatization (https://github.com/hc-ro/LemmaGenerator-std)

        // Tokenize(...) is not shown here; it is assumed to be a helper defined elsewhere in the project.
        private string Lemmatization(string text, string language)
        {
            var tokens = Tokenize(text);
            var sb = new System.Text.StringBuilder();

            //string dataLemmaFile = <path to lemmas file> + "\\full7z-multext-" + language + ".lem";
            string dataLemmaFile = <path to lemmas file> + "\\full7z-mlteast-" + language + ".lem";

            // Dispose the file stream once lemmatization is finished.
            using (var stream = File.OpenRead(dataLemmaFile))
            {
                var lemmatizer = new Lemmatizer(stream);
                foreach (var token in tokens)
                {
                    // Append each lemma followed by a space separator.
                    sb.Append(lemmatizer.Lemmatize(token));
                    sb.Append(" ");
                }
            }
            return sb.ToString();
        }

maryamariyan commented 11 months ago

I have been experimenting on this as well.

Seems like this could be a good addition to ML.NET, since there are a lot of upvotes for this feature and it would be convenient for .NET developers to have built-in lemmatization.