sfiruch commented 4 years ago

Intro

ML.NET is not yet well equipped for some natural-language processing (NLP) workloads. It already supports basic processing steps (tokenization, stop word removal, ...) and sentiment analysis, but other, higher-level workloads are not yet supported.

Feature request

Pre-trained document embeddings, like BERT or GloVe, are useful for downstream tasks.
It'd be even better if there were an easy-to-use API to fine-tune or train custom models on custom datasets.

Use case

In our specific use case, we develop document classifiers. We only have a limited set of labeled documents to train with. Our plan is to use pretrained or custom-trained document embeddings and train a simple classifier on top, using the labeled documents.

Workarounds

There is already a project that runs BERT as ONNX on top of ML.NET, see https://github.com/GerjanVlot/BERT-ML.NET. I'd like to see this become an official part of ML.NET, with a good API, properly maintained and updated.

Outlook

These models are building blocks for other features, like entity recognition (#630). Ideally ML.NET would support many more NLP tasks, as listed in https://github.com/microsoft/nlp-recipes#content. Generally, we notice an uptick in NLP-related project inquiries.

jwood803 commented 4 years ago

They do have support for word embeddings with the ApplyWordEmbedding method. It can use a predefined model or you can use a custom model with it.

Here's the pipeline from the sample:

var textPipeline = mlContext.Transforms.Text.NormalizeText("Text")
    .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Text"))
    .Append(mlContext.Transforms.Text.ApplyWordEmbedding("Features", "Tokens",
        WordEmbeddingEstimator.PretrainedModelKind.SentimentSpecificWordEmbedding));

Hope that helps!
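
For the use case in the description (a simple classifier on top of pretrained embeddings), a minimal sketch could chain a trainer onto that pipeline. The `Document` class, the two in-memory rows, and the choice of `SdcaMaximumEntropy` are assumptions for illustration and are not taken from the sample. Note that these embeddings are context-free, so this is only a stand-in for BERT-style embeddings.

```csharp
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

// Hypothetical input schema: one labeled document per row.
public class Document
{
    public string Label { get; set; }
    public string Text { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Placeholder data; a real project would load its own labeled documents.
        var docs = new[]
        {
            new Document { Label = "invoice", Text = "Please pay the attached invoice by Friday." },
            new Document { Label = "complaint", Text = "The product arrived broken and late." },
        };
        IDataView data = mlContext.Data.LoadFromEnumerable(docs);

        var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label")
            // Same text steps as the sample pipeline above.
            .Append(mlContext.Transforms.Text.NormalizeText("Text"))
            .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Text"))
            .Append(mlContext.Transforms.Text.ApplyWordEmbedding("Features", "Tokens",
                WordEmbeddingEstimator.PretrainedModelKind.SentimentSpecificWordEmbedding))
            // Assumed trainer: any multiclass trainer that takes a float vector would do.
            .Append(mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy("Label", "Features"))
            // Convert the predicted key back to the original label string.
            .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

        // Fitting resolves the pretrained embedding model and trains the classifier.
        var model = pipeline.Fit(data);
    }
}
```

The embedding step pools the per-word vectors into one fixed-length vector per document, which is why a plain linear trainer can consume the `Features` column directly.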

Lynx1820 commented 4 years ago

Updating the title. As @jwood803 mentioned above, we already support pretrained word embeddings. You can also consume a contextualized word embedding model for prediction with ONNX. However, we currently don't support training for these models.

sfiruch commented 4 years ago

@Lynx1820 I think changing the title to just "training" does not capture my feature request. The pretrained models are all context-free word embeddings. Such embeddings are no longer state of the art.

Lynx1820 commented 4 years ago

Thanks for the clarification. I wrote "training" because inferencing is already possible by importing an ONNX model, but I understand your ask for a better API.

aaidarov commented 4 years ago

Hi @Lynx1820! Has there been any progress on this issue? Any updates?

On another note, I am experimenting with BERT on ML.NET. Where can I find example code for end-to-end ML with BERT as part of the preprocessing for classification tasks?

Lynx1820 commented 4 years ago

Unfortunately, there hasn't been any progress on this request. You can take a look at the workaround repo provided in the description, as we don't have an example using BERT. Note that the pipeline would look very similar to those for other ONNX inferencing models. You would just have a different ONNX model.
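
As a rough sketch of what such a pipeline could look like, assuming the Microsoft.ML.OnnxTransformer package: the file name `bert.onnx`, the tensor names `input_ids`, `attention_mask` and `last_hidden_state`, and the sequence length of 128 are all placeholders that depend on how the model was exported. Tokenizing text into ids is assumed to happen outside this pipeline.

```csharp
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical input: token ids from a BERT tokenizer that runs outside ML.NET.
public class BertInput
{
    [VectorType(1, 128)]
    [ColumnName("input_ids")]
    public long[] InputIds { get; set; }

    [VectorType(1, 128)]
    [ColumnName("attention_mask")]
    public long[] AttentionMask { get; set; }
}

// Hypothetical output: the flattened contextual embeddings returned by the model.
public class BertOutput
{
    [ColumnName("last_hidden_state")]
    public float[] LastHiddenState { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext();

        // Column names must match the input and output names of the exported ONNX graph.
        var pipeline = mlContext.Transforms.ApplyOnnxModel(
            outputColumnNames: new[] { "last_hidden_state" },
            inputColumnNames: new[] { "input_ids", "attention_mask" },
            modelFile: "bert.onnx");

        // Nothing is trained here; fitting on an empty view only binds the schema.
        var emptyData = mlContext.Data.LoadFromEnumerable(new List<BertInput>());
        var model = pipeline.Fit(emptyData);

        var engine = mlContext.Model.CreatePredictionEngine<BertInput, BertOutput>(model);
        // engine.Predict(...) would then return embeddings for one tokenized document.
    }
}
```

Some exported BERT models expect additional inputs such as token type ids, so the column lists may need adjusting to the actual graph.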

Gigabyte0x1337 commented 3 years ago

@aaidarov Take a look at https://github.com/GerjanVlot/BERT-ML.NET. It might give you some ideas.

GeorgeS2019 commented 3 years ago

@GerjanVlot We are attempting to develop a .NET ONNX viewer/editor that will make integrating models such as BERT ONNX into .NET more intuitive. We need your feedback!

luisquintanilla commented 2 years ago

Hi all,

Thanks for the discussion and feedback. We've recently implemented the Text Classification API, which might satisfy some of the requirements listed here. The underlying model is based on a BERT variant.

For more information, see the blog post and sample notebook.

Closing this issue for now.

dotnet / machinelearning

Feature request: Built-in support for ELMO/BERT embeddings #5159
