dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Using ML.net for NLP/NLU #161

Closed mairaw closed 5 years ago

mairaw commented 6 years ago

@atotalnoob commented on Tue May 15 2018

Hey all,

What would need to be done to make ML.NET do NLP/NLU? We use a Python back end for our current chatbot platform, and we're looking to explore ML.NET because our front end is .NET.

My understanding of what needs to be done is:

Load in a dataset with 2 columns using TextLoader.

SentenceToBeScored | Intent

Then use a TextFeaturizer to change intents into numeric vectors

Then train and predict.

Is it that simple? Or am I missing something?

justinormont commented 6 years ago

Check out the Sentiment example here: https://github.com/dotnet/machinelearning/blob/master/test/Microsoft.ML.Tests/Scenarios/SentimentPredictionTests.cs

Your use should be pretty similar.

atotalnoob commented 6 years ago

I've read that and all of the tutorials and docs. Not really a lot of content.

I was hoping to get a bit more guidance.

Would you go about it this way or differently?

zeahmed commented 6 years ago

You can also have a look at GitHubIssueClassification demo in the following video at 18:00 https://www.youtube.com/watch?v=OhCysVU5RDA

GalOshri commented 6 years ago

@atotalnoob your approach seems like a good starting point. You would want a delimited text file with one column being the text input and the other column being the label (intent).

If you have other information that might be valuable in predicting the intent, and that information is available at prediction time, add it in extra columns. So maybe your data will look like: SentenceToBeScored | WhatUserWasDoingBeforeStartingChat | UserType | ... | Intent.

For your LearningPipeline, apply Dictionarizer to Intent and call it "Label". Apply TextFeaturizer to things like SentenceToBeScored and other text features. For categorical features like UserType, try CategoricalOneHotVectorizer. After transforming your features to numeric vectors, apply ColumnConcatenator so all the features are in a column called "Features". You can then add a learner (e.g. SDCA). This sample might also be useful as it shows how to work with text labels in multiclass problems.

You can try to improve your model's accuracy by modifying the hyperparameters of the transforms and learner (such as modifying the NgramLength of the TextFeaturizer, as in this sample). Your results also depend on things like the scenario (does the user's sentence actually have information about the intent?) and how much data you have available to train the model.
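To make the n-gram length idea concrete, here is a minimal plain C# sketch of word n-gram extraction. It is only an illustration of the general technique, not the TextFeaturizer's actual implementation; the class name `NGramSketch` and the method `WordNGrams` are made up for this example, and it assumes all lengths from 1 up to the maximum are kept.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of ML.NET.
public static class NGramSketch
{
    // Returns every word n-gram of length 1..maxLength, shortest lengths first.
    public static List<string> WordNGrams(string sentence, int maxLength)
    {
        string[] words = sentence.ToLowerInvariant()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        var grams = new List<string>();
        for (int n = 1; n <= maxLength; n++)
        {
            // Slide a window of n words across the sentence.
            for (int start = 0; start + n <= words.Length; start++)
            {
                grams.Add(string.Join(" ", words, start, n));
            }
        }
        return grams;
    }

    public static void Main()
    {
        // With maxLength = 2, "I like pie" yields three unigrams and two bigrams.
        foreach (string gram in WordNGrams("I like pie", 2))
        {
            Console.WriteLine(gram);
        }
    }
}
```

A longer maximum length lets the model see short phrases such as "favorite color" as a single feature, at the cost of a larger and sparser feature vector, which matters when the training set is small.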

atotalnoob commented 6 years ago

@GalOshri Thanks for this, it definitely helps.

What would be the best ML algorithm to use (not just limited to ml.net)? SDCA?

I've been following the guide and your comments, but I can't seem to get it to train. Do you mind pointing out what I am doing incorrectly?

I keep getting an inner exception of InvalidOperationException: Source column 'Label' is required but not found

using System;
using Microsoft.ML.Models;
using Microsoft.ML.Runtime;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

namespace nlpTest
{
    class Program
    {
        const string dataPath = "intents.txt";
        const string testPath = "testData.txt";
        static void Main(string[] args)
        {
            var model = TrainAndPredict();
            Evaluate(model);
        }
        public static PredictionModel<IntentData, IntentPrediction> TrainAndPredict()
        {
            if (!System.IO.File.Exists(dataPath))
            {
                Console.WriteLine("File not found " + dataPath);
            }

            var pipeline = new LearningPipeline();
            pipeline.Add(new TextLoader<IntentData>(dataPath, useHeader: false, separator: "tab"));
            pipeline.Add(new TextFeaturizer(outputColumn:"Features",inputColumns:"text"));
            pipeline.Add(new Dictionarizer("Label"));
            pipeline.Add(new StochasticDualCoordinateAscentClassifier());
            PredictionModel<IntentData, IntentPrediction> model =
                pipeline.Train<IntentData, IntentPrediction>();

            IEnumerable<IntentData> intents = new[]
            {
                new IntentData
                {
                    text = "I like pie",
                    intent = "food"
                },
                new IntentData
                {
                    text = "I like pizza",
                    intent = "food"
                },
                new IntentData
                {
                    text = "my favorite color is blue",
                    intent = "color"
                },
                new IntentData
                {
                    text = "my favorite color is black",
                    intent = "color"
                }
            };

            IEnumerable<IntentPrediction> predictions = model.Predict(intents);
            var intentsAndPredictions = intents.Zip(predictions, (intent, prediction) => (intent, prediction));

            foreach (var item in intentsAndPredictions)
            {
                Console.WriteLine($"Intent: {item.intent.intent} | Prediction: {item.prediction.intent}");
            }
            Console.WriteLine();
            return model;
        }
        public static void Evaluate(PredictionModel<IntentData, IntentPrediction> model)
        {
            var testData = new TextLoader<IntentData>(testPath, useHeader: false, separator: "tab");
            var evaluator = new ClassificationEvaluator();
            ClassificationMetrics metrics = evaluator.Evaluate(model, testData);

            Console.WriteLine();
            Console.WriteLine("PredictionModel quality metrics evaluation");
            Console.WriteLine("------------------------------------------");
            Console.WriteLine($"confusion matrix: {metrics.ConfusionMatrix}");

        }

    }

    public class IntentData
    {
        [Column(ordinal: "0")]
        public string text;
        [Column(ordinal: "1", name: "Label")]
        public string intent;
    }
    public class IntentPrediction
    {
        [ColumnName("PredictedLabel")]
        public string intent;
    }
}

Contents of intents.txt (tab separated):

I like pie  food
My favorite color is blue   color
I like pizza    food
my favorite color is black  color
I like cheese   food
my favorite color is purple color
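Since the loader is told the separator is a tab, it can be worth checking that every line really splits into exactly two tab-separated columns before training. The following is a small plain C# sketch of such a check; `TsvCheck` and `FindBadLines` are hypothetical names for this example and are not part of ML.NET.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of ML.NET.
public static class TsvCheck
{
    // Returns the 1-based numbers of lines that do not split into exactly
    // two non-empty tab-separated columns (text, intent).
    public static List<int> FindBadLines(IEnumerable<string> lines)
    {
        var bad = new List<int>();
        int lineNumber = 0;
        foreach (string line in lines)
        {
            lineNumber++;
            string[] cols = line.Split('\t');
            if (cols.Length != 2 || cols[0].Trim().Length == 0 || cols[1].Trim().Length == 0)
            {
                bad.Add(lineNumber);
            }
        }
        return bad;
    }

    public static void Main()
    {
        var sample = new[]
        {
            "I like pie\tfood",
            "My favorite color is blue    color", // spaces instead of a tab
        };
        foreach (int n in FindBadLines(sample))
        {
            Console.WriteLine($"Line {n} is not two tab-separated columns");
        }
    }
}
```

To check a real file, pass `System.IO.File.ReadLines("intents.txt")` instead of the in-memory sample.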

zeahmed commented 6 years ago

@atotalnoob, can you please update your NuGet package (or your code, if you are building from source)? This problem was fixed in issue #121.

atotalnoob commented 6 years ago

@zeahmed There is no newer Nuget package (see screenshot). I checked prerelease, as well. I'll try with the source and report back.

[Screenshot: nuget-pkg]

atotalnoob commented 6 years ago

Building from source isn't working either.

I built and added references to these assemblies:

Microsoft.ML
Microsoft.ML.Api
Microsoft.ML.Core
Microsoft.ML.CpuMath
Microsoft.ML.Data
Microsoft.ML.Maml
Microsoft.ML.Transforms
Microsoft.ML.UniversalModelFormat

Different error, same line:

An unhandled exception of type 'System.InvalidOperationException' occurred in Microsoft.ML.Data.dll
Entry point 'Trainers.StochasticDualCoordinateAscentClassifier' not found

Stack trace:

   at Microsoft.ML.Runtime.EntryPoints.EntryPointNode..ctor(IHostEnvironment env, ModuleCatalog moduleCatalog, RunContext context, String id, String entryPointName, JObject inputs, JObject outputs, Boolean checkpoint, String stageId, Single cost) in C:\machinelearning\src\Microsoft.ML.Data\EntryPoints\EntryPointNode.cs:line 509
   at Microsoft.ML.Runtime.EntryPoints.EntryPointNode.ValidateNodes(IHostEnvironment env, RunContext context, JArray nodes, ModuleCatalog moduleCatalog) in C:\machinelearning\src\Microsoft.ML.Data\EntryPoints\EntryPointNode.cs:line 893
   at Microsoft.ML.Runtime.EntryPoints.EntryPointGraph..ctor(IHostEnvironment env, ModuleCatalog moduleCatalog, JArray nodes) in C:\machinelearning\src\Microsoft.ML.Data\EntryPoints\EntryPointNode.cs:line 968
   at Microsoft.ML.Runtime.Experiment.Compile() in C:\machinelearning\src\Microsoft.ML\Runtime\Experiment\Experiment.cs:line 56
   at Microsoft.ML.LearningPipeline.Train[TInput,TOutput]() in C:\machinelearning\src\Microsoft.ML\LearningPipeline.cs:line 204
   at nlpTool.Program.TrainAndPredict() in C:\Users\UserProfile\source\repos\nlpTool\nlpTool\Program.cs:line 34
   at nlpTool.Program.Main(String[] args) in C:\Users\UserProfile\source\repos\nlpTool\nlpTool\Program.cs:line 19

zeahmed commented 6 years ago

@atotalnoob, if you are using NuGet v0.1.0, please update your type as follows. Issue #121 was fixed later on. I tested your code, and it works with this change.

public class IntentData
{
    [Column(ordinal: "0")]
    public string text;
    [Column(ordinal: "1", name: "Label")]
    public string Label;
}

atotalnoob commented 6 years ago

Hey,

Still not working. New exception, so progress... Same line, occurs on

System.InvalidOperationException: 'Can't bind the IDataView column 'PredictedLabel' of type 'Key<U4, 0-1>' to field 'intent' of type 'System.String'.'

Can we make these exceptions more wordy? Like idk what the hell the exception is even complaining about.
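One possible reading of that message, offered as an interpretation rather than a confirmed diagnosis: `Key<U4, 0-1>` looks like a 32-bit unsigned key with two possible values, which would match the two intents after the labels are turned into dictionary keys. The prediction column would then hold a number, not the original string, so it cannot be bound to a `string` field. The plain C# sketch below illustrates the general idea of mapping labels to keys and back; `LabelKeySketch` and its methods are hypothetical names and do not reflect ML.NET internals.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not how ML.NET is implemented.
public static class LabelKeySketch
{
    // Collects distinct labels in first-seen order; a label's index is its key.
    public static List<string> BuildVocabulary(IEnumerable<string> labels)
    {
        var vocab = new List<string>();
        foreach (string label in labels)
        {
            if (!vocab.Contains(label))
            {
                vocab.Add(label);
            }
        }
        return vocab;
    }

    // String label -> numeric key (what a trainer would work with).
    public static uint ToKey(List<string> vocab, string label)
    {
        int index = vocab.IndexOf(label);
        if (index < 0)
        {
            throw new ArgumentException("Unknown label: " + label);
        }
        return (uint)index;
    }

    // Numeric key -> original string label (needed before showing a prediction).
    public static string FromKey(List<string> vocab, uint key)
    {
        return vocab[(int)key];
    }

    public static void Main()
    {
        var vocab = BuildVocabulary(new[] { "food", "food", "color", "color" });
        uint key = ToKey(vocab, "color");
        Console.WriteLine($"color -> key {key} -> {FromKey(vocab, key)}");
    }
}
```

Under this reading, the prediction either has to be received as a numeric type or be converted back to the original label text before it is bound to a `string` field.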

using System;
using Microsoft.ML.Models;
using Microsoft.ML.Runtime;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

namespace nlpTool
{
    class Program
    {
        const string dataPath = "intents.txt";
        const string testPath = "testData.txt";
        static void Main(string[] args)
        {
            var model = TrainAndPredict();
            Evaluate(model);
        }
        public static PredictionModel<IntentData, IntentPrediction> TrainAndPredict()
        {
            if (!System.IO.File.Exists(dataPath))
            {
                Console.WriteLine("File not found " + dataPath);
            }

            var pipeline = new LearningPipeline();
            pipeline.Add(new TextLoader<IntentData>(dataPath, useHeader: false, separator: "tab"));
            pipeline.Add(new TextFeaturizer(outputColumn:"Features",inputColumns:"text"));
            pipeline.Add(new Dictionarizer("Label"));
            pipeline.Add(new StochasticDualCoordinateAscentClassifier());
            PredictionModel<IntentData, IntentPrediction> model =
                pipeline.Train<IntentData, IntentPrediction>();

            IEnumerable<IntentData> intents = new[]
            {
                new IntentData
                {
                    text = "I like pie",
                    Label = "food"
                },
                new IntentData
                {
                    text = "I like pizza",
                    Label = "food"
                },
                new IntentData
                {
                    text = "my favorite color is blue",
                    Label = "color"
                },
                new IntentData
                {
                    text = "my favorite color is black",
                    Label = "color"
                }
            };

            IEnumerable<IntentPrediction> predictions = model.Predict(intents);
            var intentsAndPredictions = intents.Zip(predictions, (intent, prediction) => (intent, prediction));

            foreach (var item in intentsAndPredictions)
            {
                Console.WriteLine($"Intent: {item.intent.Label} | Prediction: {item.prediction.intent}");
            }
            Console.WriteLine();
            return model;
        }
        public static void Evaluate(PredictionModel<IntentData, IntentPrediction> model)
        {
            var testData = new TextLoader<IntentData>(testPath, useHeader: false, separator: "tab");
            var evaluator = new ClassificationEvaluator();
            ClassificationMetrics metrics = evaluator.Evaluate(model, testData);

            Console.WriteLine();
            Console.WriteLine("PredictionModel quality metrics evaluation");
            Console.WriteLine("------------------------------------------");
            Console.WriteLine($"confusion matrix: {metrics.ConfusionMatrix}");

        }

    }

    public class IntentData
    {
        [Column(ordinal: "0")]
        public string text;
        [Column(ordinal: "1", name: "Label")]
        public string Label;
    }
    public class IntentPrediction
    {
        [ColumnName("PredictedLabel")]
        public string intent;
    }
}

Sorrien commented 6 years ago

@atotalnoob I used this as a starting point for intent analysis and did my best to make a cleaned-up version of your example: Intent Analysis Example. This works for me with the latest version on NuGet.

sbiaudet commented 6 years ago

@Sorrien, yes, your code runs, but it always returns the food label. Any idea why?

Sorrien commented 6 years ago

The example data probably isn't enough to get a properly trained model. Try adding more examples.

sbiaudet commented 6 years ago

I found it.

In Program.cs, the input is being put into the label property instead of the text property.

Now it's working.

Sorrien commented 6 years ago

Whoops, I'll update that in the repo (updated now)

Oceania2018 commented 6 years ago

Hi, I've made a project to help build a chatbot platform in C#. You're welcome to try the repo.

codemzs commented 5 years ago

Please use NimbusML for training ML.NET models in Python; you can then use ML.NET for scoring in your .NET apps.

GuntaButya commented 5 years ago

Interesting and helpful!

I took your dataset and added to it:

Text, Label
"I like pie", "food"
"I like pizza", "food"
"my favorite color is blue", "color"
"my favorite color is black", "color"
"green is a cool color", "color"
"I am hungry, I want sausage's", "food"

Running this on ML.NET, it worked first time. I saved the data as: Input.csv

Thanks for this!