dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Named Entity Recognizer #630

Open MaxAkbar opened 6 years ago

MaxAkbar commented 6 years ago

Hello ML.NET,

Is there any way I can use ML.NET to create named entities?

Thanks, -Max

MaxAkbar commented 2 years ago

NER is named entity recognition, meaning we extract entities from some text. If you want intent, that is something else. NLP (natural language processing) aims to understand human language, whether read, written, or spoken. That said, NLP groups together smaller tasks such as NER, intent, sentiment (something ML.NET does well now), relations, and more. This topic is specific to NER.

Azure has LUIS services, which I have used to automate customer service requests and direct users appropriately. The Azure LUIS service handles intent (interpreting user goals) extremely well.

ADD-eNavarro commented 2 years ago

+1 to NER in ML.NET, and also a tool to apply custom labels to texts would be awesome, allowing the whole process to be done within one tool.

For our use case, we need to use quite a few custom labels. Here's an example: image

luisquintanilla commented 2 years ago

Thanks all for the replies. As far as tagging tools, what do you use today? I see AWS Sagemaker and LUIS were mentioned.

DierkDroth commented 2 years ago

Tagging: AWS Sagemaker GroundTruth
Running: AWS Comprehend Custom Entity Recognition

paulirwin commented 2 years ago

+1, wanted to chime in with another use case. I'm looking to detect work item numbers in text. These can be in GitHub-like #-format, or JIRA-like PROJ-1234 format, but often they are formatted in different ways as they are input by humans.

Anonymized examples and the expected results (scores not shown):

Examples of false positives that I'd hope would have lower scores:

As you can imagine, a simple regex approach gets a good ways there, but doesn't have scores. Simple number recognition isn't really an ideal solution either. Would be nice to have the ability to train custom models for this somehow. Hope this helps your planning, even if this is not a supported use case.
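
For reference, the regex baseline mentioned above might look something like the following C# sketch. The class name, the two patterns, and the sample text in the usage note are assumptions made for illustration; they are not code or data from this thread. The sketch also shows the limitation described above: a match is all-or-nothing, with no score attached.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WorkItemFinder
{
    // Hypothetical patterns (assumptions, not taken from the thread):
    //   "#123"      GitHub-like; the '#' must not follow a word character or another '#'
    //   "PROJ-1234" JIRA-like; an upper-case project key, a hyphen, then digits
    private static readonly Regex Pattern = new Regex(
        @"(?<![\w#])#\d+\b|\b[A-Z][A-Z0-9]+-\d+\b",
        RegexOptions.Compiled);

    // Returns every substring that looks like a work item reference.
    // Each result is a plain yes/no match; there is no confidence score.
    public static List<string> Find(string text)
    {
        var results = new List<string>();
        foreach (Match match in Pattern.Matches(text))
        {
            results.Add(match.Value);
        }
        return results;
    }
}
```

With the made-up input `"Fixes #42 and PROJ-1234, see abc#7"`, `WorkItemFinder.Find` returns `#42` and `PROJ-1234`; `abc#7` is skipped because the `#` directly follows a word character. Human-formatted variants outside these two patterns would be missed entirely, which is where a trainable model with scores would help.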

MaxAkbar commented 2 years ago

@paulirwin From what I understand, NER doesn't do that. I suppose you could use a regex during entity selection, and I believe the Stanford NER has this, but I haven't dug too deep. Have a look at Recognizers-Text; maybe that will help: https://github.com/microsoft/Recognizers-Text

paulirwin commented 2 years ago

@MaxAkbar The 7-class Stanford NER detects money, percentages, and dates/times, and the Azure Text Analytics API can detect quantities, phone numbers, and IP addresses as entities, so work items (or more abstractly, alphanumeric identifiers) don't seem all that far-fetched for NER. But I just wanted to throw my use case into the hat in case it was interesting 😄

MaxAkbar commented 2 years ago

@paulirwin Actually, Stanford NER does allow custom entities to be trained. I have a link somewhere in my earlier posts. I have done some testing with it, but the documentation is not easy to follow :). Maybe they have something a little better now.

What I wanted to say about the Text Recognizers is that you can follow their example and create your own, modeling after the same pattern.

You are right to share as many use cases as possible, so we can see whether this is something that can be done.

torronen commented 2 years ago

I agree that for ML NER to be most useful it should be trainable. If there were only fixed-type extractors (money, percentage, email address...), I believe there are better (deterministic) ways to do it. For example, IP addresses and emails can and should be extracted with a regex that also validates the format. That said, I do not see anything wrong with having more entity recognizers which are not based on machine learning. Anything helps, even a tested library for IP address search with regex.
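
To make the deterministic approach concrete, here is a minimal C# sketch for IPv4 addresses only. The class name and the exact validation rules (each octet at most 255, no leading zeros) are assumptions chosen for illustration, not a tested library. The regex finds candidates and a second step validates them.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class Ipv4Extractor
{
    // Candidate: four dot-separated groups of 1-3 digits that are not
    // part of a longer dotted number such as "1.2.3.4.5".
    private static readonly Regex Candidate = new Regex(
        @"(?<!\d)(?<!\d\.)\d{1,3}(?:\.\d{1,3}){3}(?!\.?\d)");

    // Returns the candidates that also pass octet validation.
    public static List<string> Find(string text)
    {
        var found = new List<string>();
        foreach (Match match in Candidate.Matches(text))
        {
            if (IsValid(match.Value))
            {
                found.Add(match.Value);
            }
        }
        return found;
    }

    // Assumed rules: every octet is 0-255 and has no leading zero.
    private static bool IsValid(string candidate)
    {
        foreach (string octet in candidate.Split('.'))
        {
            if (octet.Length > 1 && octet[0] == '0') return false;
            if (int.Parse(octet) > 255) return false;
        }
        return true;
    }
}
```

For the made-up input `"Ping 192.168.1.10 or 999.1.1.1."`, `Ipv4Extractor.Find` returns only `192.168.1.10`: the second candidate matches the pattern but fails validation because 999 is greater than 255.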

My No. 1 use case is places in general, including inflections in different languages. Especially places which I cannot detect based on a list of words, like countries. For my use case, "places" means something to search for in map apps (cities, streets, regions, landmarks...).

Speculation (not my actual use cases)

FranzScharf commented 1 year ago

I would use NER to extract reservation data from unformatted free text, such as the body of an email: time, date, number of people, name, room, phone number, extra wishes, comments, and so on. It has become pretty quiet around this topic here. Is it even still on the agenda?

Leftyx commented 1 year ago

Four years after the first post, we are still waiting for NER in ML.NET. Any news? If anyone is interested, I put together some code to use Hugging Face models, exported to ONNX and used with ML.NET. It seems to work quite well.

luisquintanilla commented 1 year ago

NER support was added to ML.NET as part of this PR.

https://github.com/dotnet/machinelearning/pull/6760

Here is a sample showing how to use the API

https://github.com/dotnet/machinelearning/blob/7fe293da31a05b70dddf4eba439f7bc23e3016c6/test/Microsoft.ML.Tests/NerTests.cs#L33

ADD-eNavarro commented 1 year ago

Good morning @luisquintanilla

I haven't found any docs / tutorials / samples about NER in ML.NET. I guess that's part of the roadmap when it states that documentation is to be updated, but then a couple of questions arise.

The sample you point to checks the detection of person / city / country entities. I'm assuming one could use custom entities, the ones that fit the dataset used to train the model, since in the background it's a MultiClassClassifier, right? Any idea when there will be docs (and, ideally, samples with custom labels) for NER?

Thank you.

luisquintanilla commented 1 year ago

Hi @ADD-eNavarro,

We don't have a timeline on the full tutorial / sample.

You are right though that you should be able to use custom entities as part of your dataset. Let us know if you run into issues though.

MaxAkbar commented 1 year ago

@luisquintanilla Thank you for this 🙏. I have a few questions. Is there a limit to the length of the sentence? Also, how many sentences should we provide to help the NER extract entities? In your unit test, you have the same sentence, should it not be different?

luisquintanilla commented 1 year ago

Is there a limit to the length of the sentence - 512 tokens

how many sentences should we provide to help the NER extract entities - Good question. There's no hard cutoff. The more examples the better, but also make sure those samples are representative.

In your unit test, you have the same sentence, should it not be different? - Yes. That should be fixed. Thanks for catching that.

ADD-eNavarro commented 1 year ago

Hi @luisquintanilla

I'm trying to replicate the test you mention, but even when I use the preview.23266.6 version of Microsoft.ML, I get three errors:

* The name 'ML' does not exist in the current context -> wherever ML is used
* 'SchemaShape' does not contain a definition for 'Create' -> on line 60
* The name 'TestEstimatorCore' does not exist in the current context -> on line 71

I'm guessing it's because I haven't been able to find the Microsoft.ML.RunTests package; at least the last two errors look like it. I can't get my mind around the first one. How can I run NER without the tests package, as if it were for production?

(PS: I'm sick today, I may be missing an obvious point. If that's the case, please point it out for me, and my apologies)

florianA1 commented 1 year ago

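Hello, I've tried the new preview version, but the error below is raised.

* Sentence = "Alice and Bob live in France" : works

* Sentence = "Alice and Bob live in Liechtenstein" : doesn't work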

Am I missing something ?

Exception :

Unhandled exception. System.Runtime.InteropServices.ExternalException (0x80004005): Expected input batch_size (20) to match target batch_size (16).
Exception raised from nll_loss_nd_symint at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\LossNLL.cpp:682 (most recent call first):
00007FF8342FD24200007FF8342FD1E0 c10.dll!c10::Error::Error [<unknown file> @ <unknown line number>]
00007FF8342C54A500007FF8342C5430 c10.dll!c10::ValueError::ValueError [<unknown file> @ <unknown line number>]
00007FF8220CAE4C00007FF8220CA260 torch_cpu.dll!at::native::nll_loss_nd_symint [<unknown file> @ <unknown line number>]
00007FF822F5EC5600007FF822F5AAC0 torch_cpu.dll!at::compositeimplicitautograd::where [<unknown file> @ <unknown line number>]
00007FF822F4611900007FF822F00E20 torch_cpu.dll!at::compositeimplicitautograd::broadcast_to_symint [<unknown file> @ <unknown line number>]
00007FF82275D45300007FF82275D270 torch_cpu.dll!at::_ops::nll_loss_nd::call [<unknown file> @ <unknown line number>]
00007FF8220C995600007FF8220C95E0 torch_cpu.dll!at::native::cross_entropy_loss_symint [<unknown file> @ <unknown line number>]
00007FF822F5CDA500007FF822F5AAC0 torch_cpu.dll!at::compositeimplicitautograd::where [<unknown file> @ <unknown line number>]
00007FF822F461BB00007FF822F00E20 torch_cpu.dll!at::compositeimplicitautograd::broadcast_to_symint [<unknown file> @ <unknown line number>]
00007FF8229B3DDC00007FF8229B3B70 torch_cpu.dll!at::_ops::cross_entropy_loss::call [<unknown file> @ <unknown line number>]
00007FF8336FC6B600007FF8336FC4F0 LibTorchSharp.DLL!THSNN_cross_entropy [<unknown file> @ <unknown line number>]
00007FF7E432FA25 <unknown symbol address> !<unknown symbol> [<unknown file> @ <unknown line number>]

   at TorchSharp.torch.CheckForErrors()
   at TorchSharp.Modules.CrossEntropyLoss.forward(Tensor input, Tensor target)
   at Microsoft.ML.TorchSharp.NasBert.NasBertTrainer`2.NasBertTrainerBase.RunModelAndBackPropagate(List`1& inputTensors, Tensor& targetsTensor)
   at Microsoft.ML.TorchSharp.TorchSharpBaseTrainer`2.TrainerBase.TrainStep(IHost host, DataViewRowCursor cursor, ValueGetter`1 labelGetter, List`1& inputTensors, List`1& targets)
   at Microsoft.ML.TorchSharp.TorchSharpBaseTrainer`2.TrainerBase.Train(IHost host, IDataView input)
   at Microsoft.ML.TorchSharp.TorchSharpBaseTrainer`2.Fit(IDataView input)
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)

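Dependencies and code :

* Microsoft.ML => 3.0.0-preview.23511.1

* Microsoft.ML.TorchSharp => 0.21.0-preview.23511.1

* TorchSharp-cpu => 0.100.4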

MLContext mLContext = new();
var labels = mLContext.Data.LoadFromEnumerable(
    new[] { new Label { Key = "PERSON" }, new Label { Key = "CITY" }, new Label { Key = "COUNTRY" } });

var dataView = mLContext.Data.LoadFromEnumerable(
    new List<TestSingleSentenceData>(new TestSingleSentenceData[] {
        new TestSingleSentenceData()
        {
            Sentence = "Alice and Bob live in Liechtenstein",
            //Sentence = "Alice and Bob live in France",
            Label = new string[]{"PERSON", "0", "PERSON", "0", "0", "COUNTRY"}
        },
        new TestSingleSentenceData()
        {
            Sentence = "Alice and Bob live in the USA",
            Label = new string[]{"PERSON", "0", "PERSON", "0", "0", "0", "COUNTRY"}
        },
    }));

var chain = new EstimatorChain<ITransformer>();
var estimator = chain.Append(mLContext.Transforms.Conversion.MapValueToKey("Label", keyData: labels))
   .Append(mLContext.MulticlassClassification.Trainers.NameEntityRecognition(outputColumnName: "LabelsOut"))
   .Append(mLContext.Transforms.Conversion.MapKeyToValue("LabelsOut"));

var transformer = estimator.Fit(dataView);

Thanks, Florian

lahbton commented 1 year ago

Hi everyone

I've been keeping an eye on this feature for a while and have created a very simple test project that uses the code from the TestSimpleNer method in the NerTests class to see if it would run through OK, but I am seeing an error at the point of creating the transformer object:

Exception raised at: var transformer = estimator.Fit(dataView);
Error message: Field not found: 'TorchSharp.torch.CUDA'

The error seems to indicate that a field definition is missing, but the name appears to be a TorchSharp component?

I only want to use CPU (and not GPU), so I have the following 3 packages:

* Microsoft.ML (3.0.0-preview.23511.1)
* Microsoft.ML.TorchSharp (0.21.0-preview.23511.1)
* TorchSharp-cpu (0.101.1)

I posted on StackOverflow but I'm assuming that there is little knowledge out there because of how new this feature is: https://stackoverflow.com/questions/77440001/cuda-issue-with-ner-named-entity-recognition-for-ml-predictions

Any help would be greatly appreciated.

Leftyx commented 1 year ago

I have tried to reply with the best of my knowledge. You might not like the answer though :-)

lahbton commented 1 year ago

Thank you SO much for the response. I'll take a further look and see if I can get the demo working with your tips.

I've had several people look at this and have posted in several places, but it's been silent up to now. It seems as though this new ML feature is too new (with very little documentation) for there to be much knowledge out there.

Thanks again!

anrouxel commented 1 year ago

Hello @lahbton and @Leftyx, I've managed to get your example to work after turning it into a console app. The problem came from the version of "libtorch-cpu-win-x64" (or whichever libtorch package you were using). Microsoft.ML 3.0.0-preview.23511.1 and Microsoft.ML.TorchSharp 0.21.0-preview.23511.1 use version 1.13.0.1 of "libtorch-cpu-win-x64" (or the equivalent libtorch package for your platform).

test/Microsoft.ML.Tests/Microsoft.ML.Tests.csproj

  <ItemGroup Condition="'$(TargetArchitecture)' == 'x64'">
    <PackageReference Include="libtorch-cpu-win-x64" Version="$(LibTorchVersion)" Condition="$([MSBuild]::IsOSPlatform('Windows')) AND '$(TargetArchitecture)' == 'x64'" />
      <!-- <PackageReference Include="TorchSharp-cuda-windows" Version="0.99.5" Condition="$([MSBuild]::IsOSPlatform('Windows'))" />   -->
    <PackageReference Include="libtorch-cpu-linux-x64" Version="$(LibTorchVersion)" Condition="$([MSBuild]::IsOSPlatform('Linux')) AND '$(TargetArchitecture)' == 'x64'" />
    <PackageReference Include="libtorch-cpu-osx-x64" Version="$(LibTorchVersion)" Condition="$([MSBuild]::IsOSPlatform('OSX')) AND '$(TargetArchitecture)' == 'x64'" />
  </ItemGroup>

eng/Versions.props

<LibTorchVersion>1.13.0.1</LibTorchVersion>

Here is my code : Program.cs

using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.TorchSharp;

public class Program
{
    // Main method
    public static void Main(string[] args)
    {
        try
        {
            var context = new MLContext();
            context.FallbackToCpu = true;
            context.GpuDeviceId = null;

            var labels = context.Data.LoadFromEnumerable(
            new[] {
                new Label { Key = "PERSON" },
                new Label { Key = "CITY" },
                new Label { Key = "COUNTRY"  }
            });

            var dataView = context.Data.LoadFromEnumerable(
                new List<TestSingleSentenceData>(new TestSingleSentenceData[] {
                    new TestSingleSentenceData()
                    {
                        Sentence = "Alice and Bob live in the USA",
                        Label = new string[]{"PERSON", "0", "PERSON", "0", "0", "0", "COUNTRY"}
                    },
                     new TestSingleSentenceData()
                     {
                        Sentence = "Alice and Bob live in the USA",
                        Label = new string[]{"PERSON", "0", "PERSON", "0", "0", "0", "COUNTRY"}
                     },
                }));
            var chain = new EstimatorChain<ITransformer>();
            var estimator = chain.Append(context.Transforms.Conversion.MapValueToKey("Label", keyData: labels))
               .Append(context.MulticlassClassification.Trainers.NameEntityRecognition(outputColumnName: "outputColumn"))
               .Append(context.Transforms.Conversion.MapKeyToValue("outputColumn"));

            var transformer = estimator.Fit(dataView);
            transformer.Dispose();

            Console.WriteLine("Success!");
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Error: {ex.Message}");
        }
    }

    private class Label
    {
        public string Key { get; set; }
    }

    private class TestSingleSentenceData
    {
        public string Sentence;
        public string[] Label;
    }
}

ConsoleApp1.csproj

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net7.0</TargetFramework>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
  </PropertyGroup>

  <ItemGroup>
      <PackageReference Include="libtorch-cpu-win-x64" Version="1.13.0.1" />
      <PackageReference Include="Microsoft.ML" Version="3.0.0-preview.23511.1" />
      <PackageReference Include="Microsoft.ML.TorchSharp" Version="0.21.0-preview.23511.1" />
  </ItemGroup>

</Project>

Best Regards,

anrouxel

iuribrindeiro commented 1 year ago

@anrouxel your solution worked for me, thx!

But now I'm getting the same issue as @florianA1:

System.Runtime.InteropServices.ExternalException: 'Expected input batch_size (9) to match target batch_size (8).
Exception raised from nll_loss_nd at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\LossNLL.cpp:652 (most recent call first):

My trainingData:

[{Sentence = "Alice and Bob live in the ACCO"; Label = ["0"; "0"; "0"; "0"; "0"; "0"; "TODTO";] |> List.toArray}]

When I set the Sentence to "Alice and Bob live in the ACC", it works just fine.

I'm new to the whole machine learning thing, so I'm imagining that we have to transform the Sentences with something before giving it to the model to train? Or do you think this can be a bug?

Leftyx commented 1 year ago

anrouxel, your solution works great. Good work there 😃

anrouxel commented 1 year ago

Hello @Leftyx, I don't know if I can ask the question here. But I'd like to know how to export a "NameEntityRecognition" model to ONNX.

Best Regards,

anrouxel

Leftyx commented 1 year ago

Hello, I've tried the new preview version, but the error below is raised.

* Sentence = "Alice and Bob live in France" : works

* Sentence = "Alice and Bob live in Liechtenstein" : doesn't work

Dependencies and code :

* Microsoft.ML => 3.0.0-preview.23511.1

* Microsoft.ML.TorchSharp => 0.21.0-preview.23511.1

* TorchSharp-cpu => 0.100.4

florianA1, the issue you are having is related to how words are tokenized by the EnglishRoberta tokenizer, the one used here. The word Liechtenstein is tokenized into Lie, ch, ten, stein:

image

so your labels should be defined this way:

{
    Sentence = "Alice and Bob live in Liechtenstein",
    Label = new string[]{"PERSON", "0", "PERSON", "0", "0", "COUNTRY", "COUNTRY", "COUNTRY", "COUNTRY" }
},

4 tokens for the word.
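
To make the point above concrete, here is a minimal sketch (plain C#, not ML.NET code) that repeats each word-level label once per sub-word token. The `LabelExpander` class and the `tokensPerWord` counts are hypothetical names introduced for illustration; the counts would have to come from inspecting the tokenizer output, and the usage below assumes every word except Liechtenstein maps to a single token.

```csharp
using System;
using System.Collections.Generic;

public static class LabelExpander
{
    // Repeats each word-level label once per sub-word token of that word.
    // tokensPerWord[i] is how many tokens the tokenizer produced for word i
    // (assumed to be known already; obtaining the counts is not shown here).
    public static string[] Expand(
        IReadOnlyList<string> wordLabels,
        IReadOnlyList<int> tokensPerWord)
    {
        if (wordLabels.Count != tokensPerWord.Count)
            throw new ArgumentException("Need exactly one token count per word label.");

        var tokenLabels = new List<string>();
        for (int word = 0; word < wordLabels.Count; word++)
        {
            for (int token = 0; token < tokensPerWord[word]; token++)
                tokenLabels.Add(wordLabels[word]);
        }
        return tokenLabels.ToArray();
    }
}
```

Calling `LabelExpander.Expand(new[] { "PERSON", "0", "PERSON", "0", "0", "COUNTRY" }, new[] { 1, 1, 1, 1, 1, 4 })` yields nine labels, with "COUNTRY" repeated four times, matching the corrected array above.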

Leftyx commented 1 year ago

Hello @Leftyx, I don't know if I can ask the question here. But I'd like to know how to export a "NameEntityRecognition" model to ONNX.

Best Regards,

anrouxel

anrouxel, I don't think you can. If you check the NerTrainer it states: Exportable to ONNX | No

image

anrouxel commented 1 year ago

Thanks @Leftyx, I guess "NameEntityRecognition" isn't compatible with the French language. I ask because I'm working on a student project to create a healthcare application to store prescriptions. To store prescriptions, I need to retrieve the name of the drug, the quantity and other information after extracting the data using OCR. The OCR part is functional, but my problem is the NER of ML.NET in French which blocks.

Best regards

Leftyx commented 1 year ago

The OCR part is functional, but my problem is the NER of ML.NET in French which blocks.

@anrouxel, that's interesting. I am working on something similar :smiley:, also in French and also from data extracted with OCR. We have a few blockers on that side too. We might be able to share experience. Going back to your problem: it seems ML.NET is using an EnglishRoberta tokenizer, which loads a dictionary and vocabulary embedded in Microsoft.ML.TorchSharp:

image

image

If there is a way to contact you we could have a private conversation and share experience, if that's ok with you.

iuribrindeiro commented 12 months ago

@Leftyx I'm a bit confused with your last response regarding the EnglishRoberta usage for this model...

I have a dataset of sentences and like @florianA1, I also thought that the data shape to train the model should simply be: For every word in a sentence, replace with "0" the words that are not your targets, and with "{TARGETLABEL}" the words that are your targets. Seems like this is not true. How would I shape the training data to follow the EnglishRoberta model?

Maybe this is not the correct place to ask this, but maybe I'm missing something on how to transform data?

qwertycho commented 12 months ago

@Leftyx Where did you find the Tokenizer GetInstance? I can't find where the text is being tokenized or where the tokenizer is created, since GetInstance does not exist in TokenizerExtensions. I'm using Microsoft.ML.TorchSharp 0.21.0-preview.23511.1 and Microsoft.ML 3.0.0-preview.23511.1.

Screenshot 2023-11-15 121217

iuribrindeiro commented 12 months ago

@Leftyx Where did you find the Tokenizer GetInstance? I can't find where the text is being tokenized or where the tokenizer is created, since GetInstance does not exist in TokenizerExtensions. I'm using Microsoft.ML.TorchSharp 0.21.0-preview.23511.1 and Microsoft.ML 3.0.0-preview.23511.1.

Screenshot 2023-11-15 121217

https://github.com/dotnet/machinelearning/blob/main/src/Microsoft.ML.TorchSharp/Extensions/TokenizerExtensions.cs

iuribrindeiro commented 12 months ago

@Leftyx I'm a bit confused with your last response regarding the EnglishRoberta usage for this model...

I have a dataset of sentences and like @florianA1, I also thought that the data shape to train the model should simply be: For every word in a sentence, replace with "0" the words that are not your targets, and with "{TARGETLABEL}" the words that are your targets. Seems like this is not true. How would I shape the training data to follow the EnglishRoberta model?

Maybe this is not the correct place to ask this, but maybe I'm missing something on how to transform data?

For those who are also wondering how to map your training data to look like the Test Case mentioned by @luisquintanilla here: https://github.com/dotnet/machinelearning/issues/630#issuecomment-1742221885

What I did was:

  1. Instantiate a new EnglishRoberta class with the 3 files @Leftyx mentioned here: https://github.com/dotnet/machinelearning/issues/630#issuecomment-1806867872. You can also download these from this repo: https://github.com/dotnet/machinelearning/tree/7fe293da31a05b70dddf4eba439f7bc23e3016c6/src/Microsoft.ML.TorchSharp/Resources
  2. Instantiate a Tokenizer using the EnglishRoberta instance as you can see here: https://github.com/dotnet/machinelearning/blob/main/src/Microsoft.ML.TorchSharp/Extensions/TokenizerExtensions.cs#L34
  3. Call the .Encode method from the instantiated Tokenizer.
  4. Look for the words you want to label in the Token.Splits and get the offset range of each one.
  5. Map every entry in Token.Offsets that falls inside that range to your label, and map all the others to "0".

With that, you should have the "Sentence" and the list of "Labels" respective to it according to the EnglishRoberta tokenizer.

There is probably a better way of mapping the training data but I just don't know. Please feel free to correct me with a better solution.
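
Steps 4 and 5 above can be sketched in plain C# without touching the tokenizer API. Everything here is an assumption made for illustration: the `OffsetLabeler` class and its methods are hypothetical, the token offsets are expected as (Start, End) character positions with End exclusive, and they are assumed not to include leading whitespace. In a real run they would come from the tokenizer's Encode result.

```csharp
using System;
using System.Collections.Generic;

public static class OffsetLabeler
{
    // Locates the first occurrence of a word and returns its character range.
    // Repeated words would need a search start index; omitted for brevity.
    public static (int Start, int End, string Label) Find(
        string sentence, string word, string label)
    {
        int start = sentence.IndexOf(word, StringComparison.Ordinal);
        if (start < 0)
            throw new ArgumentException($"'{word}' does not occur in the sentence.");
        return (start, start + word.Length, label);
    }

    // Gives each token the label of the entity range that fully contains it,
    // or the outside label ("0" in this thread) when no range contains it.
    public static string[] Assign(
        IReadOnlyList<(int Start, int End)> tokenSpans,
        IReadOnlyList<(int Start, int End, string Label)> entitySpans,
        string outsideLabel = "0")
    {
        var labels = new string[tokenSpans.Count];
        for (int i = 0; i < tokenSpans.Count; i++)
        {
            labels[i] = outsideLabel;
            foreach (var entity in entitySpans)
            {
                bool inside = tokenSpans[i].Start >= entity.Start
                              && tokenSpans[i].End <= entity.End;
                if (inside)
                {
                    labels[i] = entity.Label;
                    break;
                }
            }
        }
        return labels;
    }
}
```

For "Alice and Bob live in Liechtenstein", with token spans (0,5), (6,9), (10,13), (14,18), (19,21), (22,25), (25,27), (27,30), (30,35) and entity ranges found for Alice, Bob and Liechtenstein, `Assign` returns PERSON, 0, PERSON, 0, 0 followed by four COUNTRY labels. If the tokenizer reports offsets that include the leading space of a word, an overlap test would be needed in place of the containment test.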

lahbton commented 12 months ago

I am also having the same issues as above.

The EnglishRoberta model is generating seemingly random tokens with text cut off across multiple token objects. This makes it very difficult to process the Label value for any matches.

To confirm, is the issue caused by having a difference between the number of Labels ('0' or custom category name) in a token from the trained data compared to the same that is generated by the EnglishRoberta model? Would a straight mapping actually work? For example, if 'Liechtenstein' is split into 'Lie', 'ch', 'ten' and 'stein', would setting these Labels to 'COUNTRY' cause all 4 split words to be seen as countries?

Is it possible to either:

Leftyx commented 11 months ago

@iuribrindeiro Sorry I couldn't reply before. But you are right, that is what happens. The only way to find out how it works is to debug. There is no real documentation and I think the NER integration is not really ready for prime-time.

Leftyx commented 11 months ago

@luisquintanilla I can see you have just released ML.NET 3.0 and NER is part of the package. Any chance to have an example and some documentation on how to use it ?

maryamariyan commented 11 months ago

Any chance to have an example and some documentation on how to use it ?

+1

Seems like there is an opportunity to create better test samples here. All the issues described above were legitimate and I hit them too.

luisquintanilla commented 11 months ago

Hey folks,

Thanks for looking into this. I've created an issue to look into some of the items mentioned above and track documentation related work.

#6910

lahbton commented 11 months ago

Does anyone know how to make predictions once the trained model is saved? I have a trained model that I've checked thoroughly and the labels appear to be set correctly for my categories based off the EnglishRoberta tokenization mapping process @iuribrindeiro mentioned above.

I'm getting the Splits of the input string to match up the predictions made, but the predictions don't seem to be very reliable. For example, I'm getting a colon (:) predicted for some categories where all of the trained data for that category are 20 character pieces of text. I also have a few date categories that are having odd short strings predicted.

Predict code:

var context = new MLContext()
{
    FallbackToCpu = true,
    GpuDeviceId = 0
};

var trainedModel = context.Model.Load(GetOutputFilePath(), out DataViewSchema _);

var engine = context.Model.CreatePredictionEngine<IndividualTokenModel, PredictionModel>(trainedModel);

PredictionModel predictions = engine.Predict(new IndividualTokenModel { Sentence = request.InputValue });
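
Matching predictions back to the input text, as described above, can be sketched as follows. This is a hypothetical helper, not part of ML.NET: it assumes there is one predicted label per token and that each token's (Start, End) character offsets are available, neither of which is confirmed by the code in this thread. Consecutive tokens are merged only when they share a label and touch each other, so the pieces of one split word become a single entity.

```csharp
using System;
using System.Collections.Generic;

public static class EntityMerger
{
    // Collapses token-level labels into (text, label) entities.
    // End offsets are treated as exclusive.
    public static List<(string Text, string Label)> Merge(
        string sentence,
        IReadOnlyList<(int Start, int End)> tokenSpans,
        IReadOnlyList<string> tokenLabels,
        string outsideLabel = "0")
    {
        if (tokenSpans.Count != tokenLabels.Count)
            throw new ArgumentException("Need exactly one label per token.");

        var entities = new List<(string Text, string Label)>();
        int i = 0;
        while (i < tokenSpans.Count)
        {
            if (tokenLabels[i] == outsideLabel)
            {
                i++;
                continue;
            }

            // Extend the run while the next token has the same label and
            // starts exactly where the current one ends (same word).
            int last = i;
            while (last + 1 < tokenSpans.Count
                   && tokenLabels[last + 1] == tokenLabels[i]
                   && tokenSpans[last + 1].Start == tokenSpans[last].End)
            {
                last++;
            }

            int start = tokenSpans[i].Start;
            int end = tokenSpans[last].End;
            entities.Add((sentence.Substring(start, end - start), tokenLabels[i]));
            i = last + 1;
        }
        return entities;
    }
}
```

With the Liechtenstein sentence, its nine token spans and the labels PERSON, 0, PERSON, 0, 0, COUNTRY, COUNTRY, COUNTRY, COUNTRY, this returns three entities: Alice, Bob and Liechtenstein. Multi-word entities separated by a space (for example a two-word city name) come back as separate entries under this rule. A stray entity consisting of a lone ":" would show up directly in this list, which may make odd predictions easier to spot.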

michaelgsharp commented 10 months ago

OK, some of the issues raised here, especially around sentence vs. token length, should be addressed by this PR: https://github.com/dotnet/machinelearning/pull/6928/checks. Now, when a word gets split into more than one token, ML.NET will automatically handle that so you don't need to worry about it anymore.

Once this PR goes in we will get more examples out. This PR also includes a sample key/data file, and there is code to do a full run with those files (though skipped by default in CI because it's way too big to try and run there).

Leftyx commented 9 months ago

@michaelgsharp Any chance of seeing more examples for NER? Thanks

lahbton commented 8 months ago

@michaelgsharp - Thanks for the update with #6928

Is there any ETA on when this may be released?

MohamedQando commented 6 months ago

Hello folks, I recently tried to use the NER model through Model Builder, but I always get very bad accuracy from the model and I don't know why. My entity key file contains PERSON ORGANIZATION LOCATION

and my data file format is like this:

Charlie works at Microsoft in San Francisco. PERSON 0 LOCATION ORGANIZATION 0 LOCATION 0

Am I missing something?

Or can anyone give me a simple dataset to test the NER scenario? Thank you.

Leftyx commented 6 months ago

Hello folks, I recently tried to use the NER model through Model Builder, but I always get very bad accuracy from the model and I don't know why. My entity key file contains PERSON ORGANIZATION LOCATION

and my data file format is like this:

Charlie works at Microsoft in San Francisco. PERSON 0 LOCATION ORGANIZATION 0 LOCATION 0

Am I missing something?

Or can anyone give me a simple dataset to test the NER scenario? Thank you.

@MohamedQando: Could you share an example or some code ? Maybe we can help.

MohamedQando commented 6 months ago

Screenshot 2024-05-06 090554

I'm trying to train a sample NER model using Model Builder, but the model accuracy is very low and it always extracts the wrong feature.

image

Accuracy is 0%. I notice the model always adds an empty string as an entity, and this empty string causes all the issues.

@Leftyx

MohamedKando commented 4 months ago

Can anyone provide me a sample dataset so I can test the model? Thx @Leftyx

Leftyx commented 4 months ago

Can anyone provide me a sample dataset so I can test the model? Thx @Leftyx

Hi @MohamedKando , I am sorry I cannot help you there. I have stopped using ML.NET for NER as it is not ready yet and after a few months (years?) of waiting for something usable I have decided to give up. I don't think NER will ever be ready. Maybe you can ask @luisquintanilla for some help. He promised samples almost 1 year ago but so far I haven't seen much.

MaxAkbar commented 4 months ago

Yeah same here, I am now using the Phi-3 models for that.

ericstj commented 3 months ago

@michaelgsharp - Thanks for the update with #6928

Is there any ETA on when this may be released?

I believe this went out in ML.NET 3.0.1. https://www.nuget.org/packages/Microsoft.ML/3.0.1 https://www.nuget.org/packages/Microsoft.ML.TorchSharp/0.21.1

Pinging @michaelgsharp for the state of NER and @JakeRadMSFT for Model Builder.

michaelgsharp commented 3 months ago

It went out with ML.NET 3.0.1. I don't remember seeing any issues with extra entities or strings. I can take a look into it though in the next couple of days. I wonder if it has something to do with Model builder itself.

@JakeRadMSFT @luisquintanilla @LittleLittleCloud Do you know the status of NER in Model Builder? Have you seen any issues with it?