MaxAkbar opened this issue 6 years ago
NER is named entity recognition, meaning we extract entities from some text. If you want intent, that is something else. NLP (natural language processing) is about understanding human language, whether read, written, or spoken. That said, NLP groups together smaller tasks like NER, intent, sentiment (something ML.NET does well now), relations, and more. This topic was specific to NER.
Azure has the LUIS service, which I have used to automate customer service requests and direct users appropriately. The Azure LUIS service interprets user intent and goals extremely well.
+1 to NER in ML.NET, and also a tool to apply custom labels to texts would be awesome, allowing the whole process to be done within one tool.
For our use case, we need to use quite a few custom labels. Here's an example:
Thanks all for the replies. As far as tagging tools, what do you use today? I see AWS Sagemaker and LUIS were mentioned.
Tagging: AWS SageMaker Ground Truth
Running: AWS Comprehend Custom Entity Recognition
+1, wanted to chime in with another use case. I'm looking to detect work item numbers in text. These can be in GitHub-like #-format, or JIRA-like PROJ-1234 format, but often they are formatted in different ways as they are input by humans.
Anonymized examples and the expected results (scores not shown):
#1234: UI polish on the Login screen -> ["1234"]
Work item 2234: Fix issue with forgot password -> ["2234"]
Regression testing on 457 found: User could interrupt upload when... (snipped) -> ["457"]
Re-test items #2345, 2346, and 2347 -> ["2345", "2346", "2347"]
Begin scaffolding work for PROJ-5678 -> ["PROJ-5678"]
#56: Upgrade to Angular 14 -> ["56", "14"] - note: I'd hope that "14" here has a lower score
Examples of false positives that I'd hope would have lower scores:
Design session #2
Completing the Zone 18 Recap report...
Analyzing HTML for Section 508 compliance
As you can imagine, a simple regex approach gets a good ways there, but doesn't have scores. Simple number recognition isn't really an ideal solution either. Would be nice to have the ability to train custom models for this somehow. Hope this helps your planning, even if this is not a supported use case.
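Purely as an illustration of the regex baseline mentioned above (this is not the poster's actual code), something like the following catches the formats listed, but it has no notion of confidence, which is exactly the gap a scored ML model would fill:

using System;
using System.Text.RegularExpressions;

// Illustrative only: a naive work-item detector covering the formats mentioned above.
var pattern = new Regex(
    @"(?<jira>[A-Z][A-Z0-9]+-\d+)" +            // JIRA-style: PROJ-5678
    @"|#(?<hash>\d+)" +                          // GitHub-style: #1234
    @"|\b(?:work item|item)\s+#?(?<plain>\d+)",  // prose style: "Work item 2234"
    RegexOptions.IgnoreCase);

foreach (Match m in pattern.Matches("Re-test items #2345 and PROJ-5678"))
    Console.WriteLine(m.Value); // prints "#2345" and "PROJ-5678"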
@paulirwin From what I understand, NER doesn't do that. I suppose you could use a regex during entity selection, and I believe the Stanford NER has this, but I haven't dug too deep. Have a look at Recognizers-Text, maybe that will help: https://github.com/microsoft/Recognizers-Text.
@MaxAkbar The 7-class Stanford NER detects money, percentages, and dates/times, and the Azure Text Analytics API can detect quantities, phone numbers, and IP addresses as entities, so work items (or more abstractly, alphanumeric identifiers) doesn't seem all that far-fetched for NER. But just wanted to throw my use case in the hat in case it was interesting 😄
@paulirwin Actually, Stanford NER does allow custom entities to be trained; I have a link somewhere in my earlier posts. I have done some testing with it, but its documentation is not easy to follow :) Maybe now they have something a little better.
What I wanted to say about the Text Recognizers is that you can follow their example and create your own, modeling after the same pattern.
You are right to collect as many use cases as possible and see if this is something that can be done.
I agree that for ML NER to be most useful it should be trainable. If there were only fixed-type extractors (money, percentage, email address...), I believe there are better (deterministic) ways to do it. For example, IP addresses and emails can and should be extracted with a regex that also validates the format. That said, I do not see anything wrong with having more entity recognizers which are not based on machine learning. Anything helps, even a tested library for IP address search with regex.
My no. 1 use case is places in general, including inflections in different languages, especially places which I cannot detect from a list of words the way I could with, say, countries. For my use case, "places" means anything you would search for in a map app (cities, streets, regions, landmarks...).
Speculation (not my actual use cases)
I would use NER to extract the data for a reservation from unformatted free text, such as the body of an email: time, date, number of people, name, room, phone number, extra wishes, comments, and so on. It has become pretty quiet around this topic here. Is it even still on the agenda?
After 4 years since the first post we are still waiting for NER in ML.NET. Any news? If anyone is interested, I put together some code to use Hugging Face models, exported to ONNX and used with ML.NET. It seems to work quite well.
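(The commenter's code isn't included here; the following is only a rough sketch of that general approach, assuming a token-classification model exported to ONNX with inputs named input_ids/attention_mask and an output named logits. Tokenization has to happen separately, and the Microsoft.ML.OnnxTransformer package is required.)

using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;

// The input/output shapes below are assumptions about the exported model; adjust to your export.
public class NerOnnxInput
{
    [VectorType(1, 256)]
    [ColumnName("input_ids")]
    public long[] InputIds { get; set; }

    [VectorType(1, 256)]
    [ColumnName("attention_mask")]
    public long[] AttentionMask { get; set; }
}

public class NerOnnxOutput
{
    [ColumnName("logits")]
    public float[] Logits { get; set; }
}

public static class HuggingFaceNerOnnx
{
    public static ITransformer Build(MLContext ml, string onnxModelPath)
    {
        // Wire the exported model's named tensors into an ML.NET pipeline.
        var pipeline = ml.Transforms.ApplyOnnxModel(
            outputColumnNames: new[] { "logits" },
            inputColumnNames: new[] { "input_ids", "attention_mask" },
            modelFile: onnxModelPath);

        // Fit on an empty IDataView just to materialize the transformer (no training happens).
        var empty = ml.Data.LoadFromEnumerable(new List<NerOnnxInput>());
        return pipeline.Fit(empty);
    }
}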
NER support was added to ML.NET as part of this PR.
https://github.com/dotnet/machinelearning/pull/6760
Here is a sample showing how to use the API
Good morning @luisquintanilla
I haven't found any docs / tutorials / samples about NER in ML.NET. I guess that's part of the roadmap where it states that documentation is to be updated, but then a couple of questions arise.
In the sample you point to, it checks the detection of person / city / country entities. I'm assuming one could use custom entities, the ones that fit the dataset the model is trained with, since in the background it's a multiclass classifier, right? Any idea on when there will be docs (and, ideally, samples with custom labels) for NER?
Thank you.
Hi @AAD-eNavarro,
We don't have a timeline on the full tutorial / sample.
You are right though that you should be able to use custom entities as part of your dataset. Let us know if you run into issues though.
@luisquintanilla Thank you for this 🙏. I have a few questions. Is there a limit to the length of the sentence? Also, how many sentences should we provide to help the NER extract entities? In your unit test, you have the same sentence, should it not be different?
Is there a limit to the length of the sentence? - 512 tokens.
How many sentences should we provide to help the NER extract entities? - Good question. There's no hard cutoff. The more examples the better, but also make sure those samples are representative.
In your unit test, you have the same sentence; should it not be different? - Yes, that should be fixed. Thanks for catching that.
Hi @luisquintanilla
I'm trying to replicate the test you mention, but even when I use the preview.23266.6 version of Microsoft.ML I get three errors:
The name 'ML' does not exist in the current context -> wherever ML is used
'SchemaShape' does not contain a definition for 'Create' -> on line 60
The name 'TestEstimatorCore' does not exist in the current context -> on line 71
I'm guessing it's because I haven't been able to find the Microsoft.ML.RunTests package; at least the last two errors look like it. I can't get my mind around the first one. How can I run NER without the tests package, as if it were for production?
(PS: I'm sick today, so I may be missing an obvious point. If that's the case, please point it out for me, and my apologies.)
Hello, I've tried the new preview version, but the error below is raised.
Am I missing something ?
Exception :
Unhandled exception. System.Runtime.InteropServices.ExternalException (0x80004005): Expected input batch_size (20) to match target batch_size (16).
Exception raised from nll_loss_nd_symint at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\LossNLL.cpp:682 (most recent call first):
00007FF8342FD24200007FF8342FD1E0 c10.dll!c10::Error::Error [<unknown file> @ <unknown line number>]
00007FF8342C54A500007FF8342C5430 c10.dll!c10::ValueError::ValueError [<unknown file> @ <unknown line number>]
00007FF8220CAE4C00007FF8220CA260 torch_cpu.dll!at::native::nll_loss_nd_symint [<unknown file> @ <unknown line number>]
00007FF822F5EC5600007FF822F5AAC0 torch_cpu.dll!at::compositeimplicitautograd::where [<unknown file> @ <unknown line number>]
00007FF822F4611900007FF822F00E20 torch_cpu.dll!at::compositeimplicitautograd::broadcast_to_symint [<unknown file> @ <unknown line number>]
00007FF82275D45300007FF82275D270 torch_cpu.dll!at::_ops::nll_loss_nd::call [<unknown file> @ <unknown line number>]
00007FF8220C995600007FF8220C95E0 torch_cpu.dll!at::native::cross_entropy_loss_symint [<unknown file> @ <unknown line number>]
00007FF822F5CDA500007FF822F5AAC0 torch_cpu.dll!at::compositeimplicitautograd::where [<unknown file> @ <unknown line number>]
00007FF822F461BB00007FF822F00E20 torch_cpu.dll!at::compositeimplicitautograd::broadcast_to_symint [<unknown file> @ <unknown line number>]
00007FF8229B3DDC00007FF8229B3B70 torch_cpu.dll!at::_ops::cross_entropy_loss::call [<unknown file> @ <unknown line number>]
00007FF8336FC6B600007FF8336FC4F0 LibTorchSharp.DLL!THSNN_cross_entropy [<unknown file> @ <unknown line number>]
00007FF7E432FA25 <unknown symbol address> !<unknown symbol> [<unknown file> @ <unknown line number>]
at TorchSharp.torch.CheckForErrors()
at TorchSharp.Modules.CrossEntropyLoss.forward(Tensor input, Tensor target)
at Microsoft.ML.TorchSharp.NasBert.NasBertTrainer`2.NasBertTrainerBase.RunModelAndBackPropagate(List`1& inputTensors, Tensor& targetsTensor)
at Microsoft.ML.TorchSharp.TorchSharpBaseTrainer`2.TrainerBase.TrainStep(IHost host, DataViewRowCursor cursor, ValueGetter`1 labelGetter, List`1& inputTensors, List`1& targets)
at Microsoft.ML.TorchSharp.TorchSharpBaseTrainer`2.TrainerBase.Train(IHost host, IDataView input)
at Microsoft.ML.TorchSharp.TorchSharpBaseTrainer`2.Fit(IDataView input)
at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
Dependencies and code :
MLContext mLContext = new();

var labels = mLContext.Data.LoadFromEnumerable(
    new[] { new Label { Key = "PERSON" }, new Label { Key = "CITY" }, new Label { Key = "COUNTRY" } });

var dataView = mLContext.Data.LoadFromEnumerable(
    new List<TestSingleSentenceData>(new TestSingleSentenceData[] {
        new TestSingleSentenceData()
        {
            Sentence = "Alice and Bob live in Liechtenstein",
            //Sentence = "Alice and Bob live in France",
            Label = new string[]{ "PERSON", "0", "PERSON", "0", "0", "COUNTRY" }
        },
        new TestSingleSentenceData()
        {
            Sentence = "Alice and Bob live in the USA",
            Label = new string[]{ "PERSON", "0", "PERSON", "0", "0", "0", "COUNTRY" }
        },
    }));

var chain = new EstimatorChain<ITransformer>();
var estimator = chain.Append(mLContext.Transforms.Conversion.MapValueToKey("Label", keyData: labels))
    .Append(mLContext.MulticlassClassification.Trainers.NameEntityRecognition(outputColumnName: "LabelsOut"))
    .Append(mLContext.Transforms.Conversion.MapKeyToValue("LabelsOut"));

var transformer = estimator.Fit(dataView);
Thanks, Florian
Hi everyone
I've been keeping an eye on this feature for a while & have created a very simple test project to use the code from the TestSimpleNer method in the NerTests class to see if it would run through OK, but I am seeing an error at the point of creating the transformer object: Exception raised at: var transformer = estimator.Fit(dataView); Error message: Field not found: 'TorchSharp.torch.CUDA'
The error seems to indicate that a field definition is missing, but the name appears to be a TorchSharp component?
I only want to use CPU (& not GPU), so have the following 3 packages: Microsoft.ML (3.0.0-preview.23511.1) Microsoft.ML.TorchSharp (0.21.0-preview.23511.1) Torchsharp-cpu (0.101.1)
I posted on StackOverflow but I'm assuming that there is little knowledge out there because of how new this feature is: https://stackoverflow.com/questions/77440001/cuda-issue-with-ner-named-entity-recognition-for-ml-predictions
Any help would be greatly appreciated.
I have tried to reply with the best of my knowledge. You might not like the answer though :-)
Thank you SO much for the response. I'll take a further look and see if I can get the demo working with your tips.
I've had several people look at this and posted in several places, but it's been silent up to now. It seems as though this new ML feature is too new (with very little documentation) for there to be much knowledge out there.
Thanks again!
Hello @lahbton and @Leftyx, I've managed to get your example to work, but I've turned it into a console app. The problem came from the version of "libtorch-cpu-win-x64" (or whichever libtorch package) you were using. Microsoft.ML 3.0.0-preview.23511.1 and Microsoft.ML.TorchSharp 0.21.0-preview.23511.1 use "libtorch-cpu-win-x64" (or the equivalent for your platform) version 1.13.0.1.
test/Microsoft.ML.Tests/Microsoft.ML.Tests.csproj
<ItemGroup Condition="'$(TargetArchitecture)' == 'x64'">
<PackageReference Include="libtorch-cpu-win-x64" Version="$(LibTorchVersion)" Condition="$([MSBuild]::IsOSPlatform('Windows')) AND '$(TargetArchitecture)' == 'x64'" />
<!-- <PackageReference Include="TorchSharp-cuda-windows" Version="0.99.5" Condition="$([MSBuild]::IsOSPlatform('Windows'))" /> -->
<PackageReference Include="libtorch-cpu-linux-x64" Version="$(LibTorchVersion)" Condition="$([MSBuild]::IsOSPlatform('Linux')) AND '$(TargetArchitecture)' == 'x64'" />
<PackageReference Include="libtorch-cpu-osx-x64" Version="$(LibTorchVersion)" Condition="$([MSBuild]::IsOSPlatform('OSX')) AND '$(TargetArchitecture)' == 'x64'" />
</ItemGroup>
eng/Versions.props
<LibTorchVersion>1.13.0.1</LibTorchVersion>
Here is my code : Program.cs
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.TorchSharp;
public class Program
{
    // Main method
    public static void Main(string[] args)
    {
        try
        {
            var context = new MLContext();
            context.FallbackToCpu = true;
            context.GpuDeviceId = null;

            var labels = context.Data.LoadFromEnumerable(
                new[] {
                    new Label { Key = "PERSON" },
                    new Label { Key = "CITY" },
                    new Label { Key = "COUNTRY" }
                });

            var dataView = context.Data.LoadFromEnumerable(
                new List<TestSingleSentenceData>(new TestSingleSentenceData[] {
                    new TestSingleSentenceData()
                    { // Testing longer than 512 words.
                        Sentence = "Alice and Bob live in the USA",
                        Label = new string[]{ "PERSON", "0", "PERSON", "0", "0", "0", "COUNTRY" }
                    },
                    new TestSingleSentenceData()
                    {
                        Sentence = "Alice and Bob live in the USA",
                        Label = new string[]{ "PERSON", "0", "PERSON", "0", "0", "0", "COUNTRY" }
                    },
                }));

            var chain = new EstimatorChain<ITransformer>();
            var estimator = chain.Append(context.Transforms.Conversion.MapValueToKey("Label", keyData: labels))
                .Append(context.MulticlassClassification.Trainers.NameEntityRecognition(outputColumnName: "outputColumn"))
                .Append(context.Transforms.Conversion.MapKeyToValue("outputColumn"));

            var transformer = estimator.Fit(dataView);
            transformer.Dispose();

            Console.WriteLine("Success!");
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Error: {ex.Message}");
        }
    }

    private class Label
    {
        public string Key { get; set; }
    }

    private class TestSingleSentenceData
    {
        public string Sentence;
        public string[] Label;
    }
}
ConsoleApp1.csproj
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<OutputType>Exe</OutputType>
<TargetFramework>net7.0</TargetFramework>
<ImplicitUsings>enable</ImplicitUsings>
<Nullable>enable</Nullable>
</PropertyGroup>
<ItemGroup>
<PackageReference Include="libtorch-cpu-win-x64" Version="1.13.0.1" />
<PackageReference Include="Microsoft.ML" Version="3.0.0-preview.23511.1" />
<PackageReference Include="Microsoft.ML.TorchSharp" Version="0.21.0-preview.23511.1" />
</ItemGroup>
</Project>
Best Regards,
anrouxel
@anrouxel your solution worked for me, thx!
But now I'm getting the same issue as @florianA1:
System.Runtime.InteropServices.ExternalException: 'Expected input batch_size (9) to match target batch_size (8). Exception raised from nll_loss_nd at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\LossNLL.cpp:652 (most recent call first):
My trainingData: [{Sentence = "Alice and Bob live in the ACCO"; Label = ["0"; "0"; "0"; "0"; "0"; "0"; "TODTO";] |> List.toArray}]
When I set the Sentence as "Alice and Bob live in the ACC", it works just fine.
I'm new to the whole machine learning thing, so I'm imagining that we have to transform the Sentences with something before giving them to the model to train? Or do you think this could be a bug?
anrouxel, your solution works great. Good work there :smiley:
Hello @Leftyx, I don't know if I can ask the question here. But I'd like to know how to export a "NameEntityRecognition" model to ONNX.
Best Regards,
anrouxel
Hello, I've tried the new preview version, but the error below is raised.
* Sentence = "Alice and Bob live in France" : works * Sentence = "Alice and Bob live in Liechtenstein" : doesn't work
Am I missing something ?
Dependencies:
* Microsoft.ML => 3.0.0-preview.23511.1
* Microsoft.ML.TorchSharp => 0.21.0-preview.23511.1
* TorchSharp-cpu => 0.100.4
florianA1, the issue you are having is related to how words are tokenized by the EnglishRoberta tokenizer (the one used here). The word Liechtenstein is tokenized into Lie, ch, ten, stein, so your labels should be defined this way:
{
Sentence = "Alice and Bob live in Liechtenstein",
Label = new string[]{"PERSON", "0", "PERSON", "0", "0", "COUNTRY", "COUNTRY", "COUNTRY", "COUNTRY" }
},
4 tokens for the word.
Hello @Leftyx, I don't know if I can ask the question here. But I'd like to know how to export a "NameEntityRecognition" model to ONNX.
Best Regards,
anrouxel
anrouxel, I don't think you can. If you check the NerTrainer, it states: Exportable to ONNX | No
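Since ONNX export isn't available for this trainer, the usual fallback is ML.NET's own model serialization. A minimal sketch, reusing the mlContext/transformer/dataView names from the samples earlier in this thread (the file name is just an example):

// Save the fitted pipeline plus the input schema to a .zip file...
mlContext.Model.Save(transformer, dataView.Schema, "ner-model.zip");

// ...and load it later in the consuming application.
var loadedModel = mlContext.Model.Load("ner-model.zip", out DataViewSchema inputSchema);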
Thanks @Leftyx, I guess "NameEntityRecognition" isn't compatible with the French language. I'm asking because I'm working on a student project to create a healthcare application that stores prescriptions. To store a prescription, I need to retrieve the name of the drug, the quantity, and other information after extracting the data with OCR. The OCR part is functional, but my problem is that ML.NET's NER is the blocker for French.
Best regards
The OCR part is functional, but my problem is that ML.NET's NER is the blocker for French.
@anrouxel , that's interesting. I am working on something similar :smiley:. Also in French. And also from data extracted with an OCR. We have a few blockers on that side too. We might be able to share experience.
Going back to your problem: it seems ML.NET is using an EnglishRoberta tokenizer, which loads the dictionary and vocabulary embedded in Microsoft.ML.TorchSharp.
If there is a way to contact you we could have a private conversation and share experience, if that's ok with you.
@Leftyx I'm a bit confused by your last response regarding the EnglishRoberta usage for this model...
I have a dataset of sentences and, like @florianA1, I also thought that the shape of the data to train the model should simply be: for every word in a sentence, use "0" for the words that are not your targets, and "{TARGETLABEL}" for the words that are your targets. It seems like this is not true. How would I shape the training data to follow the EnglishRoberta model?
Maybe this is not the correct place to ask this, but maybe I'm missing something on how to transform data?
@Leftyx Where did you find the Tokenizer GetInstance? I can't find where the text is being tokenized or where the tokenizer is created, since GetInstance does not exist in TokenizerExtensions. I'm using Microsoft.ML.TorchSharp 0.21.0-preview.23511.1 and Microsoft.ML 3.0.0-preview.23511.1.
For those who are also wondering how to map your training data to look like the Test Case mentioned by @luisquintanilla here: https://github.com/dotnet/machinelearning/issues/630#issuecomment-1742221885
What I did was:
1. Instantiate the EnglishRoberta class with the 3 files @Leftyx mentioned here: https://github.com/dotnet/machinelearning/issues/630#issuecomment-1806867872. You can also download these from this repo: https://github.com/dotnet/machinelearning/tree/7fe293da31a05b70dddf4eba439f7bc23e3016c6/src/Microsoft.ML.TorchSharp/Resources
2. Create a Tokenizer using the EnglishRoberta instance, as you can see here: https://github.com/dotnet/machinelearning/blob/main/src/Microsoft.ML.TorchSharp/Extensions/TokenizerExtensions.cs#L34
3. Call the Encode method from the instantiated Tokenizer.
4. Find each word you want to label in Token.Splits and get the offset range of it.
5. For every Token.Offsets entry that is inside that range, map it to your label; for all that are not, map it to "0".
With that, you should have the "Sentence" and the list of "Labels" that corresponds to it according to the EnglishRoberta tokenizer.
There is probably a better way of mapping the training data but I just don't know. Please feel free to correct me with a better solution.
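To make that mapping step concrete, here is a rough sketch of the overlap logic only; the TokenSpan type below is a hypothetical stand-in for whatever the real Encode call returns, so adapt the member names to the actual tokenizer output:

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical token shape: the token text plus its character offsets in the sentence.
public record TokenSpan(string Value, int Start, int End);

public static class LabelAlignment
{
    // Expand word-level labels ("0" for non-entities) to one label per token,
    // by checking which word span each token's character range overlaps.
    public static string[] Align(
        string sentence,
        (string Word, string Label)[] wordLabels,
        IReadOnlyList<TokenSpan> tokens)
    {
        // Locate each annotated word in the sentence (first match after the previous word).
        var spans = new List<(int Start, int End, string Label)>();
        int cursor = 0;
        foreach (var (word, label) in wordLabels)
        {
            int start = sentence.IndexOf(word, cursor, StringComparison.Ordinal);
            if (start < 0) continue;
            spans.Add((start, start + word.Length, label));
            cursor = start + word.Length;
        }

        // A token takes the label of the entity span it overlaps, otherwise "0".
        return tokens
            .Select(t =>
                spans.FirstOrDefault(s => s.Label != "0" && t.Start < s.End && t.End > s.Start)
                     .Label ?? "0")
            .ToArray();
    }
}

Fed the tokens for "Alice and Bob live in Liechtenstein" and the word-level PERSON/0/COUNTRY annotations, this would produce one label per token, including the four COUNTRY labels for Lie/ch/ten/stein that Leftyx showed above.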
I am also having the same issues as above.
The EnglishRoberta model is generating seemingly random tokens with text cut off across multiple token objects. This makes it very difficult to process the Label value for any matches.
To confirm, is the issue caused by a difference between the number of Labels ('0' or a custom category name) in the training data and the number of tokens generated by the EnglishRoberta model? Would a straight mapping actually work? For example, if 'Liechtenstein' is split into 'Lie', 'ch', 'ten' and 'stein', would setting these Labels to 'COUNTRY' cause all 4 split words to be seen as countries?
Is it possible to either:
@iuribrindeiro Sorry I couldn't reply before. But you are right, that is what happens. The only way to find how it works is to debug. There is no real documentation and I think the NER integration is not really ready for prime-time.
@luisquintanilla I can see you have just released ML.NET 3.0 and NER is part of the package. Any chance to have an example and some documentation on how to use it ?
Any chance to have an example and some documentation on how to use it ?
+1
Seems like there is an opportunity to create better test samples here. All the issues described above were legitimate and I hit them too.
Hey folks,
Thanks for looking into this. I've created an issue to look into some of the items mentioned above and track documentation related work.
Does anyone know how to make predictions once the trained model is saved? I have a trained model that I've checked thoroughly and the labels appear to be set correctly for my categories based off the EnglishRoberta tokenization mapping process @iuribrindeiro mentioned above.
I'm getting the Splits of the input string to match up the predictions made, but the predictions don't seem to be very reliable. For example, I'm getting a colon (:) predicted for some categories where all of the trained data for that category are 20 character pieces of text. I also have a few date categories that are having odd short strings predicted.
Predict code:
var context = new MLContext()
{
FallbackToCpu = true,
GpuDeviceId = 0
};
var trainedModel = context.Model.Load(GetOutputFilePath(), out DataViewSchema _);
var engine = context.Model.CreatePredictionEngine<IndividualTokenModel, PredictionModel>(trainedModel);
PredictionModel predictions = engine.Predict(new IndividualTokenModel { Sentence = request.InputValue });
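The IndividualTokenModel and PredictionModel classes referenced above aren't shown; a plausible shape, assuming the output column was named "outputColumn" as in the earlier training sample, would be:

using Microsoft.ML.Data;

public class IndividualTokenModel
{
    public string Sentence { get; set; }
}

public class PredictionModel
{
    // One predicted label per token; mapped back to strings by MapKeyToValue.
    [ColumnName("outputColumn")]
    public string[] PredictedLabels { get; set; }
}

That by itself won't change prediction quality, of course; it's only the shape needed to get Predict to run.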
OK, some of the issues raised here, especially around sentence vs. token length, should be addressed by this PR: https://github.com/dotnet/machinelearning/pull/6928/checks. Now, when a word gets split into more than one token, ML.NET will automatically handle that, so you don't need to worry about it anymore.
Once this PR goes in we will get more examples out. The PR also includes a sample key/data file, and there is code to do a full run with those files (though it is skipped by default in CI because it's way too big to run there).
@michaelgsharp Any chance of seeing more examples for NER? Thanks
@michaelgsharp - Thanks for the update with #6928
Is there any ETA on when this may be released?
Hello folks, I recently tried to use the NER model through Model Builder, but I always get very bad accuracy from the model and I don't know why. My entity key file contains: PERSON ORGANIZATION LOCATION
and my data file format is like this:
Charlie works at Microsoft in San Francisco. PERSON 0 LOCATION ORGANIZATION 0 LOCATION 0
Am I missing something?
Or can anyone give me a simple dataset to test the NER scenario? Thank you.
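For what it's worth, here is a tiny illustrative dataset in that same shape, purely as a sketch: each row is a sentence followed by one label per word, in the same left-to-right order as the words (double-check the exact column and delimiter format Model Builder expects against its documentation).

Charlie works at Microsoft in San Francisco.    PERSON 0 0 ORGANIZATION 0 LOCATION LOCATION
Maria moved from Paris to Toronto last year.    PERSON 0 0 LOCATION 0 LOCATION 0 0
Contoso opened a new office in Berlin.          ORGANIZATION 0 0 0 0 0 LOCATION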
@MohamedQando: Could you share an example or some code ? Maybe we can help.
I'm trying to train a sample NER model using Model Builder, but the model accuracy is very low and it always extracts a wrong feature; accuracy is 0%. I noticed the model always adds an empty string as an entity, and this empty string causes all the issues.
@Leftyx Can anyone provide me with a sample dataset so I can test the model? Thanks.
Hi @MohamedKando, I am sorry I cannot help you there. I have stopped using ML.NET for NER as it is not ready yet, and after a few months (years?) of waiting for something usable I have decided to give up. I don't think NER will ever be ready. Maybe you can ask @luisquintanilla for some help. He promised samples almost a year ago, but so far I haven't seen much.
Yeah same here, I am now using the Phi-3 models for that.
@michaelgsharp - Thanks for the update with #6928
Is there any ETA on when this may be released?
I believe this went out in ML.NET 3.0.1. https://www.nuget.org/packages/Microsoft.ML/3.0.1 https://www.nuget.org/packages/Microsoft.ML.TorchSharp/0.21.1
Pinging @michaelgsharp for the state of NER and @JakeRadMSFT for Model Builder.
It went out with ML.NET 3.0.1. I don't remember seeing any issues with extra entities or strings. I can take a look into it though in the next couple of days. I wonder if it has something to do with Model builder itself.
@JakeRadMSFT @luisquintanilla @LittleLittleCloud do you know the status of NER in model builder? have you seen any issues with it?
Hello ML.NET,
Is there any way I can use ML.NET to create named entities?
Thanks, -Max