isaacabraham closed this issue 6 years ago
Some notes: the file uses LF for line endings. Opening it in VS prompts me that the line endings are not consistent and asks if I want to normalize them.
@isaacabraham - are you on Windows? If so, can you open the imdb_labelled.txt file in VS, "normalize" the line endings to Windows (CR LF), and see if that fixes the problem?
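For anyone without VS handy, the same normalization can be sketched from a shell. This is only an illustration, so it works on a throwaway demo file rather than the real imdb_labelled.txt:

```shell
# Create a small demo file with Unix (LF) line endings.
printf 'line one\nline two\n' > demo_lf.txt

# Append a CR before each LF - the same CRLF normalization VS offers.
sed 's/$/\r/' demo_lf.txt > demo_crlf.txt
```

Running the same `sed` over the real data file (or opening it in VS and accepting the normalize prompt) should produce identical CR LF endings throughout.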
The other interesting thing I've noted about this data set is that it has 6 formatting errors:
Warning: Format error at (20,1)-(20,98): Illegal quoting
Warning: Format error at (323,1)-(323,17): Illegal quoting
Warning: Format error at (348,1)-(348,102): Illegal quoting
Warning: Format error at (197,1)-(197,21): Illegal quoting
Warning: Format error at (213,1)-(213,117): Illegal quoting
Warning: Format error at (845,1)-(845,101): Illegal quoting
Looking at the data, there are unmatched double quotes (") in some lines, for example:
" The structure of this film is easily the most tightly constructed in the history of cinema. 1
It is working for me on .NET Core 2.0. Another thing to try is to scrub these 6 formatting errors out of the file by removing the unmatched double quotes.
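A sketch of that scrub, assuming a simple heuristic (file names here are hypothetical): any line containing an odd number of double quotes has all of its quotes removed, which matches the "Illegal quoting" lines reported above.

```shell
# Sample data: line 2 has an unmatched double quote, as in the report above.
printf 'A good movie.  1\n" The structure of this film is tight.  1\n' > data.txt

# awk: count the quotes on each line; if the count is odd, strip them all.
awk '{ n = gsub(/"/, "\""); if (n % 2 == 1) gsub(/"/, ""); print }' data.txt > clean.txt
```

Lines with balanced quotes pass through untouched, so only the six offending lines would change.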
Yes, Windows here! I've just tried it in VS2017 (previously I was using Code) and have normalised the line endings. Now I get a completely different error:
System.InvalidOperationException: Entry point 'Transforms.TextFeaturizer' not found
at Microsoft.ML.Runtime.EntryPoints.EntryPointNode..ctor(IHostEnvironment env, ModuleCatalog moduleCatalog, RunContext context, String id, String entryPointName, JObject inputs, JObject outputs, Boolean checkpoint, String stageId, Single cost)
at Microsoft.ML.Runtime.EntryPoints.EntryPointNode.ValidateNodes(IHostEnvironment env, RunContext context, JArray nodes, ModuleCatalog moduleCatalog)
at Microsoft.ML.Runtime.EntryPoints.EntryPointGraph..ctor(IHostEnvironment env, ModuleCatalog moduleCatalog, JArray nodes)
at Microsoft.ML.Runtime.Experiment.Compile()
at Microsoft.ML.LearningPipeline.Train[TInput,TOutput]()
at <StartupCode$FSI_0010>.$FSI_0010.main@() in C:\Users\Isaac\Source\Repos\scratchpad\ml.fsx:line 25
Stopped due to error
Is there any runtime reflection / lookups for this stuff? It all compiles in the script file - just when that Train method is called, it goes pop.
Note - even with the normalised file I still get that error in Code.
(Sorry, I'm not even a novice in F#.) Can you show what is in C:\Users\Isaac\Source\Repos\scratchpad\.paket\load\netstandard2.0\ML\ml.group.fsx?
Is there any runtime reflection / lookups for this stuff?
Yes, ML.NET uses a "catalog" of components, which are discovered and invoked using reflection. See https://github.com/dotnet/machinelearning/blob/c023727b76970ab913ec1ce38276508835c17bcf/src/Microsoft.ML.Core/ComponentModel/ComponentCatalog.cs#L399-L414 for reference.
You can set "allowQuotedStrings = false" in TextLoader. I see that the text columns are not quoted for every example except a few. This sometimes causes the "The size of input lines is not consistent" error.
@zeahmed Thanks - unfortunately changing to that gives a different error: Source column 'SentimentText' not found.
@eerhardt no problem. The file is generated by Paket to load in all the assemblies required as dependencies from the ML library. Here's what it contains:
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.Core.dll"
#r "../../../../../../../.nuget/packages/system.reflection.emit.lightweight/4.3.0/lib/netstandard1.3/System.Reflection.Emit.Lightweight.dll"
#r "../../../../../../../.nuget/packages/system.reflection.emit.ilgeneration/4.3.0/lib/netstandard1.3/System.Reflection.Emit.ILGeneration.dll"
#r "../../../../../../../.nuget/packages/google.protobuf/3.5.1/lib/netstandard1.0/Google.Protobuf.dll"
#r "../../../../../../../.nuget/packages/newtonsoft.json/11.0.2/lib/netstandard2.0/Newtonsoft.Json.dll"
#r "../../../../../../../.nuget/packages/system.codedom/4.4.0/lib/netstandard2.0/System.CodeDom.dll"
#r "../../../../../../../.nuget/packages/system.threading.tasks.dataflow/4.8.0/lib/netstandard2.0/System.Threading.Tasks.Dataflow.dll"
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.UniversalModelFormat.dll"
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.Maml.dll"
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.InternalStreams.dll"
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.CpuMath.dll"
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.Data.dll"
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.Transforms.dll"
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.ResultProcessor.dll"
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.PCA.dll"
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.KMeansClustering.dll"
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.FastTree.dll"
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.Api.dll"
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.Sweeper.dll"
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.dll"
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.StandardLearners.dll"
#r "../../../../../../../.nuget/packages/microsoft.ml/0.1.0/lib/netstandard2.0/Microsoft.ML.PipelineInference.dll"
#r "System"
#r "System.ComponentModel.Composition"
#r "System.Core"
@eerhardt
The other interesting thing I've noted about this data set is that it has 6 formatting errors:
I also see the Warnings, which are expected based on https://github.com/dotnet/docs/issues/5256#issuecomment-388070561.
I was wondering if there is a way to suppress these Warnings on the console?
In C# does this sample then work? Or is it the same issue with the sample data file?
Yes, a working example is here: https://github.com/dotnet/docs/issues/5330. Get rid of all the data-loading warning messages by setting "allowQuotedStrings: false".
Small update here. I have managed to get this working within a console application by also removing the use of records and replacing them with mutable classes. This is - from an F# perspective - undesirable but at least it's a starting point.
I'm still unable to get it to work from a script, however, which in my opinion is very important from a data analysis point of view (@mathias-brandewinder can probably explain the rationale for why this is better than I can. Or probably any Python machine learning person...). The error I'm now seeing is:
Binding session to 'c:\Users\Isaac\Source\Repos\scratchpad\.paket\load\netstandard2.0\ML\../../../../../../../.nuget/packages/newtonsoft.json/11.0.2/lib/netstandard2.0/Newtonsoft.Json.dll'...
Not adding a normalizer.
Making per-feature arrays
Changing data from row-wise to column-wise
System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.InvalidOperationException: Splitter/consolidator worker encountered exception while consuming source data ---> System.DllNotFoundException: Unable to load DLL 'CpuMathNative': The specified module could not be found. (Exception from HRESULT: 0x8007007E)
at Microsoft.ML.Runtime.Internal.CpuMath.Thunk.SumSqU(Single* ps, Int32 c)
at Microsoft.ML.Runtime.Data.LpNormNormalizerTransform.<>c__DisplayClass27_0.<GetGetterCore>b__5(VBuffer`1& dst)
at Microsoft.ML.Runtime.Data.ConcatTransform.<>c__DisplayClass36_0`1.<MakeGetter>b__0(VBuffer`1& dst)
at Microsoft.ML.Runtime.Data.DataViewUtils.Splitter.Consolidator.<>c__DisplayClass4_1.<ConsolidateCore>b__2()
--- End of inner exception stack trace ---
at Microsoft.ML.Runtime.Data.DataViewUtils.Splitter.Batch.SetAll(OutPipe[] pipes)
at Microsoft.ML.Runtime.Data.DataViewUtils.Splitter.Cursor.MoveNextCore()
at Microsoft.ML.Runtime.Data.RootCursorBase.MoveNext()
at Microsoft.ML.Runtime.Training.TrainingCursorBase.MoveNext()
at Microsoft.ML.Runtime.FastTree.DataConverter.MemImpl.MakeBoundariesAndCheckLabels(Int64& missingInstances, Int64& totalInstances)
at Microsoft.ML.Runtime.FastTree.DataConverter.MemImpl..ctor(RoleMappedData data, IHost host, Double[][] binUpperBounds, Single maxLabel, Boolean dummy, Boolean noFlocks, PredictionKind kind, Int32[] categoricalFeatureIndices, Boolean categoricalSplit)
at Microsoft.ML.Runtime.FastTree.DataConverter.Create(RoleMappedData data, IHost host, Int32 maxBins, Single maxLabel, Boolean diskTranspose, Boolean noFlocks, Int32 minDocsPerLeaf, PredictionKind kind, IParallelTraining parallelTraining, Int32[] categoricalFeatureIndices, Boolean categoricalSplit)
at Microsoft.ML.Runtime.FastTree.ExamplesToFastTreeBins.FindBinsAndReturnDataset(RoleMappedData data, PredictionKind kind, IParallelTraining parallelTraining, Int32[] categoricalFeaturIndices, Boolean categoricalSplit)
at Microsoft.ML.Runtime.FastTree.FastTreeTrainerBase`2.ConvertData(RoleMappedData trainData)
at Microsoft.ML.Runtime.FastTree.FastTreeBinaryClassificationTrainer.Train(RoleMappedData trainData)
--- End of inner exception stack trace ---
at System.RuntimeMethodHandle.InvokeMethod(Object target, Object[] arguments, Signature sig, Boolean constructor)
at System.Reflection.RuntimeMethodInfo.UnsafeInvokeInternal(Object obj, Object[] parameters, Object[] arguments)
at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
at Microsoft.ML.Runtime.Data.TrainUtils.TrainCore(IHostEnvironment env, IChannel ch, RoleMappedData data, ITrainer trainer, String name, RoleMappedData validData, ICalibratorTrainer calibrator, Int32 maxCalibrationExamples, Nullable`1 cacheData, IPredictor inpPredictor)
at Microsoft.ML.Runtime.EntryPoints.LearnerEntryPointsUtils.Train[TArg,TOut](IHost host, TArg input, Func`1 createTrainer, Func`1 getLabel, Func`1 getWeight, Func`1 getGroup, Func`1 getName, Func`1 getCustom, ICalibratorTrainerFactory calibrator, Int32 maxCalibrationExamples)
at Microsoft.ML.Runtime.FastTree.FastTree.TrainBinary(IHostEnvironment env, Arguments input)
--- End of inner exception stack trace ---
at System.RuntimeMethodHandle.InvokeMethod(Object target, Object[] arguments, Signature sig, Boolean constructor)
at System.Reflection.RuntimeMethodInfo.UnsafeInvokeInternal(Object obj, Object[] parameters, Object[] arguments)
at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
at Microsoft.ML.Runtime.EntryPoints.EntryPointNode.Run()
at Microsoft.ML.Runtime.EntryPoints.EntryPointGraph.RunNode(EntryPointNode node)
at Microsoft.ML.Runtime.EntryPoints.JsonUtils.GraphRunner.RunAllNonMacros()
at Microsoft.ML.Runtime.EntryPoints.JsonUtils.GraphRunner.RunAll()
at Microsoft.ML.LearningPipeline.Train[TInput,TOutput]()
at <StartupCode$FSI_0010>.$FSI_0010.main@() in c:\Users\Isaac\Source\Repos\scratchpad\ml.fsx:line 25
The error you are getting is caused by the runtime not finding the "native" assemblies used by ML.NET. These assemblies live under the runtimes/win-x64/native folder of the NuGet package. When you use <PackageReference> in an MSBuild project, NuGet automatically pulls the correct native assets into your app's runtime directory.
We had a similar problem when using packages.config, because in that case NuGet doesn't automatically pull these native assets. Instead, we had to handle it manually in the NuGet package for projects using packages.config. See #165, which fixed this.
I don't have any real experience with the F# scripting tooling. How does it normally handle native (C++) assemblies contained in a NuGet package? Is there something we can/should do in the NuGet package? Or are native assemblies from a NuGet package simply not supported in F# scripting?
@eerhardt That helped, and I have it working now. There are a few ways of doing this - the issue is that the F# Interactive process (fsi.exe) can't see the native DLLs on any path / probing folder by default, so it can't find them. F# scripts can add a folder to the probe path with the #I directive, but that only works for .NET assemblies.
The most "fully featured" answer I found was here: http://christoph.ruegg.name/blog/loading-native-dlls-in-fsharp-interactive.html. By adding the path to the native DLLs before running the model, I got it to work, i.e.:
open System

// Folder in the NuGet package cache containing the native DLLs (CpuMathNative.dll etc.)
let nativeDirectory = @"C:\Users\Isaac\.nuget\packages\microsoft.ml\0.1.0\runtimes\win-x64\native"

// Append it to the process PATH so Windows native DLL probing can find them.
Environment.SetEnvironmentVariable("Path", Environment.GetEnvironmentVariable("Path") + ";" + nativeDirectory)
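The same trick, expressed as a shell sketch for anyone setting this up outside the script (the NuGet cache path is machine-specific and assumed here; PATH is what the Windows native DLL loader probes):

```shell
# Hypothetical location of the package's native folder in the NuGet cache.
NATIVE_DIR="$HOME/.nuget/packages/microsoft.ml/0.1.0/runtimes/win-x64/native"

# Append it to PATH before launching fsi.exe so CpuMathNative.dll can be found.
export PATH="$PATH:$NATIVE_DIR"
```

Either way, the key point is that the directory must be on the probe path before the first call that touches CpuMathNative.dll.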
Unfortunately this is not especially easy to figure out. I've seen a similar issue recently with CosmosDB using some native assemblies - they aren't particularly easy to work with.
Regarding NuGet etc. - the main NuGet tooling is, to be honest, a dead loss from the point of F# scripting - you need some form of msbuild project file to mark your dependencies, and there's no easy way to reference the assemblies anyway, which is one of the reasons why many F# developers use Paket instead. Paket already supports the ability to generate a "load dependencies" file for scripts (as seen in my earlier post here) but it doesn't know about native dlls. @forki do you think that this is something that could be added to Paket's generate load scripts functionality? Are native folders a "proper" thing in NuGet packages?
Are native folders a "proper" thing in NuGet packages?
Check out https://docs.microsoft.com/en-us/nuget/create-packages/supporting-multiple-target-frameworks#architecture-specific-folders for the docs on the runtimes
folder:
If you have architecture-specific assemblies, that is, separate assemblies that target ARM, x86, and x64, you must place them in a folder named runtimes within sub-folders named {platform}-{architecture}\lib{framework} or {platform}-{architecture}\native
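Concretely, for Microsoft.ML 0.1.0 that layout looks roughly like this (the managed lib path is taken from the load script above; CpuMathNative.dll is the native DLL named in the exception - any other native file names would be assumptions):

```
microsoft.ml/0.1.0/
├── lib/
│   └── netstandard2.0/
│       ├── Microsoft.ML.dll
│       └── Microsoft.ML.CpuMath.dll   (plus the other managed assemblies)
└── runtimes/
    └── win-x64/
        └── native/
            └── CpuMathNative.dll
```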
@eerhardt is there any way not to have to fall back to these native dlls?
Currently, no, the native assemblies are required.
However, we are exploring/thinking of other options here. The CpuMath assembly is written in C++ because it wants to use SIMD instructions, which were only available in C/C++. With .NET Core 2.1, these SIMD instructions are available through .NET APIs. We could replace the CpuMath assembly with .NET code that uses the same instructions. On .NET Framework, we would still require the native assembly in order to use the SIMD instructions, because this support is only for .NET Core.
Another option/thought here is to provide software fallback methods, which of course would be slower. But the advantage is that it would have wider reach where the SIMD instructions aren't available (for example on ARM processors).
Please tag this with "F#" (though it might not be specifically related to F#)
After doing #600 I think there is no F#-specific issue remaining here, see https://github.com/dotnet/machinelearning/issues/180 for the record issue
I'm trying out the sample shown here. However, whenever I try to train the model I get an error: "The size of input lines is not consistent". This is using the exact files that are specified in the tutorial so I'm not sure where I'm going wrong - any ideas?