dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Getting started with ML .NET with in-memory data is *painful*. #3037

Closed isaacabraham closed 5 years ago

isaacabraham commented 5 years ago

Working with the latest F# 4.5 and .NET Standard, I'm having huge problems trying to do even the most basic explorations with the latest ML .NET. Is there any example showing an absolutely basic scenario with an in-memory dataset and a simple ML algorithm?

I'm talking about something as simple as an example from e.g. scikit-learn: the following hello world is seven lines of code, and if you leave aside the data-loading side of things and just focus on the ML side - which is exactly what I want to do - it's the following three lines of code.

model = linear_model.LinearRegression()
model.fit(sqfeet, price)
model.predict(pd.DataFrame([1750]))

Let's try to port this to F#. Here's the source data as a simple F# list.

type Observation = { Area:int; Price:int }
let data =
    [ { Area = 1100; Price = 119000 }
      { Area = 1200; Price = 126000 }
      { Area = 1300; Price = 133000 }
      { Area = 1400; Price = 150000 }
      { Area = 1500; Price = 161000 }
      { Area = 1600; Price = 163000 }
      { Area = 1700; Price = 169000 }
      { Area = 1800; Price = 182000 }
      { Area = 1900; Price = 201000 }
      { Area = 2000; Price = 209000 } ]
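For reference, what the three scikit-learn lines compute on this data can be written out by hand. Here is a plain-Python sketch (no ML library involved, just ordinary least squares on the list above) showing the answer any correct linear-regression API should produce:

```python
# Ordinary least squares on the ten (Area, Price) pairs:
# slope = S_xy / S_xx, intercept = mean(y) - slope * mean(x).
areas  = [1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000]
prices = [119000, 126000, 133000, 150000, 161000,
          163000, 169000, 182000, 201000, 209000]

mx = sum(areas) / len(areas)
my = sum(prices) / len(prices)
sxx = sum((x - mx) ** 2 for x in areas)
sxy = sum((x - mx) * (y - my) for x, y in zip(areas, prices))

slope = sxy / sxx                       # ~99.33 dollars per extra square foot
intercept = my - slope * mx
predicted = intercept + slope * 1750    # price estimate for a 1750 sq ft house
```

That entire "pipeline" is four lines of arithmetic, which is the bar the issue is measuring ML .NET against.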

I've spent a good few hours fighting with the API to try and get some - any - results. I can't figure it out.

Issues I've encountered:

  1. Discoverability. The API is pretty large and not (in my personal opinion) easy to navigate. The namespaces need to be reworked so that the most obvious types are easy and obvious to get to.
  2. F# scripts are a pain because of the "occasional" reliance on native DLLs. However, you can work around this (or fall back to console applications if needed).
  3. Error messages are painful - I4, R4 etc. etc. - most people will not know what these are.
  4. Vector types - it seems that in order to "use" data with a trainer you need to "convert" data from e.g. float32 into a "vector" of float32. There's no explanation of what a "vector" in the context of ML .NET is, nor how to create one. Is it a .NET type? How do I create it? More than that, why as a developer should I have to care about it? I just want to give some of my data to the library as quickly and easily as possible.
  5. Why do I need to convert from ints or floats into float32s to do some machine learning? Again, this raises the barrier to entry. This is an internal implementation detail of ML .NET, it's nothing that should be forced on the developer.
  6. Why do I need the MLContext? What does it do? Does it store some "hidden state"? What? Why?

I managed to overcome some issues by randomly fumbling around with some existing samples until I got something that seemed to not error any more:

let estimator, mlContext =
    let mlContext = MLContext(Nullable 1)

    let trainer = mlContext.Regression.Trainers.StochasticDualCoordinateAscent(DefaultColumnNames.Label, "Features")

    EstimatorChain()
        .Append(mlContext.Transforms.Conversion.ConvertType(Transforms.TypeConvertingEstimator.ColumnOptions("ConvertedArea", DataKind.Single, "Area")))
        .Append(mlContext.Transforms.CopyColumns(DefaultColumnNames.Label, "ConvertedArea"))
        .Append(mlContext.Transforms.Conversion.ConvertType(Transforms.TypeConvertingEstimator.ColumnOptions("ConvertedPrice", DataKind.Single, "Price")))
        .Append(mlContext.Transforms.Concatenate("Features", "ConvertedPrice"))
        .AppendCacheCheckpoint(mlContext)
        .Append(trainer), mlContext

Next, I try to fit my data to this model:

let dv = mlContext.Data.LoadFromEnumerable(data)
let trained = estimator.Fit(dv)

This returns, but then calls to CreatePredictionEngine fail with the error System.ArgumentOutOfRangeException: Could not find input column 'Area':

type PredictionInput = { Price : int }
[<CLIMutable>]
type PredictionOutput = { Area : int }

let z = trained.CreatePredictionEngine<PredictionInput, PredictionOutput>(mlContext)

z.Predict { Price = 1000 }

To get to this stage has taken 4-8 hours of effort (including spending 30-45 minutes with your team personally :-)). I don't consider myself a complete beginner when it comes to .NET / F# / machine learning - if it takes this long to get up and running, most people will simply not bother and go to scikit-learn, Breeze or whatever else is out there.

I would love to see a simple API that looked something like this:

let model = Trainers.Regression.StochasticDualCoordinateAscent.fit(data, "Area", "Price")
let prediction = model.Predict(1234)

or

let model = Trainers.Regression.StochasticDualCoordinateAscent.fit(data, (fun d -> d.Area), (fun d -> d.Price))
let prediction = model.Predict(1234)

etc. etc.

I get that there are more complicated scenarios - but I feel that this library should really be starting from the lowest common denominator and working from there. At the moment it seems to be the other way around.

singlis commented 5 years ago

Hi @isaacabraham,

Thank you very much for your feedback; this is all useful information that is good to hear. I am sorry about the frustrations of working with ML.NET in F# - this is not the experience we want users to have when learning and using the library.

First, here is the code to unblock your scenario:

open System
open Microsoft.ML
open Microsoft.ML.Data

[<CLIMutable>]
type Prediction = {
    [<ColumnName("Score")>] Area:single
}

type Observation = { Area:int; Price:int}
let data =
    [ { Area = 1100; Price = 119000 }
      { Area = 1200; Price = 126000 }
      { Area = 1300; Price = 133000 }
      { Area = 1400; Price = 150000 }
      { Area = 1500; Price = 161000 }
      { Area = 1600; Price = 163000 }
      { Area = 1700; Price = 169000 }
      { Area = 1800; Price = 182000 }
      { Area = 1900; Price = 201000 }
      { Area = 2000; Price = 209000 } ]

[<EntryPoint>]
let main argv =
    let estimator, mlContext =
        let mlContext = MLContext()
        EstimatorChain()
           .Append(mlContext.Transforms.Conversion.ConvertType("Features", "Price", DataKind.Single))
           .Append(mlContext.Transforms.Conversion.ConvertType("Label", "Area", DataKind.Single))
           .Append(mlContext.Transforms.Concatenate("Features", "Features"))
           .AppendCacheCheckpoint(mlContext)
           .Append(mlContext.Regression.Trainers.StochasticDualCoordinateAscent("Label", "Features"))
           , mlContext

    let data1 = mlContext.Data.LoadFromEnumerable<Observation>(data)
    let transformer = estimator.Fit(data1)

    let predictor = mlContext.Model.CreatePredictionEngine(transformer)
    let prediction:Prediction = predictor.Predict({Area=0; Price = 209000})
    printf "Prediction results %f" prediction.Area
    0 // return an integer exit code

As you mentioned, the SDCA trainer expects the Label to be of type float and Features to be a vector of floats, so we have to convert. To get the vector of floats for Features, a Concatenate is used.
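The reshaping the pipeline performs can be pictured outside ML.NET. Below is a hedged Python sketch (not ML.NET code; the column roles follow the snippet above, and `to_trainer_layout` is a hypothetical helper name) of the layout an SDCA-style linear trainer consumes:

```python
# Each Observation row is reshaped into the layout the trainer expects:
# Label as a single float, Features as a vector of floats.
rows = [
    {"Area": 1100, "Price": 119000},
    {"Area": 1200, "Price": 126000},
]

def to_trainer_layout(row):
    label = float(row["Area"])         # ConvertType: Area -> single ("Label")
    features = [float(row["Price"])]   # ConvertType + Concatenate -> vector ("Features")
    return label, features

label, features = to_trainer_layout(rows[0])
```

In the real pipeline those two comment lines are the ConvertType and Concatenate transforms; issue #3060 is about making this conversion implicit.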

For prediction, we were able to use CreatePredictionEngine without the generic arguments, as F# was able to resolve the input and output types. From there we can call Predict and get the expected result. Note that Area needs to be provided as input even though it is what we are predicting; I filed this as something to change (#3063).

For the issues you mentioned: 1) Discoverability is definitely something that we have discussed and want to continue to improve. For v1.0, the decision was to use MLContext as the point of discoverability (similar to HttpContext or DbContext in .NET, see #1098). In addition, we have consolidated namespaces across the code base. Are there specific areas that you feel could be improved, or features that were not discoverable?

2) I did not quite follow the issue you mention here. We do have native libraries that need to be referenced, but that reference should happen automatically through NuGet and the build. Can you clarify the pain points?

3) Error messages should not be I4, R4, etc. We are renaming that now for v1.0 (#2046). These types are internal and should not be exposed to the user.

4 & 5) The fact that you have to convert and have an understanding of what the trainer is expecting (in addition to our vector type) is painful. Ideally the conversion should happen behind the scenes and not require the user to have knowledge about what a trainer is expecting. Having an automatic conversion would simplify the pipeline and get closer to having a simpler API. I filed issue #3060 to address this.

6) Besides MLContext being the point of discoverability for our APIs, it also stores state for the current session. This can be internal state, such as the random seed that is synchronized across trainers and transformers, as well as public state, like the currently trained model. It also provides our logging infrastructure.
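The seed-sharing role of MLContext can be illustrated with a small analogy. This is not ML.NET code, just a sketch of the pattern: one context object owns a seeded RNG that every component draws from, so fixing the seed makes a whole run reproducible (which is what `MLContext(Nullable 1)` in the original snippet was doing):

```python
import random

# Analogy only: a "context" holding one seeded RNG shared by all
# components, so identical seeds give identical "random" behaviour.
class Context:
    def __init__(self, seed=None):
        self.rng = random.Random(seed)

    def shuffle(self, items):
        items = list(items)
        self.rng.shuffle(items)   # e.g. data shuffling before training
        return items

a = Context(seed=1)
b = Context(seed=1)
assert a.shuffle(range(10)) == b.shuffle(range(10))
```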

In addition to making the API simpler, there is the matter that it took many hours to get to a solution. Part of this can be addressed through examples (which I believe you found), located here: https://github.com/dotnet/machinelearning-samples

These do contain F# examples, but we also have simpler samples that are in C# only. These end up on the docs site: https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml?view=ml-dotnet

Would it be helpful if these examples were also written in F#?

isaacabraham commented 5 years ago

Hi @singlis . Thanks for the really detailed reply. Let me address all your points:

  1. Yes, my original point wasn't very specific :-) I'm talking about the overall number of steps required to get a basic pipeline up and running. You need to know about converters, EstimatorChains, DataViews, MLContext, transforms, PredictionEngines, etc. You shouldn't be exposing this many types - or at least not forcing the user to get involved with so many of them. My personal view is that you should focus the core API on just machine learning: don't try to own the data-shaping pipeline, or at least decouple the two so that people can plug data straight into a trainer as easily as possible.

As another example of "magic", in your solution (and I eventually figured this out myself), you call the prediction field "Score". There's no way to know this from the type system or code - it's just "secret" knowledge that isn't clearly explained. At worst you should emit an error message when Predict is called without this field; better yet, encode it into the type system somehow.

  2. I notice that the fixed sample (thanks for providing that!) is a console application. In short: scripts don't have great support for native DLLs. Long story: scripts necessitate manually referencing individual assemblies rather than NuGet packages directly. The official NuGet client has basically no support for scripts, although Paket can generate a "references" script that references all the assemblies in your NuGet packages. However, neither of these works with native assemblies, so you have to manually add the assembly's folder to the PATH environment variable.

  3. Great - thanks.

  4. Again, I'll just touch on the DataView element: I think it's painful. All the transformations you do inside a DataView are invisible; you can't easily see the impact of a specific transformation or modification, and it's (IMHO) completely non-idiomatic for a .NET developer - they'll typically use something like LINQ or F#'s collection libraries to work with their data. DataView actually seems more like a data-frame library - why not consider pulling it out into a standalone data-frames package with optional integration into ML .NET, rather than tightly coupling the two?

Regarding the samples - indeed, I went through several of them, both C# and F# (I come from a C# background so no problem there). But I couldn't find an example that went through the absolute bare minimum, as I wanted, and identified each individual step. I went through all the examples in the samples repo, e.g. Iris, Taxi, etc. - but it was basically hit and miss until I randomly stumbled upon the right combination of transforms to get all the way through.

isaacabraham commented 5 years ago

Thinking about it, this issue isn't so much about F# as about getting up and running with the API - people coming from C# will, I believe, have mostly the same issues as I've had. With that, I'm changing the title of this issue.

singlis commented 5 years ago

Thank you for the clarification @isaacabraham -

We are always working to improve the API. Removing MLContext, IDataView, etc. is not something that can happen immediately, but your feedback can influence future changes. There is simplicity in being able to pass data directly to the trainer, but I also understand the value of having an explicit pipeline.

In the meantime, there are immediate actions we can take to ease the learning curve of the API and hopefully reduce the time it takes to get something working. One of these is better documentation and more examples, so I have created the following issues:

https://github.com/dotnet/machinelearning/issues/3127 - to address knowing the input/output types of a transformer. This covers the issue you mentioned with Score: it should not be secret knowledge, and it deserves a proper explanation.

https://github.com/dotnet/machinelearning/issues/3100 - to address missing F# examples. I am going to set up an initial folder structure where we can add F# examples. You are more than welcome to contribute; ideally I am trying to mirror what we have now in C#.

There are additional issues being filed to help with the structure of the documentation: knowing which API to use, how it works, and, if it's a trainer, what type of trainer, and so on.

As for IDataView - IDataView is the basis for how we exchange data within our pipeline. Since it is an integral part of ML.NET, you will not be able to learn ML.NET without learning about IDataView; it would be like learning C# without learning about IEnumerable. We have extracted IDataView into its own assembly, Microsoft.ML.DataView, which has no dependencies on ML.NET. The thought here is that it can be used for purposes outside of ML.NET. For example, a graphing application could take in an IDataView and plot a chart - the data could come from ML.NET or from some other library that implements IDataView.

isaacabraham commented 5 years ago

Although more F# is always a good thing, in this case I don't think it will necessarily solve the issue - there are quite a few F# samples that Don (and others, including myself) have contributed, and most people who write F# already know C# and can map between the two.

The issues I've been encountering are shared across both C# and F# (and VB .NET) - it's the API itself that I think is the problem.

I get your point regarding IDataView - having a data-frame library on .NET is a good thing (although there are already a couple out there, such as Deedle). If you're fixed on making this a mandatory element of ML .NET - i.e. in order to use ML .NET, people have to know about IDataView - then I think you should ensure that things fall into the pit of success. By this I mean users not having to refer to reams of documentation to learn the API, but the API itself being self-explanatory. Currently there's simply (again, IMHO) too much knowledge required of the developer, rather than the API guiding the user into doing the right thing through things like types (they can be really handy in cases like this :-)).

As an alternative, look at the scikit-learn example for what I mean by a simple API that is obvious - you can see from the example what is happening, there's no need for comments, and it's just a few lines.

Hopefully this isn't coming across as a rant, but rather as constructive feedback. I'm really excited by the idea of having a first-class ML library on .NET, and although it's not quite there yet I'm hopeful that ML .NET will get there soon.

wschin commented 5 years ago

Minor note: looks like this thread and https://github.com/dotnet/machinelearning/issues/2726 reach the same conclusion about documentation.

wschin commented 5 years ago

Let me close this issue, as most of our APIs have in-memory samples at https://github.com/dotnet/machinelearning/tree/master/docs/samples/Microsoft.ML.Samples/Dynamic.

Feel free to reopen. :)