dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.

https://dot.net/ml

MIT License

9.05k stars 1.89k forks source link

Text loader v.s in-memory data structure in API reference samples #2726

Closed wschin closed 5 years ago

wschin commented 5 years ago

We often starts our trainer examples with text loader but recently I feel loading text into IDataView is not directly related to the actual training procedure. If we use

/// <summary>
/// Example with one binary label and 10 feature values.
/// </summary>
public class BinaryLabelFloatFeatureVectorSample
{
    public bool Label;

    [VectorType(_simpleBinaryClassSampleFeatureLength)]
    public float[] Features;
}

as our in-memory example, we can create more flexible examples like scikit-learn ones (where data matrix is float matrix) and make ML.NET's learning curve smoother (because users don't need to learn text loader, the loaded data, and trainer at the same time).

cc @shmoradims, @rogancarr, @sfilipi, @shauheen

2780 shows a scikit-learn-style example for ML.NET. It is

Self-contained --- To understand it, user doesn't need to look another document or use Visual Studio to search for those used functions. Notice that we can't rely on Visual Studio because not everyone is using it (1st partry and 3rd party experiences should the same!).
End-to-end to C# developers --- because it trains a model over a C# List and get the prediction back as a C# List (The two ends are not IDataView so user doesn't need to learn IDataView to play with that API).
Independent to external packages --- We shouldn’t expect a user who needs doc knows SamplesUtils.
Production-friendly --- Doing prediction with C# data structure is included. That's how a trained model will be used in production.

shmoradims commented 5 years ago

I chatted with Wei-Sheng offline about this. I can summarize the cons and pros as follows:

Pros of in-memory (Cons of Textloader): Simpler data pipeline and more focused on the actual trainer that's being documented. It doesn't involve downloading, TextLoader, featurization, etc., so there's less friction for users to understand the sample and adapt it to their scenario.
Cons of in-memory (Pros of Textloader): More meaningful data (Adult UCI vs randomly generated data)

Wei-Sheng has strong preference for adopting in-memory data for our samples. I'm ambivalent. @rogancarr @sfilipi what do you think?

bartczernicki commented 5 years ago

I will add that one of the ways I pitch ML.NET to customers is that it allows you to put models directly into memory right next to your existing business logic/rules engines (Web APIs, ASP.NET MVC/Web Forms, Win Forms etc.). I realize this suggestion was for trainers and not inference, but once I amplify to developers (keep in mind they are not data scientists) you don't need to load new data and you probably already have it in-memory/in-process already it's one of the things I had to explain more than once.

wschin commented 5 years ago

@bartczernicki, Thanks a lot for your input. Looks like in-memory scenario is more closer to C# developers, right?

rogancarr commented 5 years ago

Let's do both kinds of examples, but I think we should define what kinds of samples that we're building.

API Docs: The samples in docs/samples/Microsoft.ML.Samples/ appear on https://docs.microsoft.com in-lined into the API documentation (e.g. for GAMs). These samples should focus on the learner and do as little data manipulation as possible. We have been moving the data loading into helper functions like this
```
var data = LoadHousingRegressionDataset(mlContext);
```
so that we don't bog down the documentation on how a learner works and what produces with tons of lines describing data loading. I am hesitant to have these focus on more than just the API in question.
Samples: We have a sample repository, https://github.com/dotnet/machinelearning-samples, that has long-form samples showing how to use ML.NET for long-form, end-to-end samples, and are pulled into the samples portion of https://docs.microsoft.com. This repository is where we should focus on building out copy-and-paste examples that people can use for file-based and memory-based training. I'd focus energy here if we want to give a focus for in-memory training and push for documentation similar to sklearn.

wschin commented 5 years ago

@rogancarr, the GAM example has too many thing not directly related to the trainer. Moving things into a function doesn't really increase the flexibility. Assume that I already have a data matrix and label vector like those in many scikit-learn trainer examples:

X = [[0], [1], [2], [3]]
Y = [0, 1, 2, 3]

What is the gap between GAM example and the training pipeline I want? I'd imagine I need to

Look into the data structure returned by
```
var data = LoadHousingRegressionDataset(mlContext);
```
If I were smart enough, I will clone ML.NET, open Visual Studio, search for the definition of LoadHousingRegressionDataset, go through all lines inside, realize all I need is just an IDataView, and finally define my own class like this
```
    /// <summary>
    /// Example with one binary label and 10 feature values.
    /// </summary>
    public class BinaryLabelFloatFeatureVectorSample
    {
        public bool Label;

        [VectorType(_simpleBinaryClassSampleFeatureLength)] // _simpleBinaryClassSampleFeatureLength =  10, for example.
        public float[] Features;
    }
```
Oh wait, how should I know I need to define my own class? Why should I have Visual Studio? Note that finding an example class with a vector field (i.e., public float[] Features here) was a nightmare back to months ago (I guess we have a better documentation now but I still want to emphasize that any small hole here could block a new user forever).
Then, I need to realize that I need to use this undocumented (I call everything without an example undocumented) function.

Those points mentioned above might explain why scikit-learn trainer examples always do things as simple as

X = [[0], [1], [2], [3]]
Y = [0, 1, 2, 3]
clf = svm.SVC(gamma='scale', decision_function_shape='ovo')
clf.fit(X, Y)

We have many end-to-end examples. However, that end-to-end doesn't really mean end-to-end if we define an end is something easily-understandable to users.

TomFinley commented 5 years ago

Without interfering too much in matters of samples and documentation, my own sympathies at least at first glance are strongly with @wschin.

As @bartczernicki points out, it is inevitable even from the first example that you need to introduce the subject of in memory consumption anyway, since that is the most plausible way the trained models will be consumed. When doing predictions, you're not going to be consuming from a file. So, you need to have that.

Now IDataLoader is an essential thing to cover at some point, but I agree that it could wait till "lesson 2" or "lesson 4" or somesuch. In lesson 1 we will have our hands full enough with the "trinity" of IEstimator/ITransformer/IDataView (#581), I think.

Beyond that core and necessary part, I think about what I have to explain. I think about what I have to explain in @wschin's world: I can say, "hey look, here's an array of length 150 of IrisExample with these five fields, and you can convert these instances into one of our IDataView structures like so" and I think a C# developer will just sort of get this. But to say, "look man, we start with this file, and it has tabs, but we can sort of load it by setting the text loader like so to coinfigure it, but just be aware it's not actually in memory because these structures are lazily evaluated so it loads it when needed in a streaming fashion, but you can control that with these additional things," that's a harder conversation. That was a conversation we had to have when this code was exposed as a command-line tool, but now we're writing an API, and this is an opportunity to have an easier conversation. At least, to start.

Otherwise we inevitably have to get into conversations about in-memory vs. out-of-memory structures, the implications of lazily-evaluated structures like IDataView w.r.t. this, the whole interface for parsing out files, and all this all at once, which I think maybe might confuse people. And is not in any event how I think a C# developer thinks about their data.

(I might at least mention in @wschin 's world that this is a simple example and we have other mechanisms for handling out-of-core data -- imagining myself as an outside reader, I would get suspicious myself if I did not receive some assurance on this point --, but I wouldn't hit people over the head with the details of how that is done right off the bat.)

Again, not trying to interfere in matters of documentation and samples. Just registering my own thoughts on this subject, feel free to ignore.

Ivanidzo4ka commented 5 years ago

I would prefer following way: If you have example of data transformation: use in memory format. If you have example of trainer: use data reading.

My reasoning is following:

If I get to the learner I probably already build some pipeline for my data. So I would prefer to show user how to specify certain options and what columns learner produce, how to make prediction, how to get metrics out of it. Basically focus on learner rather than pipeline building. I like python snippets by @wschin , but C# is not so expressive, and we will end up with lot of extra code in our samples which would take away whole point of sample.

Data transformation: I think it's necessary to have in memory examples since we transform data. So we should show before and after stages. Which is hard with data reading.At least in cases where we work with data itself rather than schema of data, for data schema I would prefer lesser footprint

Maybe bad, but still example: https://docs.microsoft.com/en-us/dotnet/api/system.drawing.image.pixelformat?view=netframework-4.7.2 No one constructing bitmap in memory, they just read it from somewhere. Or this one: https://docs.microsoft.com/en-us/dotnet/api/system.speech.recognition.grammar.speechrecognized?view=netframework-4.7.2 No one even cares about data, they just trying to show what is interface and how you can use it.

@JRAlexander has probably thousand times more expertise than we are, so can be nice to add him to this discussion.

wschin commented 5 years ago

A trainer is definitely not a thing should start with data reading. Why should a user learn an IDataView type systems and our text format before doing training? Training a linear model is not a ML.NET specific thing; it should be easily doable to everyone who only knows C# fundamentals. Thinking about scikit-learn examples, do they ask user to learn numpy sparse data structure before training a SVM?

zeahmed commented 5 years ago

Thanks @wschin for bring this up.

I think this is the matter of showing ML.NET's preferred way of importing data into pipeline. Either it is using C# structure (in-memory) or from file (using text loader).

If all of our samples contain the in-memory streaming then our message to user is that the preferred way to build model in ML.NET is using in-memory streams.

Scikit example here does not make much sense because scikit does not provide data loading support (use numpy). So, for them data loading is not concern at all. If we are going to be in the same boat then there is no-doubt about using in-memory streaming.

There is also a question of how many samples will be sufficient for the training of the transform or learner. If all of learners and trainers can be trained on just 5-10 examples then it is perfect otherwise creation of data will fill up the sample.

wschin commented 5 years ago

C# API suppose to primarily work with C# data structures, not files! Scikit-learn example is an example for trainers, why it needs to load data? Training is just training; loading is just loading. Why do we need to mix them in a training API's example?

shmoradims commented 5 years ago

@wschin I'm not sure if this thread will converge because there are multiple trade-offs at play: 1) user experience, 2) size of the sample code, 3) self-containedness of sample code (removing SampleUtils), 4) data size (many of our trainers' defaults are tuned for large data. If we use small data we have to change those parameters, which could give the impression to the users that they have to specify all parameters, as opposed to just use defaults as starting point).

My suggestion is that you write up your ideal sample code for one trainer. Then the team can review the proposal. Having actual code would be simpler.

We can repeat the same for transforms.

wschin commented 5 years ago

Here is one ideal example in my mind (mentioned in the first post of this thread) for mlContext.AnomalyDetection.Trainers.RandomizedPca(featureColumnName: nameof(DataPoint.Features), rank: 1, center: false).

(1) Full version. From training to prediction with detailed comments.

    public static class RandomizedPcaSample
    {
        public static void Example()
        {
            // Create a new context for ML.NET operations. It can be used for exception tracking and logging, 
            // as a catalog of available operations and as the source of randomness.
            // Setting the seed to a fixed number in this example to make outputs deterministic.
            var mlContext = new MLContext(seed: 0);

            // Training data.
            var samples = new List<DataPoint>()
            {
                new DataPoint(){ Features= new float[3] {1, 0, 0} },
                new DataPoint(){ Features= new float[3] {0, 2, 1} },
                new DataPoint(){ Features= new float[3] {1, 2, 3} },
                new DataPoint(){ Features= new float[3] {0, 1, 0} },
                new DataPoint(){ Features= new float[3] {0, 2, 1} },
                new DataPoint(){ Features= new float[3] {-100, 50, -100} }
            };

            // Convert the List<DataPoint> to IDataView, a consumble format to ML.NET functions.
            var data = mlContext.Data.LoadFromEnumerable(samples);

            // Create an anomaly detector. Its underlying algorithm is randomized PCA.
            var pipeline = mlContext.AnomalyDetection.Trainers.RandomizedPca(featureColumnName: nameof(DataPoint.Features), rank: 1, center: false);

            // Train the anomaly detector.
            var model = pipeline.Fit(data);

            // Apply the trained model on the training data.
            var transformed = model.Transform(data);

            // Read ML.NET predictions into IEnumerable<Result>.
            var results = mlContext.Data.CreateEnumerable<Result>(transformed, reuseRowObject: false).ToList();

            // Let's go through all predictions.
            // Lines printed out should be
            //   The 0 - th example with features[1, 0, 0] is an inlier with a score of being inlier 0.7453707
            //   The 1 - th example with features[0, 2, 1] is an inlier with a score of being inlier 0.9999999
            //   The 2 - th example with features[1, 2, 3] is an inlier with a score of being inlier 0.8450122
            //   The 3 - th example with features[0, 1, 0] is an inlier with a score of being inlier 0.9428905
            //   The 4 - th example with features[0, 2, 1] is an inlier with a score of being inlier 0.9999999
            //   The 5 - th example with features[-100, 50, -100] is an outlier with a score of being inlier 0
            for (int i = 0; i < samples.Count; ++i)
            {
                // The i-th example's prediction result.
                var result = results[i];

                // The i-th example's feature vector in text format.
                var featuresInText = string.Join(',', samples[i].Features);

                if (result.PredictedLabel)
                    // The i-th sample is predicted as an inlier.
                    Console.WriteLine("The {0}-th example with features [{1}] is an inlier with a score of being inlier {2}",
                        i, featuresInText, result.Score);
                else
                    // The i-th sample is predicted as an outlier.
                    Console.WriteLine("The {0}-th example with features [{1}] is an outlier with a score of being inlier {2}",
                        i, featuresInText, result.Score);
            }
        }

        // Example with 3 feature values. A training data set is a collection of such examples.
        private class DataPoint
        {
            [VectorType(3)]
            public float[] Features { get; set; }
        }

        // Class used to capture prediction of DataPoint.
        private class Result
        {
            // Outlier gets false while inlier has true.
            public bool PredictedLabel { get; set; }
            // Outlier gets smaller score.
            public float Score { get; set; }
        }
    }

(2) Short version.

    public static class RandomizedPcaSample
    {
        public static void Example()
        {
            var mlContext = new MLContext(seed: 0);
            // Define training set.
            var samples = new List<DataPoint>()
            {
                new DataPoint(){ Features= new float[3] {1, 0, 0} },
                new DataPoint(){ Features= new float[3] {0, 2, 1} },
                new DataPoint(){ Features= new float[3] {1, 2, 3} },
                new DataPoint(){ Features= new float[3] {0, 1, 0} },
                new DataPoint(){ Features= new float[3] {0, 2, 1} },
                new DataPoint(){ Features= new float[3] {-100, 50, -100} }
            };
            // Convert training data to IDataView, the general data type used in ML.NET.
            var data = mlContext.Data.LoadFromEnumerable(samples);
            // Define trainer.
            var pipeline = mlContext.AnomalyDetection.Trainers.RandomizedPca(featureColumnName: nameof(DataPoint.Features), rank: 1, center: false);
            // Train the model.
            var model = pipeline.Fit(data);
        }

        private class DataPoint
        {
            [VectorType(3)]
            public float[] Features { get; set; }
        }
    }

rogancarr commented 5 years ago

My main concern is not whether we use in-memory or file-based data loading in samples. My main concern is what the format for samples in this repository are.

Here are my beliefs:

This repository exists to populate documentation pages. For an example, see the FCC docs.
This repository will not be cloned to use samples.
The primary copy-paste mode for these samples is to copy-paste the specific elements from the docs page (e.g. for FCC, it's the corresponding lines for calling FCC) and not full programs.
End-to-end samples, getting started samples, and clone-able, copy-paste-able samples shall live in the machinelearning-samples repository.

@wschin, as you point out, a lot of these samples, like GAMs and FCC, are way too verbose and don't fit this scheme. These docs are a work in progress and are currently being refactored. I believe the solution is to move the verbose examples into the Samples repository and make the samples in this project smaller and more succinct.

So what I object to here is to adding tons of boilerplate code to the https://docs.microsoft.com pages.

In terms of your samples, maybe what we want is this:

 public static class RandomizedPca
    {
        public static void Example()
        {
            // Create a new context for ML.NET operations. It can be used for exception tracking and logging, 
            // as a catalog of available operations and as the source of randomness.
            // Setting the seed to a fixed number in this example to make outputs deterministic.
            var mlContext = new MLContext(seed: 0);

            // Convert the List<DataPoint> to IDataView, a consumble format to ML.NET functions.
            var data = SampleUtils.LoadFakeData();

            // Create an anomaly detector. Its underlying algorithm is randomized PCA.
            var pipeline = mlContext.AnomalyDetection.Trainers.RandomizedPca(featureColumnName: nameof(DataPoint.Features), rank: 1, center: false);

            // Train the anomaly detector.
            var model = pipeline.Fit(data);

            // Apply the trained model on the training data.
            var transformed = model.Transform(data);

            // Read ML.NET predictions into IEnumerable<Result>.
            var results = mlContext.Data.CreateEnumerable<Result>(transformed, reuseRowObject: false).ToList();

            // Let's go through all predictions.
            for (int i = 0; i < samples.Count; ++i)
            {
                // The i-th example's prediction result.
                var result = results[i];

                // The i-th example's feature vector in text format.
                var featuresInText = string.Join(',', samples[i].Features);

                if (result.PredictedLabel)
                    // The i-th sample is predicted as an inlier.
                    Console.WriteLine("The {0}-th example with features [{1}] is an inlier with a score of being inlier {2}",
                        i, featuresInText, result.Score);
                else
                    // The i-th sample is predicted as an outlier.
                    Console.WriteLine("The {0}-th example with features [{1}] is an outlier with a score of being inlier {2}",
                        i, featuresInText, result.Score);
            }
            // Expected output:
            //   The 0 - th example with features[1, 0, 0] is an inlier with a score of being inlier 0.7453707
            //   The 1 - th example with features[0, 2, 1] is an inlier with a score of being inlier 0.9999999
            //   The 2 - th example with features[1, 2, 3] is an inlier with a score of being inlier 0.8450122
            //   The 3 - th example with features[0, 1, 0] is an inlier with a score of being inlier 0.9428905
            //   The 4 - th example with features[0, 2, 1] is an inlier with a score of being inlier 0.9999999
            //   The 5 - th example with features[-100, 50, -100] is an outlier with a score of being inlier 0
        }
}

I might be off-base here though on what we want in the various repositories and on the docs pages. I'd like to hear from @CESARDELATORRE, @JRAlexander, and @eerhardt about their expectations for the documentation vs. samples.

wschin commented 5 years ago

@rogancarr, we can't hide the definition of DataPoint and its definition. Without them, it's hard to user to know how to apply that example to data point with a different number of features (say 10). I don't think user will easily know they need to change

        private class DataPoint
        {
            [VectorType(3)]
            public float[] Features { get; set; }
        }

        private class DataPoint
        {
            [VectorType(10)]
            public float[] Features { get; set; }
        }

This was a problem bothering my for hours. I don't want users go through this again.

bartczernicki commented 5 years ago

Someone coming in from the outside of this project, I have used 4 primary resources to learn: Documentation, ML.NET Samples, ML.NET Cookbook, Internal tests/samples (dynamic vs static) Note: blogs were useful before API changes, now most of the code doesn't compile on the blog sites because of the release cadence

I agreee with @rogancarr about keeping doc samples succint. Something like this would be nice: 1) Doc site - Succint examples conveying core ML.NET functionality 2) ML.NET Cookbook - Snippets of the proper way of performing X with ML.NET with caveat/gotcha notes for the majority of tasks that most people will do. For example, when loading a model I convert it to a TranformerChain rather than ITransformer because otherwise you can't do any ML governance on the loaded model (without .NET reflection). 3) ML.NET Samples - End to end samples of a certain task (i.e. doing binary classification), organized by scenarios. I.e. (Web API for inference, Azure Functions distributed inference, parallelized training jobs etc.) I think this is where our samples need to evolve to have some more advanced production-type scenarios.
4 ) Internal tests/samples - advanced code samples/some underlying API explanations/under the hood

I do agree with @wschin some of the examples/samples seem to skip pretty important caveats and without getting into the weeds of the API (which should never happen) it's hard to tell what is happening. For example, why can't some models be saved as ONNX/why can't some models return weights/why can't some models do PFI/why is a simple ML 101 construct like a ConfusionMatrix seemingly gone (previous API had it)/why do some algorithms not have probabilities. I get that you don't want to have basic examples with scary long Interface casts since this is meant to be a fluent API, but somethings (as someone who does AI daily) shouldn't be this hard to do.

wschin commented 5 years ago

Someone coming in from the outside of this project, I have used 4 primary resources to learn: Documentation, ML.NET Samples, ML.NET Cookbook, Internal tests/samples (dynamic vs static) Note: blogs were useful before API changes, now most of the code doesn't compile on the blog sites because of the release cadence

I agreee with @rogancarr about keeping doc samples succint. Something like this would be nice:
1. Doc site - Succint examples conveying core ML.NET functionality

2. ML.NET Cookbook - Snippets of the proper way of performing X with ML.NET with caveat/gotcha notes for the majority of tasks that most people will do.  For example, when loading a model I convert it to a TranformerChain rather than ITransformer because otherwise you can't do any ML governance on the loaded model (without .NET reflection).

3. ML.NET Samples - End to end samples of a certain task (i.e. doing binary classification), organized by scenarios.  I.e. (Web API for inference, Azure Functions distributed inference, parallelized training jobs etc.)  I think this is where our samples need to evolve to have some more advanced production-type scenarios.
   4 ) Internal tests/samples - advanced code samples/some underlying API explanations/under the hood
Sounds good to me but we also need a precise definition of succinct. Let's consider what will happen if a C# developer just starts doing binary classification with C#? I believe a normal pattern could be

Google for c sharp binary classification. All the pages show on my screen are super long examples such as this and this.
After understanding all concepts of data preparation, file loading, IDataView, etcs. They will finally start building their own pipeline.

I could imagine those tasks requiring users to use Visual Studio to do some experiments and exploration, so ML.NET becomes super windows-friendly and Linux users will have different experiences than Windows users. Is this gap a cross-platform machine learning library really wants? Fortunately, training a binary classifier has been standardized in textbooks, on wiki, and so on --- it's just a function for finding map from a real-valued feature vector to a binary label. What's the closest thing of feature vector in C# that every C# developer is familiar with? It's a float[]. What is that for binary label? It's bool. So a trainer API's documentation will also contain a definition of standard training data (aka DataPoint below)

    public static class BinaryClassificationSample
    {
        public static void Example()
        {
            var mlContext = new MLContext(seed: 0);
            // Define training set.
            var samples = new List<DataPoint>()
            {
                new DataPoint(){ Label = 0, Features = new float[3] {1, 1, 0} },
                new DataPoint(){ Label = 0, Features = new float[3] {0, 2, 1} },
                new DataPoint(){ Label = 1, Features = new float[3] {-1, -2, -3} },
            };
            // Convert training data to IDataView, the general data type used in ML.NET.
            var data = mlContext.Data.LoadFromEnumerable(samples);
            // Define trainer.
            var pipeline = mlContext.BinaryClassification.Trainers.FastTree(featureColumnName: nameof(DataPoint.Features));
            // Train the model.
            var model = pipeline.Fit(data);
        }

        private class DataPoint
        {
            bool Label { get; set; }
            [VectorType(3)]
            public float[] Features { get; set; }
        }
    }

This way we align the concept everyone learn in school with its C# implementation. It's platform-natural, self-contained, and general-enough to be extended to other cases.

I do agree with @wschin some of the examples/samples seem to skip pretty important caveats and without getting into the weeds of the API (which should never happen) it's hard to tell what is happening. For example, why can't some models be saved as ONNX/why can't some models return weights/why can't some models do PFI/why is a simple ML 101 construct like a ConfusionMatrix seemingly gone (previous API had it)/why do some algorithms not have probabilities. I get that you don't want to have basic examples with scary long Interface casts since this is meant to be a fluent API, but somethings (as someone who does AI daily) shouldn't be this hard to do.

ONNX thing is not standardized and you can't find it in textbooks, I guess we may not have a detailed example for it.

Yes, I don't think we can explain everything. To avoid explaining everything happening, the start and end of an API example should be something we don't need to explain so we can focus on the targeted API itself. The thing I want to have is, for training APIs, starting a training process with something every C# developer knows and ends up with something every C# developers knows. I am quite confident that IDataView is definitely not a thing every C# developer knows, so we need to have

        private class DataPoint
        {
            bool Label { get; set; }
            [VectorType(3)]
            public float[] Features { get; set; }
        }

JRAlexander commented 5 years ago

Or you could make some of these examples into how-tos in the ML.NET Guide on Docs like we did with PFI (they are rendered as part of the build process): https://docs.microsoft.com/en-us/dotnet/machine-learning/how-to-guides/determine-global-feature-importance-in-model These are also translated into several languages such as chinese: https://docs.microsoft.com/zh-cn/dotnet/machine-learning/how-to-guides/determine-global-feature-importance-in-model Russian: https://docs.microsoft.com/ru-ru/dotnet/machine-learning/how-to-guides/determine-global-feature-importance-in-model and many more. Something to consider.

wschin commented 5 years ago

@JRAlexander, we are deciding what template of API (neither scenario example nor machine learning tutorial) example should look like. For example, this is the trainer API of gradient boosting decision tree to binary classification:

        public static FastTreeBinaryClassificationTrainer FastTree(this BinaryClassificationCatalog.BinaryClassificationTrainers catalog,
            string labelColumnName = DefaultColumnNames.Label,
            string featureColumnName = DefaultColumnNames.Features,
            string exampleWeightColumnName = null,
            int numberOfLeaves = Defaults.NumberOfLeaves,
            int numberOfTrees = Defaults.NumberOfTrees,
            int minimumExampleCountPerLeaf = Defaults.MinimumExampleCountPerLeaf,
            double learningRate = Defaults.LearningRate)
        {
            Contracts.CheckValue(catalog, nameof(catalog));
            var env = CatalogUtils.GetEnvironment(catalog);
            return new FastTreeBinaryClassificationTrainer(env, labelColumnName, featureColumnName, exampleWeightColumnName, numberOfLeaves, numberOfTrees, minimumExampleCountPerLeaf, learningRate);
        }

What should its example look like? A core goal of ML.NET is democratizing machine learning. Do we want a user only knows

Fundamentals of C# syntax
Entry-level knowledge about machine learning (at least took one related course in school)

starting doing binary classification immediately after seeing the API document of training a binary classifier? My answer is absolute yes and my example is built based on this assumption. Let me copy-and-paste my proposed template here again:

    public static class BinaryClassificationSample
    {
        public static void Example()
        {
            var mlContext = new MLContext(seed: 0);
            // Define training set.
            var samples = new List<DataPoint>()
            {
                new DataPoint(){ Label = 0, Features = new float[3] {1, 1, 0} },
                new DataPoint(){ Label = 0, Features = new float[3] {0, 2, 1} },
                new DataPoint(){ Label = 1, Features = new float[3] {-1, -2, -3} },
            };
            // Convert training data to IDataView, the general data type used in ML.NET.
            var data = mlContext.Data.LoadFromEnumerable(samples);
            // Define trainer.
            var pipeline = mlContext.BinaryClassification.Trainers.FastTree(featureColumnName: nameof(DataPoint.Features));
            // Train the model.
            var model = pipeline.Fit(data);
        }

        private class DataPoint
        {
            bool Label { get; set; }
            [VectorType(3)]
            public float[] Features { get; set; }
        }
    }

shmoradims commented 5 years ago

I summarized our design space below:

Dimension	Option A	Option B
Usability/ Readability/ Flexibility	Self-Contained: No dependency on SampleUtils. User doesn't need to open any other file or class to fully understand the sample. All boilerplate code is included.	Hide-Boilerplate: Hide all boilerplate code for creating and loading data. User can clearly see how a particular API (trainer/transform) is used, but need to look up other files/classes to understand data loading and manipulation.
Data-source	In-memory: Fake data is created with C# lists/arrays, then converted to IDataView.	Text-loader: Real life datasets are loaded using text loader, which also requires featurization pipeline.
Scope	Minimal: Show only how to call the API.	Verbose: Show things like evaluation metrics, predictions, etc.

We should decide both for trainers and transforms. Our current trainer samples are hide-boilerplate, text-loader, verbose. Wei-Sheng is suggesting self-contained, in-memory, (any).

Let's finalize this over a meeting.

wschin commented 5 years ago

It will be easier to make decision if we have an agreement on the targeted audiences. Here are my assupations about our major (and potential) users of C# APIs.

They are C# developers (ML.NET is never the primilry choices for people who are familiar with C++ or Python).
They already know (at least entry-level) machine learning (otherwise, they should use AutoML, not ML.NET).
They do not necessarily have Visual Studio installed. I don't believe documentation should depend on any functionaility provided by IDEs.
They are not familar with ML.NET (millions of C# developers but we only have 5k stars).

In addition to the targeted users, we also need to determine what they can do after reading an the documentation of a binary classification trainer (decision made can be extended to other trainers). Notice that we're talking about API documents, not neither scenario examples nor tutorials. Personally, I think

They should be able to train a binary classification model with standard inputs and standard outputs described in textbooks. (If they don't know binary classification, they should use AutoML or find some machine learning tutorials.)
They can copy-and-paste our code to further explore other options provided by that trainers.
They don't need to learn other APIs and whole ML.NET for understanding a single API. Showing how different APIs work together should be described in scenario examples and tutorials, not an API document.
They can write a toy example using VIM or notepad++.

clauren42 commented 5 years ago

I believe examples should be self contained but use real text loader and be verbose so folks learn how to evaluate the quality of the model etc in the same example without having to refer to other docs. This helps demonstrate real usage with best practices instead of just explaining how to use a specific API

wschin commented 5 years ago

self-contained + using text loader means --- the user need to learn IDataView, ML.NET's text format, ML.NET's text loader, all APIs used in featurization, and finally the API they want. But wait, does training has anything to do with data loading? If yes, why do we divide loader and trainer into two independent modules? So I think no!

I can also honestly tell you --- if you search for ML.NET examples in Chinese (why Chinese? It just filters out all our documents so we can focus on what users are doing), you will see they all copy-and-paste our entire examples, which means our examples are hard to be understood, adjusted and generalized. One of them even asked Can ML.NET handle in-memory data? (which is the entire reason of having C# APIs. Why? Otherwise, we just need command-line tools).

Furthermore, @clauren42, two things we often do is to ask C# developer to learn machine learning through a single example and expect that they will be able to make their own pipelines. Two implicit assumptions here are

User's data is similar to ours.
An example (or a few examples) can make a C# developer a data scientist.

These two assumptions don't look super true to me and confirmed by users (see this, this, this).

Btw, self-contained is something I like the most.

clauren42 commented 5 years ago

Most ml examples I've seen use some sort of helper function to load data (taxi fare, breast cancer, etc) rather than construct data in memory. Scikit-learn and TF obviously could take the same approach using python, but they don't. Being able to look at the data file is pretty helpful for people to understand what's going on...but if were talking about API level samples maybe in memory is fine. For getting started / how to content I think most samples should use train and test data sets vs in memory construction of data.

wschin commented 5 years ago

@clauren42, yes, we are talking about API documents, not tutorials, demos, or samples. In-memory is not just fine, we must have them. Text file is confusing C# developers as I have shown in several cases and users are not able create new things from it. So please forget about what we have created (not remove them) --- we need to listen to C# developers and talk their language instead of making API documents in data-science style.

In addition, the definition of helpful is not quite clear to me. It's good in terms of showing off our ability of doing data science, but I doubt if it's what C# developers really need when they just want to call a single function (developers always need to work with scientists. Why do they need to learn feature engineering?).

I am glab that you mentioned scikit-learn, but I think your impression is wrong. Let's take a look at their linear trainers's API documents --- only 6 out of 39 use real data sets (that is, 85% of them embrace fake and in-memory). Again, please do not treat API documents as tutorials or demos. For TF, everyone used to work on keras/tensorflow converter knows how suck its documentation is.

clauren42 commented 5 years ago

For API level doc's I agree in memory should be fine, perhaps even preferable.

Get Outlook for Androidhttps://aka.ms/ghei36

From: Wei-Sheng Chin notifications@github.com Sent: Sunday, March 3, 2019 12:29:22 PM To: dotnet/machinelearning Cc: Chris Lauren; Mention Subject: Re: [dotnet/machinelearning] Text loader v.s in-memory data structure in API reference samples (#2726)

@clauren42https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fclauren42&data=02%7C01%7Cchris.lauren%40microsoft.com%7Cdafba97a0dde4ac4681808d6a016ebf8%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636872417649829326&sdata=AcVazUHCuEQ7CqaUT94ACIRklGzdKT2kBfsUnfK4zbQ%3D&reserved=0, yes, we are talking about API documents, not tutorials, demos, or samples. In-memory is not just fine, we must have them. Text file is confusing C# developers as I have shown in several cases and users are not able create new things from it. So please forget about what we have created (not remove them) --- we need to listen to C# developers and talk their language instead of making API documents in data-science style.

In addition, the definition of helpful is not quite clear to me. It's good in terms of showing off our ability of doing data science, but I doubt if it's what C# developers really need when they just want to call a single function (developers always need to work with scientists. Why do they need to learn feature engineering?).

I am glab that you mentioned scikit-learn, but I think your impression is wrong. Let's take a look at their linear trainershttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fscikit-learn.org%2Fstable%2Fmodules%2Fclasses.html%23module-sklearn.linear_model&data=02%7C01%7Cchris.lauren%40microsoft.com%7Cdafba97a0dde4ac4681808d6a016ebf8%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636872417649839334&sdata=VWnU8zDJlaJQXBHad7gXbI%2BXdIC3vvHSXcAqpVlwfRs%3D&reserved=0's API documents --- only 6 out of 39 use real data sets (that is, 85% of them embrace fake and in-memory). Again, please do not treat API documents as tutorials or demos. For TF, everyone used to work on keras/tensorflow converter knows how suck its documentation is.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fdotnet%2Fmachinelearning%2Fissues%2F2726%23issuecomment-469060828&data=02%7C01%7Cchris.lauren%40microsoft.com%7Cdafba97a0dde4ac4681808d6a016ebf8%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636872417649849342&sdata=IpW1WxK7WzgAEFSqXRG3poCDlwptfTE%2FoLpiRe%2Bbcqg%3D&reserved=0, or mute the threadhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAtAjejrLZxICYMdCCRMA3T2x1doUT283ks5vTDCigaJpZM4bRERx&data=02%7C01%7Cchris.lauren%40microsoft.com%7Cdafba97a0dde4ac4681808d6a016ebf8%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636872417649849342&sdata=Y32Y9vSl2%2Bb4p3ma9uW9v6yvAoWrR8H03vYSYml0tb0%3D&reserved=0.

shmoradims commented 5 years ago

We've adopted in-memory and self-contained style for API reference samples, whenever possible. Closing this discussion issue.