Jupyter + ML.NET | DataFrame vs Dataview

aslotte commented 4 years ago

While working with ML.NET in Jupyter, it can sometimes feel like double work to have to load the data first into a DataFrame, and then also into an IDataView to be able to use it in e.g. a training pipeline.

It would be ideal if the .Fit and/or TrainTestSplit methods could take a DataFrame as a parameter, or if there were other interoperability methods to leverage, so that the data wouldn't need to be loaded in two places.

eerhardt commented 4 years ago

DataFrame implements IDataView. You can already pass it into the .Fit method.

See https://github.com/dotnet/try/blob/master/NotebookExamples/csharp/Samples/HousingML.ipynb

experiment.Execute takes an IDataView

https://github.com/dotnet/machinelearning/blob/a5977962dc240304b30fcf2d2df437c2d99b3b47/src/Microsoft.ML.AutoML/API/ExperimentBase.cs#L67-L69

And you can just pass an in-memory DataFrame into any method that takes an IDataView. No need to load it twice.

You can do the same thing for TrainTestSplit. I just didn't do it in that notebook and instead wrote my own split method to show how it can be done.
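As a rough illustration of the point above, here is a minimal sketch of handing a DataFrame straight to TrainTestSplit and .Fit. It assumes the current Microsoft.Data.Analysis package (at the time of this thread the type lived in corefxlab under a different namespace), and the helper name TrainOnDataFrame and the 20% test fraction are made up for the example. The pipeline is whatever estimator chain you already have.

```csharp
using Microsoft.Data.Analysis; // DataFrame; assumes the Microsoft.Data.Analysis NuGet package
using Microsoft.ML;

// Hypothetical helper: splits an in-memory DataFrame and trains on the training part.
static ITransformer TrainOnDataFrame(
    MLContext mlContext,
    DataFrame data,
    IEstimator<ITransformer> pipeline)
{
    // DataFrame implements IDataView, so no TextLoader or second load is needed.
    var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);

    // Fit also takes an IDataView; split.TestSet stays available for evaluation.
    return pipeline.Fit(split.TrainSet);
}
```

The column types still have to match what the trainer expects, since the schema comes from whatever types the DataFrame inferred when it read the file.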

aslotte commented 4 years ago

Thanks @eerhardt! I actually just saw the source code for the DataFrame class and realized the same thing. Was just about to close this issue, but you beat me to it.

aslotte commented 4 years ago

@eerhardt, I removed the step that loads the data the regular ML.NET way, but that also meant I discarded the schema. I ran into an issue doing this when one of my columns, isFraud, was interpreted as an integer when it should have been a boolean. This causes issues when calling .Fit(), which expects the labelColumn to be a boolean.

Is there a way in the DataFrame to change the type of a column? Or is this an ML.NET issue where it should be a bit smarter and also accept an integer if the range is 0-1?

eerhardt commented 4 years ago

ML.NET is pretty strict on what the trainers accept for input. Originally, they were more lenient, but this led to inconsistencies across the library where one BinaryClassification algorithm would accept floats and another one wouldn't. So you'd get an exception based on which algorithm you choose. See https://github.com/dotnet/machinelearning/pull/2804 and related issues.

> Is there a way in the DataFrame to change type of a column?

Right now DataFrame's logic in ReadCsv is a bit primitive, and doesn't allow a caller to tell it which types the columns should be. You can see the logic here: https://github.com/dotnet/corefxlab/blob/master/src/Microsoft.Data/DataFrame.IO.cs#L25-L31.

Feel free to open an issue about this in https://github.com/dotnet/corefxlab/

cc @pgovind

So what can you do?

I can imagine two different ways of working around this right now.

1. Fix up the DataFrame to contain a boolean column based on the int column, for example:

    housingData["populationGreaterThan500"] = housingData["population"] > 500;

2. In the ML.NET pipeline, use the Conversion transforms to make the new column:

    mlContext.Transforms.Conversion.ConvertType(
        "IsFraudBool", "IsFraud", DataKind.Boolean);
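To show where the second workaround might sit in a full pipeline, here is a minimal sketch. The choice of SdcaLogisticRegression and the existence of a "Features" column are assumptions made for the example; substitute whichever binary classification trainer and feature preparation you actually use.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data; // DataKind

var mlContext = new MLContext();

// Convert the integer "IsFraud" column into a new Boolean column,
// then point the trainer's label at the converted column.
var pipeline = mlContext.Transforms.Conversion
    .ConvertType("IsFraudBool", "IsFraud", DataKind.Boolean)
    // Assumption: a "Features" vector column has already been produced upstream.
    .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(
        labelColumnName: "IsFraudBool",
        featureColumnName: "Features"));
```

Doing the conversion inside the pipeline means the same step is applied again at prediction time, whereas fixing the column in the DataFrame only changes the data you train on.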

eerhardt commented 4 years ago

Closing. Feel free to file new follow-up issues for things that aren't working as expected. But in general, passing a DataFrame into ML.NET should work without having to load the data again using a TextLoader.

dotnet / machinelearning

Jupyter + ML.NET | DataFrame vs Dataview #4252
