Set all feature values by setting a single float[] Features property on ModelInput

andrasfuchs commented 2 years ago

Is your feature request related to a problem? Please describe.
When I run the trained models, I always need to fill the properties of ModelInput that represent input values one by one. Since I typically have 5000+ input values, I use reflection to set the properties. Although it works, it would be much better to have a Features float array on ModelInput, just like we have on ModelOutput.

Describe the solution you'd like
I'd like to have a generated, writable float[] Features property on the ModelInput class to set the input values.

Describe alternatives you've considered
I use reflection now to set the values one by one.

Additional context
I have a big dataset with many feature columns.
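The reflection workaround described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the original poster's code: the `ModelInput` class here, its `Feature0` to `Feature2` properties, and the `ModelInputFiller` helper are hypothetical stand-ins for a generated class with thousands of properties.

```csharp
using System;

// Hypothetical stand-in for a generated ModelInput class; a real generated
// class has one property per input column.
public class ModelInput
{
    public float Feature0 { get; set; }
    public float Feature1 { get; set; }
    public float Feature2 { get; set; }
}

// Hypothetical helper, not part of ML.NET or Model Builder.
public static class ModelInputFiller
{
    // Copies values[i] into the property named propertyNames[i].
    public static void Fill(object input, string[] propertyNames, float[] values)
    {
        if (propertyNames.Length != values.Length)
            throw new ArgumentException("Expected exactly one value per property name.");

        Type type = input.GetType();
        for (int i = 0; i < values.Length; i++)
        {
            var prop = type.GetProperty(propertyNames[i]);
            if (prop == null || !prop.CanWrite || prop.PropertyType != typeof(float))
                throw new ArgumentException($"No writable float property '{propertyNames[i]}'.");
            prop.SetValue(input, values[i]);
        }
    }
}
```

Passing the property names explicitly, rather than relying on the order returned by `GetProperties()`, keeps the mapping between array positions and columns visible in the calling code.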

beccamc commented 2 years ago

Possibly dataframes as well. See related https://github.com/dotnet/machinelearning-modelbuilder/issues/1973

torronen commented 2 years ago

I agree this is one of the pain points of consuming the models.

Here is a summary of what I am doing with DataFrames (the actual code is a bit too messy, pending refactoring). It avoids reflection but is not much simpler. The benefit for me is that I can still run models where the data types do not match (Single 1/0 vs. Boolean). It is probably possible to add a whole DataRow at once (= float[]), but the DataColumns would still need to be created. I'd be happy if someone points out something we could do better.

One concern I have with float[] is how to ensure the order is correct. It would be easy to accidentally remove or add one data point.
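One possible way to reduce that ordering risk is to build the array from named values against an explicit column order and fail loudly on any mismatch. The following is only a sketch of that idea: `FeatureVector.FromNamedValues` is a hypothetical helper, and the column order is assumed to come from wherever the training columns are recorded.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of ML.NET.
public static class FeatureVector
{
    // Returns the values arranged in the order given by columnOrder.
    // Throws if a column has no value or if an unexpected value is supplied.
    public static float[] FromNamedValues(
        IReadOnlyList<string> columnOrder,
        IReadOnlyDictionary<string, float> values)
    {
        var result = new float[columnOrder.Count];
        for (int i = 0; i < columnOrder.Count; i++)
        {
            if (!values.TryGetValue(columnOrder[i], out result[i]))
                throw new KeyNotFoundException($"Missing value for column '{columnOrder[i]}'.");
        }

        // Reject names that the column order does not know about.
        var expected = new HashSet<string>(columnOrder);
        foreach (string name in values.Keys)
        {
            if (!expected.Contains(name))
                throw new ArgumentException($"Unexpected column '{name}'.");
        }
        return result;
    }
}
```

With this shape, adding or dropping a data point surfaces as an exception instead of silently shifting every later value by one position.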

Load the model together with its input schema. The method needs modelZip as a parameter. In my case, I send the data as a dictionary for a single item, or as a CSV filename for batches.

model = mlContext.Model.Load(modelZip, out modelInputSchema);

Loop over each column in the schema.

foreach (DataViewSchema.Column? schemaCol in modelInputSchema)

Switch on the column type in case a conversion is expected to be needed (usually 1/0 => boolean or "true"/"false" => boolean).

switch (schemaCol.Value.Type.RawType.FullName)
    case "System.Single":
    case "System.Boolean":
    default: // String

Create a column for each variable with just 1 data point. You can run multiple predictions by replacing the 1 with an array of values. If needed, this code would do the casting inside the switch above.

In my case, I need to check whether my data contains that column or whether it was removed. If it is missing, I create a dummy variable. When doing that, it is important to validate the model with the missing columns. I do that by calculating new metrics for all current models with similar code. Someone else might prefer to throw an exception so that models are not used with missing data.

DataFrameColumn dfcN = new SingleDataFrameColumn(schemaCol.Value.Name, 1);
Single dfcValueN = 0; // TODO: put value here
dfcN[0] = dfcValueN;
df.Columns.Add(dfcN);

Get results

// Predict. The result is lazily evaluated; materializing the values with ToList() may be fastest.
var transformed = model.Transform(df);

// Read results. Remove Single() if there are multiple items.
Single score = transformed.GetColumn<float>("Score").Single();

// Not all models have calibration, but some have probability.
if (transformed.Schema.Any(x => x.Name == "Probability"))
{
    Single prob = transformed.GetColumn<float>("Probability").Single();
}
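The column-building steps above can be consolidated into one helper. The following is a sketch under assumptions rather than the poster's actual code: `SingleRowFrame.Build` is a hypothetical name, the row is assumed to arrive as a dictionary of column name to raw value, missing columns are filled with a default (0, false, or an empty string), and every non-Single, non-Boolean column is treated as text. It relies on the Microsoft.ML and Microsoft.Data.Analysis packages.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using Microsoft.Data.Analysis;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical helper that builds a one-row DataFrame matching a model's input schema.
public static class SingleRowFrame
{
    public static DataFrame Build(DataViewSchema schema, IReadOnlyDictionary<string, object> row)
    {
        var df = new DataFrame();
        foreach (DataViewSchema.Column col in schema)
        {
            // raw stays null when the caller did not supply this column.
            row.TryGetValue(col.Name, out object raw);
            Type rawType = col.Type.RawType;

            if (rawType == typeof(float))
            {
                var column = new SingleDataFrameColumn(col.Name, 1);
                column[0] = raw == null ? 0f : Convert.ToSingle(raw, CultureInfo.InvariantCulture);
                df.Columns.Add(column);
            }
            else if (rawType == typeof(bool))
            {
                var column = new BooleanDataFrameColumn(col.Name, 1);
                column[0] = raw != null && ToBool(raw);
                df.Columns.Add(column);
            }
            else
            {
                // Assumption: everything else is fed to the model as text.
                var column = new StringDataFrameColumn(col.Name, 1);
                column[0] = raw?.ToString() ?? string.Empty;
                df.Columns.Add(column);
            }
        }
        return df;
    }

    // Accepts a bool, the strings "true"/"false"/"1", or a numeric 1/0.
    private static bool ToBool(object raw)
    {
        if (raw is bool b) return b;
        if (raw is string s) return s == "1" || (bool.TryParse(s, out bool parsed) && parsed);
        return Convert.ToDouble(raw, CultureInfo.InvariantCulture) != 0.0;
    }
}
```

The resulting frame could then be passed to model.Transform as in the steps above. Silently defaulting missing columns is a design choice that only makes sense if the model has been validated with those columns missing; throwing an exception instead is the stricter alternative.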

torronen commented 2 years ago

@andrasfuchs You have a cool project! BTW, not sure if this is helpful, but just in case: if you are using Model Builder for the device and get good metrics in Model Builder but not for live measurements, I think you might also need the Sampling Key support (https://github.com/dotnet/machinelearning-modelbuilder/issues/1873; maybe personid or measurementSessionId as the sampling key), or you could switch to the ML.NET class library for the time being. At least that has been my experience. Some trainers (FastForest, maybe sgd and others) might be okay, but especially if Model Builder gives FastTree or LightGBM as the top choice, the risk of overfitting may be high, based on my experience with different datasets. It was disappointing to get high accuracy scores that did not materialize in practice. So I wanted to point this out in case it helps avoid some frustration.

torronen commented 2 years ago

One more point to consider is that the schema in model files created by Model Builder includes (unless there has been some recent change) all ignored columns. Ignored columns are good candidates for removal from the dataset. If I try to feed a DataFrame or an input class without the ignored columns present, I get an exception.

Thus, I believe it should not be necessary to require them to be present in the input (ModelInput class or DataFrame). Otherwise, developers are forced to feed dummy data for ignored columns.

As I remember, another post by the OP and issue #1939 are also about ignored columns, more specifically ignored secondary labels, so it may be fairly common to have ignored columns.

andrasfuchs commented 2 years ago

@torronen Thank you! Overfitting might be a problem in my case too, but the last time I was surprised by how badly the model made its predictions, it was because of a FastForestOva-related bug in ML.NET, which I reported in issue #6037. You could have run into it without realizing it, so I thought I'd mention it ;)

torronen commented 2 years ago

@andrasfuchs thanks! It is actually likely that this impacts at least some of my code. I need to review my code and the other trainers next week. Luckily, most of my code uses BinaryTrainers, which only give one score.

torronen commented 2 years ago

@beccamc There is an internal sealed class, ArrayDataViewBuilder, which takes floats and might be exposed in the public API to allow feeding float[][] to create an IDataView. It might be pretty close to @andrasfuchs's request.

Current usage in source:

Single[] targets = ....
Single[][] features = ....
ArrayDataViewBuilder dvBuilder = new ArrayDataViewBuilder(_context);
dvBuilder.AddColumn(DefaultColumnNames.Label, NumberDataViewType.Single, targets);
dvBuilder.AddColumn(DefaultColumnNames.Features, NumberDataViewType.Single, features);

On the other hand, the Microsoft.ML team does not seem to be in favor of making things public. Would one option be to copy some of the "useful" stuff and put it in a community NuGet package? Do you know where to find the restrictions on how to name such "I think this is useful stuff & custom helpers" packages, such as whether to keep or drop "Microsoft.ML" as part of the name?
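For comparison, something similar can be approximated with public API only, by loading rows that carry a float[] through `LoadFromEnumerable`. The sketch below is an assumption-laden illustration, not a replacement for ArrayDataViewBuilder: `VectorRow` and `VectorDataViews.FromArrays` are hypothetical names, and a model generated with one input column per feature would presumably still expect those individual columns rather than a single vector column.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical row type: one label and one feature vector per row.
public class VectorRow
{
    public float Label { get; set; }
    public float[] Features { get; set; }
}

// Hypothetical helper built only on public ML.NET API.
public static class VectorDataViews
{
    public static IDataView FromArrays(MLContext mlContext, float[] targets, float[][] features)
    {
        if (targets.Length != features.Length)
            throw new ArgumentException("Expected one feature row per target.");

        var rows = new List<VectorRow>();
        for (int i = 0; i < targets.Length; i++)
            rows.Add(new VectorRow { Label = targets[i], Features = features[i] });

        // Declare the vector length at run time instead of hard-coding it
        // in a [VectorType] attribute on the row class.
        int length = features.Length > 0 ? features[0].Length : 0;
        SchemaDefinition schema = SchemaDefinition.Create(typeof(VectorRow));
        schema["Features"].ColumnType = new VectorDataViewType(NumberDataViewType.Single, length);

        return mlContext.Data.LoadFromEnumerable(rows, schema);
    }
}
```

The resulting IDataView has a Label column and a fixed-size Features vector column, which is the shape trainers consume directly.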

beccamc commented 2 years ago

The best change here would be to generate a predict method that takes a DataFrame.

andrasfuchs commented 2 years ago

Is there any chance to have this in the next release?

dotnet/machinelearning-modelbuilder, issue #1975