dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

AutoML execute with preFeaturizer should accept Int32 and so #4044

Open baruchiro opened 5 years ago

baruchiro commented 5 years ago

System information

Microsoft.ML.AutoML (0.14.0)

Issue

TL;DR: ExperimentBase.Execute with a non-null preFeaturizer argument should check the schema types after the preFeaturizer has been applied, not before.


I have a flattened object with long and int fields that I load into an IDataView. If I try to Execute an experiment on this dataView, I get this exception:

System.ArgumentException: 'Only supported feature column types are Boolean, Single, and String. Please change the feature column Feature1 of type Int64 to one of the supported types. Parameter name: trainData'

So, I have to create an EstimatorChain that uses ConvertType to convert from long to Single, then Fit it and Transform the dataView.

Let's say I have this EstimatorChain to transform these types. Now I have two options:

  1. Transform the dataView before passing it to the Execute method.
    With this option, the problem is that I have to create a class that fits the new schema if I want to save the model and use it later.
    (The first generic type in CreatePredictionEngine must match the inputSchema.)
  2. Pass this EstimatorChain as preFeaturizer argument in the Execute method.
    But this is not a real solution because the Execute method still throws the exception above!
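
For reference, option 1 can be sketched roughly as below. This is only a minimal sketch under assumptions: the `FlatRow` class, the `Feature2` column, and the `ToSupportedTypes` helper are hypothetical names used for illustration, and the converted view would then be passed to Execute in place of the original one.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical flattened input type with integer fields.
public class FlatRow
{
    public long Feature1 { get; set; }
    public int Feature2 { get; set; }
    public float Label { get; set; }
}

public static class ConvertBeforeExecute
{
    // Converts the integer feature columns to Single before the data
    // is handed to the experiment (option 1 above).
    public static IDataView ToSupportedTypes(MLContext mlContext, IDataView data)
    {
        var convert = mlContext.Transforms.Conversion.ConvertType(
            new[]
            {
                new InputOutputColumnPair("Feature1"),
                new InputOutputColumnPair("Feature2"),
            },
            DataKind.Single);

        // Fit on the data, then transform it; the resulting schema has
        // Single columns where the original had Int64/Int32.
        return convert.Fit(data).Transform(data);
    }
}
```

The drawback described in option 1 still applies: the saved model expects the converted schema, so the prediction input class has to match it.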

acrigney commented 5 years ago

I am having the same problem.

1. With the first solution I get an error like this, as if the transformer is not working.
2. With the preFeaturizer I still get the same error, again as if the internal transformer is not working. When I inspect the object, the Append calls seem to have no effect, as only the DropColumns transform is in there.

This is very frustrating. Here is my preFeaturizer; when I test it, I am only ever able to add the DropColumns transform.

    private IEstimator<ITransformer> GetPreFeatureizer()
    {
        PropertyInfo propertyInfo;
        // STEP 2: Build a pre-featurizer for use in the AutoML experiment.
        // (Internally, AutoML uses one or more train/validation data splits to
        // evaluate the models it produces. The pre-featurizer is fit only on the
        // training data split to produce a trained transform. Then, the trained transform
        // is applied to both the train and validation data splits.)

        IEstimator<ITransformer> preFeatureizer = _mlContext.Transforms.DropColumns(_modelInput.KeyFeatureToIgnore);

        foreach (string feature in _includedFeatureNames)
        {
            propertyInfo = _allFeaturesPropertyInfo.Find(x => x.Name == feature);
            if (typeof(Double) == propertyInfo.PropertyType)
            {
                preFeatureizer.Append(_mlContext.Transforms.Conversion.ConvertType(feature, feature, DataKind.Single));                    
            }
            preFeatureizer.Append(_mlContext.Transforms.NormalizeMeanVariance(feature, useCdf: false));                
        }
        preFeatureizer.AppendCacheCheckpoint(_mlContext);

        return (preFeatureizer);
    }
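
A likely cause of the "only DropColumns" symptom in the snippet above: `Append` and `AppendCacheCheckpoint` do not modify the estimator they are called on. They return a new chain, so the result has to be assigned back. A minimal sketch of that pattern follows; the `PreFeaturizerSketch` class and its parameter names are hypothetical, and this does not address the type check inside Execute that the issue is about.

```csharp
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;

public static class PreFeaturizerSketch
{
    // Builds DropColumns -> (ConvertType -> NormalizeMeanVariance) per feature,
    // keeping the chain returned by each Append call.
    public static IEstimator<ITransformer> Build(
        MLContext mlContext,
        string columnToDrop,                  // hypothetical: column to ignore
        IEnumerable<string> doubleFeatures)   // hypothetical: Double-typed feature names
    {
        IEstimator<ITransformer> chain = mlContext.Transforms.DropColumns(columnToDrop);

        foreach (string feature in doubleFeatures)
        {
            // Append returns a new EstimatorChain; reassign so the step is kept.
            chain = chain.Append(
                mlContext.Transforms.Conversion.ConvertType(feature, feature, DataKind.Single));
            chain = chain.Append(
                mlContext.Transforms.NormalizeMeanVariance(feature, useCdf: false));
        }

        return chain;
    }
}
```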

justinormont commented 4 years ago

@baruchiro: Quite right. We should check the datatype after the pre-featurizer is applied.

Another possible route is automatic conversion from long to Single within AutoML. This route would take some thought, as this can be a little bit dangerous to do automatically as the mapping is only 1-to-1 when within ± 2^24+1. For instance, this would negatively affect someone forecasting the sales of products using a UPC/EAN number as the conversion would be lossy.
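
To illustrate the precision concern, here is a small self-contained sketch (plain C#, no ML.NET; the `FloatPrecision` class and the sample UPC-like value are made up for illustration). Single has a 24-bit significand, so integers up to 2^24 in magnitude survive the conversion exactly, while 2^24 + 1 is the first one that does not.

```csharp
using System;

public static class FloatPrecision
{
    // True when converting to Single and back yields the same integer.
    public static bool RoundTrips(long value) => (long)(float)value == value;

    public static void Main()
    {
        Console.WriteLine(RoundTrips(16_777_216L));     // 2^24: True
        Console.WriteLine(RoundTrips(16_777_217L));     // 2^24 + 1: False
        Console.WriteLine(RoundTrips(36_000_291_452L)); // made-up 11-digit product code: False
    }
}
```

Distinct product codes of that size can collapse to the same Single value, which is why an automatic conversion would need to be opt-in or at least guarded.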

acrigney commented 4 years ago

So you are going to do a fix?