dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

What is "Slot" (PFI documentation suggestions) #5954

Open torronen opened 3 years ago

torronen commented 3 years ago

I am running the new PFI API (main branch with #5934) on a loaded FastTreeBinary model created by the AutoML API.

Main question: I receive items like "Slot 48416" from MLContext.BinaryClassification.PermutationFeatureImportanceNonCalibrated(). I did not find documentation about how to interpret these items. What do they mean? I am stuck with this issue.

As I understand it, the name comes from the feature vector, for slots without a name. I am confused about why my feature vector has these additional items, and how I can backtrack which original feature they belong to. I have about 2000 features in my dataset.

Side items / suggestion for documentation: I notice there is some logging code in PFI which seems to report the progress of PFI through a ProgressHeader, but I could not find documentation on how I can read the progress: pch.SetHeader(new ProgressHeader("processed slots"), e => e.SetProgress(0, processedCnt));

There is also another GitHub issue about the recommended values for the permutation count and the number of examples, and about estimating the running time. The running time seems to grow faster than O(n) in the number of examples, but I still have not understood the source or concept of PFI adequately. It would also be useful to know whether increasing the number of examples or the number of permutations gives more accurate results. Do I understand correctly that accuracy increases until the number of permutations reaches the number of features? And increasing the number of examples would increase the chance that the dataset is adequately represented, is that correct?

michaelgsharp commented 3 years ago

So the name "Slot 48416" just comes up if there isn't a name for that slot/index in the feature vector column. That can happen for various reasons, like the original column not having a name, but it's also very possible we aren't adding it correctly.

I am interested in the fact that you only have about 2000 features, yet it seems like the feature column ends up with a lot more slots than that. Can you check the schema of that column in your pipeline and let me know what it says? We may be able to use this to track down whether there is a bug or something we are missing when naming the slots or elsewhere. It's also possible it is working completely as intended; we will just need more information to see.

Running time for sure seems to be longer than O(n), though honestly I am not sure what it is. @justinormont may have a better understanding of the time required.

For your other questions at the end I will need to ask a few people. I am not the most familiar with how PFI itself actually works under the hood.

justinormont commented 3 years ago

Unnamed slots: There's a variety of transforms which fail to set their slot names.

Fixing on ML.NET dev side -- This is an issue that should be fixed. Ideally each transform would provide good names for each feature created. Alternatively, instead of fixing individual transforms, a less clean but easier fix is naming all slots in only the concat transform, naming its output slots as {inputColumnName} Slot {i}, e.g. WeeklyLags Slot 23, for any previously unnamed slots. Ideally, the unnamed-slot names would only be calculated lazily (to reduce memory and model size).

Before a fix is in, you can backtrack the slot's purpose by looking at your concat transform. If it takes in columns {a, b, c} and produces column d, you can get the final size of each of {a, b, c} and calculate which original column and slot that Slot 48416 in column d maps to. Slots in your output column d are simply the array concatenation of each of your input columns {a, b, c}, and they remain in order.
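The backtracking described above can be sketched as a walk over the input column sizes. This is a language-neutral illustration in Python rather than ML.NET code; the function name locate_slot and the column sizes in the usage note are made up for the example, and you would read the real sizes from your own pipeline's schema.

```python
from typing import Sequence, Tuple


def locate_slot(column_sizes: Sequence[Tuple[str, int]], slot: int) -> Tuple[str, int]:
    """Map a slot index in the concatenated column back to (input column, local slot).

    column_sizes lists the concat inputs in order, with the vector size of each.
    """
    if slot < 0:
        raise IndexError("slot index must be non-negative")
    remaining = slot
    for name, size in column_sizes:
        if remaining < size:
            # The slot falls inside this input column.
            return name, remaining
        # Skip past this input column and keep looking.
        remaining -= size
    raise IndexError(f"slot {slot} is past the end of the concatenated column")
```

For example, with hypothetical sizes a=2000, b=65536, c=10, locate_slot([("a", 2000), ("b", 65536), ("c", 10)], 48416) returns ("b", 46416), i.e. the slot belongs to input column b.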

Slow PFI: The runtime for PFI should be O(numFeatures * permutationCount * (numberOfExamplesToUse * modelPredictionTimePerRow + metricCalculationTime)).

The modelPredictionTimePerRow generally grows with more features (numFeatures) and more rows of data. The metricCalculationTime is O(numberOfExamplesToUse * log(numberOfExamplesToUse)) for binary classification, due to a sort in the AUC; and O(numberOfExamplesToUse) for other tasks.

For linear models, modelPredictionTimePerRow is O(numFeatures), which makes PFI O(numFeatures²) when the other terms are held fixed.

Trees are a bit more complex for runtime; their modelPredictionTimePerRow is O(numTrees * log(numLeavesPerTree)). Both are settable hyperparameters, though both in turn tend to grow, when tuned, with more features and rows.
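To see how the terms above combine, here is a rough cost model in Python. It is only a sketch of the big-O expression: it counts abstract operations with no constants, the function names are invented for this illustration, and it says nothing about actual wall-clock time in ML.NET.

```python
import math


def pfi_cost(num_features, permutation_count, num_examples,
             predict_cost_per_row, binary_classification=True):
    """Abstract operation count following the O(...) expression above."""
    if binary_classification:
        # AUC needs a sort over the scored examples: n * log(n).
        metric_cost = num_examples * math.log2(max(num_examples, 2))
    else:
        metric_cost = num_examples
    # One evaluation = score every row, then compute the metric once.
    per_evaluation = num_examples * predict_cost_per_row + metric_cost
    return num_features * permutation_count * per_evaluation


def linear_predict_cost(num_features):
    # One multiply-add per feature.
    return num_features


def tree_predict_cost(num_trees, num_leaves_per_tree):
    # One root-to-leaf walk per tree, assuming roughly balanced trees.
    return num_trees * math.log2(num_leaves_per_tree)
```

Under this model, doubling numFeatures for a linear model roughly quadruples the estimate, which is the O(numFeatures²) behaviour mentioned above.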

Speeding up PFI: For trainers which report their model feature weights, like FastTree binary, I'd recommend using those instead of PFI to get the global feature importance. Example code: https://github.com/justinormont/ImgurClassifier/blob/975973cde0f2ed6c7290f718f4052334bd925e22/ImgurClassifier.ConsoleApp/Explainability.cs#L101-L126. PFI has the benefit of being available for all trainers.

To speed up PFI, you can use UseFeatureWeightFilter, which uses the above-mentioned model feature weights as a pre-filter (it's a NOOP for models not supporting feature weights). You can also limit numberOfExamplesToUse to use fewer rows (defaults to all rows; but note that if limited, only the first N rows are used, not a random sample; you can pre-shuffle before PFI), and set permutationCount to one or a small number of rounds (defaults to 1).
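For readers unfamiliar with what the two knobs control, below is a minimal, generic sketch of permutation feature importance in plain Python. It is not the ML.NET implementation: the function and its signature are invented here, importance is taken as "baseline metric minus metric after shuffling" (an assumption; sign conventions differ between libraries), and the only behaviour copied from the description above is that a row limit takes the first N rows.

```python
import random
from typing import Callable, List, Optional, Sequence


def permutation_importance(
    predict: Callable[[Sequence[float]], float],
    metric: Callable[[Sequence[float], Sequence[float]], float],
    rows: Sequence[Sequence[float]],
    labels: Sequence[float],
    permutation_count: int = 1,
    number_of_examples_to_use: Optional[int] = None,
    seed: int = 0,
) -> List[float]:
    """Mean drop in `metric` (higher is better) when each feature is shuffled."""
    if number_of_examples_to_use is not None:
        # First N rows, not a random sample; shuffle beforehand if that matters.
        rows = rows[:number_of_examples_to_use]
        labels = labels[:number_of_examples_to_use]
    rng = random.Random(seed)
    baseline = metric(labels, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(permutation_count):
            # Shuffle only column j across rows; other columns stay in place.
            column = [r[j] for r in rows]
            rng.shuffle(column)
            permuted = [list(r[:j]) + [v] + list(r[j + 1:])
                        for r, v in zip(rows, column)]
            score = metric(labels, [predict(r) for r in permuted])
            drops.append(baseline - score)
        importances.append(sum(drops) / len(drops))
    return importances
```

The nested loops make the cost visible: every feature and every permutation round re-scores all selected rows, so lowering either numberOfExamplesToUse or permutationCount cuts the work proportionally. More permutation rounds average out the randomness of a single shuffle; they are not tied to the number of features.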

torronen commented 2 years ago

Global Feature Index works perfectly, thank you.

The features without names in PFI seem to be categorical string values. GFI reports the names in the format "CityCode.HEL", i.e. columnName.Value.

torronen commented 2 years ago

@michaelgsharp It seems this is creating a high count of weights without names: https://github.com/dotnet/machinelearning/blob/305540348efbb70dce2ead0751f4ffb3f7098f33/src/Microsoft.ML.AutoML/TransformInference/TransformInference.cs#L279

I commented out the part below, and started getting problems with GFI. In debug inspection I noted that the count of weights is much higher than the count of slot names. I am using the sample from justinormont's link above, with slight modifications. https://github.com/dotnet/machinelearning/blob/305540348efbb70dce2ead0751f4ffb3f7098f33/src/Microsoft.ML.AutoML/TransformInference/TransformInference.cs#L258-L263

lastTransformer.Model.SubModel.GetFeatureWeights(ref weights); gives a very high count of items (in the last dataset, something like 200k).

output.Schema["Features"].GetSlotNames(ref slotNames); still gives the expected count, in this case 7000.

I have not looked into this further yet; I need to complete the main task first. So it is possible I am misunderstanding something here.

justinormont commented 2 years ago

@torronen One-hot hashing transform has the option of creating slot names: https://github.com/dotnet/machinelearning/blob/0577957256c296fdea2deb6b6e00e7be9b458167/src/Microsoft.ML.Transforms/OneHotHashEncoding.cs#L104-L107

When AutoML creates a one-hot hashing transform, it is not using the MaximumNumberOfInverts parameter: https://github.com/dotnet/machinelearning/blob/305540348efbb70dce2ead0751f4ffb3f7098f33/src/Microsoft.ML.AutoML/EstimatorExtensions/EstimatorExtensions.cs#L221

The default of MaximumNumberOfInverts is 0, which disables the creation of slot names for one-hot hashing. This default is useful as it otherwise increases the model size.

One-hot hashing is used when the cardinality of the column is large; standard one-hot is used for lower cardinalities: https://github.com/dotnet/machinelearning/blob/305540348efbb70dce2ead0751f4ffb3f7098f33/src/Microsoft.ML.AutoML/TransformInference/TransformInference.cs#L258-L267

The slot names are created as: slotNames[HASH(str) % hashBucketLength] += (str + "|") (pseudocode)

Multiple strings can map to the same hash bucket, giving a slot name of cat|dog|fish. And importantly, many buckets will never have a value hashed into them from the training dataset. Therefore even with MaximumNumberOfInverts set, many slots will not have a corresponding slot name.
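A small Python sketch of the pseudocode above shows both effects: collisions that merge names, and buckets that stay unnamed. It is illustrative only. zlib.crc32 is a stand-in hash (not the hash ML.NET uses), the function name is invented, and the handling of maximum_number_of_inverts (0 disables names, a negative value keeps every value, otherwise a per-bucket cap) reflects my reading of the parameter rather than the actual implementation.

```python
import zlib
from typing import Iterable, List


def hashed_slot_names(values: Iterable[str], hash_bucket_length: int,
                      maximum_number_of_inverts: int) -> List[str]:
    """Build slot names for a hashed one-hot vector from the training values."""
    if maximum_number_of_inverts == 0:
        # Name creation disabled: every slot stays unnamed.
        return [""] * hash_bucket_length
    buckets: List[List[str]] = [[] for _ in range(hash_bucket_length)]
    for value in dict.fromkeys(values):  # distinct values, first-seen order
        index = zlib.crc32(value.encode("utf-8")) % hash_bucket_length
        bucket = buckets[index]
        # A negative limit keeps everything; otherwise cap names per bucket.
        if maximum_number_of_inverts < 0 or len(bucket) < maximum_number_of_inverts:
            bucket.append(value)
    # Colliding values share one slot name, separated by "|".
    return ["|".join(bucket) for bucket in buckets]
```

With three training values and, say, 1024 buckets, at most three slots receive a name and the remaining slots stay empty, which is why a tool like PFI still has to fall back to "Slot X" for them.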

Ideally, any empty slot names would be auto-created lazily (as mentioned above) and filled in. This would require a fix to ML.NET.

Instead of using one-hot hashing, if you use the standard one-hot transform, it will produce a slot name for each slot.

michaelgsharp commented 2 years ago

When you say "created lazily", do you mean we would figure out which column the slot originally came from? If not, how would that work, since we won't know what was hashed to get to that slot originally? Right now for PFI (the new APIs), if the slot name isn't known it just fills in "Slot X".