dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

FastTree: Instantiate feature map for disk transpose and make Generalized Additive Models predictor resilient when feature map is not available. #123

Closed codemzs closed 6 years ago

codemzs commented 6 years ago

During training, FastTree (gradient boosted decision trees) drops features that offer little to no value, such as features with a zero instance count or features with too few instances for their unique feature values. As a result, the feature count in the training set can be less than or equal to the feature count in the user's input feature vector, so we use an internal feature map to map the training-set features back to the input features.
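To make the idea concrete, here is a minimal Python sketch of such a feature map. It is not the ML.NET C# code: the function name `build_feature_map`, the `min_distinct` threshold, and the "too few distinct values" criterion are assumptions chosen for illustration. It returns `None` when nothing is dropped, mirroring the behavior described in this issue.

```python
from typing import List, Optional, Sequence


def build_feature_map(
    columns: Sequence[Sequence[float]], min_distinct: int = 2
) -> Optional[List[int]]:
    """Return the input-feature index of every kept training feature.

    Hypothetical criterion: a feature is kept only if it has at least
    `min_distinct` distinct values. Returns None when no feature is
    dropped, because no mapping is needed in that case.
    """
    kept = [
        input_index
        for input_index, column in enumerate(columns)
        if len(set(column)) >= min_distinct
    ]
    if len(kept) == len(columns):
        return None  # training features line up 1:1 with input features
    return kept


# Feature 1 is constant, so it is dropped; training feature 1 maps to input feature 2.
feature_map = build_feature_map([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0], [5.0, 6.0, 5.0]])
```

With this layout, position `i` in the list is the training-set feature index and the stored value is the index into the user's input vector.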

Issue #1: If no features are dropped or filtered during training, the feature map is not created. FastTree handles a null feature map, but the Generalized Additive Model (GAM) predictor does not.
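A null-resilient predictor could look roughly like the Python sketch below. This is an assumption about the shape of the fix, not the actual GAM predictor code: `to_input_index`, `gam_score`, and the per-feature `shape_fns` are hypothetical names, and the model is reduced to an intercept plus one function per training feature.

```python
from typing import Callable, List, Optional, Sequence


def to_input_index(feature_map: Optional[List[int]], train_index: int) -> int:
    """Map a training-feature index to an input-feature index.

    A missing (None) feature map means no features were dropped,
    so the mapping is the identity.
    """
    if feature_map is None:
        return train_index
    return feature_map[train_index]


def gam_score(
    intercept: float,
    shape_fns: Sequence[Callable[[float], float]],
    feature_map: Optional[List[int]],
    x: Sequence[float],
) -> float:
    """Sum the per-feature contributions of an additive model."""
    total = intercept
    for train_index, shape_fn in enumerate(shape_fns):
        total += shape_fn(x[to_input_index(feature_map, train_index)])
    return total


shape_fns = [lambda v: 2.0 * v, lambda v: v + 1.0]
with_map = gam_score(0.0, shape_fns, [0, 2], [1.0, 9.0, 3.0])
without_map = gam_score(0.0, shape_fns, None, [1.0, 3.0])
```

Treating a null map as the identity keeps a single scoring path for both cases, rather than requiring every caller to construct a trivial map.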

Issue #1.1: Before training starts, FastTree runs a data preparation step that transposes the dataset and eliminates examples with missing feature values. The transpose can be done in memory or on disk (recommended for larger datasets). The disk transpose was not filtering out features that were not supposed to be included in training, and it was also not creating a feature map when one was required. Hence a null feature map was passed to the GAM predictor, which was not resilient to it.
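One way to keep the two transpose paths consistent is to route both through the same feature selection step. The Python sketch below illustrates that idea only; it is not how ML.NET implements disk transpose. All names are hypothetical, the temporary file with one JSON line per column stands in for the real on-disk format, and the "fewer than two distinct values" filter is an assumed criterion.

```python
import json
import math
import tempfile
from typing import List, Optional, Sequence, Tuple

Columns = List[List[float]]


def drop_incomplete(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Eliminate examples (rows) that contain a missing (NaN) value."""
    return [list(r) for r in rows if not any(math.isnan(v) for v in r)]


def select_features(columns: Columns) -> Tuple[Columns, Optional[List[int]]]:
    """Shared filter: drop low-value features and build the feature map."""
    kept = [i for i, col in enumerate(columns) if len(set(col)) >= 2]
    feature_map = None if len(kept) == len(columns) else kept
    return [columns[i] for i in kept], feature_map


def transpose_in_memory(rows: Sequence[Sequence[float]]):
    columns = [list(col) for col in zip(*drop_incomplete(rows))]
    return select_features(columns)


def transpose_on_disk(rows: Sequence[Sequence[float]]):
    # Spill each column to a temporary file, then read the columns back.
    with tempfile.TemporaryFile(mode="w+") as spill:
        for col in zip(*drop_incomplete(rows)):
            spill.write(json.dumps(list(col)) + "\n")
        spill.seek(0)
        columns = [json.loads(line) for line in spill]
    # Same filtering and feature-map creation as the in-memory path.
    return select_features(columns)


rows = [[1.0, 0.0, 5.0], [2.0, 0.0, 6.0], [float("nan"), 0.0, 7.0]]
memory_result = transpose_in_memory(rows)
disk_result = transpose_on_disk(rows)
```

Because both functions end in the same `select_features` call, neither path can skip the filtering or forget to create the map.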

markusweimer commented 6 years ago

Can you explain more? The title makes it sound like two separate issues to me.

codemzs commented 6 years ago

@markusweimer: During training, FastTree (gradient boosted decision trees) drops features that offer little to no value, such as features with a zero instance count or features with too few instances for their unique feature values. As a result, the feature count in the training set can be less than or equal to the feature count in the user's input feature vector, so we use an internal feature map to map the training-set features back to the input features.

Issue #1: If no features are dropped or filtered during training, the feature map is not created. FastTree handles a null feature map, but the Generalized Additive Model (GAM) predictor does not.

Issue #1.1: Before training starts, FastTree runs a data preparation step that transposes the dataset and eliminates examples with missing feature values. The transpose can be done in memory or on disk (recommended for larger datasets). The disk transpose was not filtering out features that were not supposed to be included in training, and it was also not creating a feature map when one was required. Hence a null feature map was passed to the GAM predictor, which was not resilient to it.

You are right, these are two issues, but they are related.

shauheen commented 6 years ago

Closed by #122.