nganju98 opened 5 years ago
I haven't seen anything regarding data augmentation/upsampling so far either.
I would recommend this method.
ML.NET has instance weights, which let us up/down-weight certain rows of data. Generally this is done by up-weighting an entire class.
Internally, we have an Expression Transform, which makes this very simple:
```
xf=Expr{ col=Weight:Label expr={ x:(x == 1 ? 10.0f : 1.0f) } }
```
This sets the Weight Column to 10.0 for Class 1, and 1.0 otherwise (Class 0).
There's an issue tracking the move of the Expression transform into ML.NET -- https://github.com/dotnet/machinelearning/issues/4015 (update: PR merged)
Before then, this can also be done using ML.NET's CustomMapping transform, though CustomMapping has a bit of overhead when productionizing the model.
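For example, here's a minimal sketch of the same 10x weighting via CustomMapping (the class and column names are illustrative, not from the post above):

```csharp
using Microsoft.ML;

// Input/output row shapes for the mapping (illustrative names).
public class LabelInput
{
    public bool Label { get; set; }
}

public class WeightOutput
{
    public float Weight { get; set; }
}

// ...
var mlContext = new MLContext();

// Up-weight the positive class 10x, mirroring the Expression Transform above.
var weighting = mlContext.Transforms.CustomMapping<LabelInput, WeightOutput>(
    (input, output) => output.Weight = input.Label ? 10.0f : 1.0f,
    contractName: "ClassWeighting");
```

The resulting Weight column is then passed to the trainer via its exampleWeightColumnName parameter.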
I'm not sure the functionality is exposed in ML.NET.
I don't see the generate number transform exposed in ML.NET, which is a step we use in the internal version for downsampling. The generate number transform creates a stable random number per row, and the range filter transform then drops rows outside a chosen range. The stability lets all passes over the dataset include the same subset; otherwise each pass of the data would drop a different subset (for better/worse).
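Until it's exposed, here's a rough sketch of the same pattern in ML.NET, assuming each row carries a stable string Id column, and an mlContext/trainData from earlier (the column name and hash helper are my own, not a library API):

```csharp
using System;
using Microsoft.ML;

public class IdInput
{
    public string Id { get; set; }
}

public class RandOutput
{
    public float Rand { get; set; }
}

// ...
// Deterministic FNV-1a hash mapped to [0, 1); unlike string.GetHashCode(),
// it's stable across runs, so every pass keeps the same subset of rows.
static float StableHash01(string s)
{
    uint h = 2166136261;
    foreach (char c in s)
        h = (h ^ c) * 16777619;
    return h / (float)uint.MaxValue;
}

var addRand = mlContext.Transforms.CustomMapping<IdInput, RandOutput>(
    (input, output) => output.Rand = StableHash01(input.Id),
    contractName: "StableRand");

var withRand = addRand.Fit(trainData).Transform(trainData);

// Keep roughly 10% of rows by filtering on the stable random column.
var downsampled = mlContext.Data.FilterRowsByColumn(
    withRand, "Rand", lowerBound: 0.0, upperBound: 0.1);
```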
I don't see the ungroup transform exposed in ML.NET, which can be used for upsampling.
For up/down sampling, we'll have to take care to ensure the up/down sampling is (1) not included as part of the trained model, and (2) done after the train/test or CV split, and only on the training set.
Why? For (1), otherwise we would delete/add rows just before predicting on them; for (2), otherwise the metrics won't be comparable, and for upsampling you leak information between splits.
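As an illustration of that ordering (variable names are mine; the split comes first, and the weighting/sampling is fit only on the training half):

```csharp
// Split first, then weight or sample only the training set.
var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);

// e.g. the CustomMapping weighting sketched earlier, fit on the train set
// only; the test set is left untouched so metrics stay comparable.
var weightedTrain = weighting.Fit(split.TrainSet).Transform(split.TrainSet);
var testSet = split.TestSet;
```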
Often the best method is simply collecting more data. This applies when your dataset is small.
Not available in ML.NET
Users can try alternate learners; ML.NET's AutoML explores this for the user, and users can also try manually.
For multi-class datasets, small classes can be joined into an "other" class for training. We will then get all of these rows wrong, since the actual small class is never predicted, but this can be better overall.
A variant of this is a multi-step model, as sketched below. We train a secondary model to predict between the various small classes; when the primary model guesses "other", we call the secondary model to decide which "other" it is. This allows the secondary model to be trained on classes with a similar number of rows each.
We use a similar pattern internally for some models where we train two models: large classes against large classes, and small against small.
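A hypothetical sketch of that cascade at prediction time (the ModelInput/ModelOutput types and the "Other" label are illustrative):

```csharp
using Microsoft.ML;

public class ModelInput
{
    // ... feature columns ...
}

public class ModelOutput
{
    public string PredictedLabel { get; set; }
}

// primaryEngine: trained with the small classes collapsed into "Other".
// secondaryEngine: trained only on rows from the small classes.
static string PredictCascade(
    ModelInput row,
    PredictionEngine<ModelInput, ModelOutput> primaryEngine,
    PredictionEngine<ModelInput, ModelOutput> secondaryEngine)
{
    var primary = primaryEngine.Predict(row);
    if (primary.PredictedLabel != "Other")
        return primary.PredictedLabel;   // large class: trust the primary model

    // Primary guessed "Other": let the secondary model pick the small class.
    return secondaryEngine.Predict(row).PredictedLabel;
}
```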
We may want to create an AutoWeight transform in ML.NET, which handles the difficulties for the user. We can have it create a weight column, or up/down sample the classes.
There are many formulas we can use, though at times I generate an auto-weight column as the IDF value of the Label:
```
xf=Term{col=WeightKey:Label}
xf=CSharp{in=WeightKey out=WeightStr:TX code={O.WeightStr = (I.WeightKey == 0 ? "missing" : (I.WeightKey - 1).ToString());}}
xf=WordBagTransform{col=WeightVec:WeightStr tok=WordTokenizeTransform weighting=Idf}
xf=CSharp{in=WeightVec out=Weight:R4 code={O.Weight = I.WeightVec.Sum();}}
```
This up-weights small classes, but only to ~ 1/ln(numInstances). Doing a direct 1/(numInstances) can lead to very small classes dominating.
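The same IDF-style weight can also be computed directly; a minimal sketch (my own helper, not the pipeline above), where weight(class) = ln(totalRows / rowsInClass):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// labels: one entry per training row.
static Dictionary<string, float> IdfClassWeights(IReadOnlyList<string> labels)
{
    int total = labels.Count;
    return labels
        .GroupBy(l => l)
        .ToDictionary(
            g => g.Key,
            // ln(total / classCount): small classes get larger weights,
            // but only logarithmically, so they can't dominate outright.
            g => (float)Math.Log((double)total / g.Count()));
}
```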
Handling the imbalance can cause your metrics to get worse. Generally you're causing the model to focus on the smaller classes, to the detriment of the large classes. This helps Macro-Accuracy, but generally decreases plain Accuracy.
Imbalanced data is not a problem in most cases. Our learners are generally OK with imbalance at 20:1. Generally I take action when the model demonstrates poor performance; the extreme example is a model that always guesses the majority class.
Choose your business metric first. For instance, if Accuracy is best aligned with your business goal, choose it. But for highly imbalanced datasets Accuracy isn't great for the data science iteration process (noisy, staircase-like, and dominated by the majority class). While you're iterating, watch a smooth metric that handles imbalance well, like AUPR or log-loss reduction (cross-entropy), then choose your final model on your business metric.
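For reference, a quick sketch of pulling those metrics from a binary classification evaluation (predictions is assumed to be scored test data):

```csharp
using System;
using Microsoft.ML;

// ...
// Evaluate returns calibrated metrics, including AUPR and log-loss.
var metrics = mlContext.BinaryClassification.Evaluate(predictions);

Console.WriteLine($"AUPR:               {metrics.AreaUnderPrecisionRecallCurve}");
Console.WriteLine($"Log-loss reduction: {metrics.LogLossReduction}");
Console.WriteLine($"Accuracy:           {metrics.Accuracy}");
```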
@justinormont We're dealing with a hugely imbalanced data set right now and I found this issue while searching the repo.
Thank you for putting so much detail into the post above. Unfortunately we are having trouble implementing the approaches mentioned above.
Context
Binary classification via field-aware factorization machine to classify whether a purchase will be made or not (recommendation engine). We're running into two kinds of imbalance in the dataset.
Things I've tried from above (and elsewhere):
Weights
I tried setting a weight property in my model and specifying that column name as the exampleWeightColumnName (like in this suggestion). I get an error using float as the datatype for the weight (Vector
I also don't see any examples of setting weights on a field-aware factorization machine anywhere in the repo, or even on the wider internet (my google-fu has failed), which seems unusual.
Sampling
Like you mentioned, there is no way to do this out of the box, or even manually, due to potential data leakage: cross-validation doesn't allow applying any transformers to the train set AFTER the splits.
Do you have any ideas on how to proceed, or concrete examples of setting the weight for a field-aware factorization machine?
Thanks!
Possibly related to #4396. @mayoatte can you share your code so I can take a look at why the weight approach is failing you?
Just wondering if there are new ways to handle imbalanced classification in ML.NET since the original post a year ago -- methods like SMOTE or ADASYN. If not, how best to handle them externally, without leaving .NET too much?
Upsampling is a common practice for unbalanced data sets. The best practice for upsampling is to upsample the training set AFTER splitting into train and test sets, because if you duplicate rows before splitting, you'll get identical rows in the train and test set. This is data leakage from the train to the test set, and if you overfit the training set it will be partially hidden by the identical rows in the test set pumping up the scores.
There is no Upsampling transformer, so far as I can find. If I use the built-in TrainTestSplit method, I get two IDataViews back. Now I can't upsample by adding/duplicating rows in the training set because IDataView is immutable.
So basically I have to load the text file myself, because if I use TextLoader I get an IDataView, which puts me in the same predicament. Next, I have to reproduce the functionality of TrainTestSplit() to split my original dataset into train and test sets myself. Then I have to upsample the training set myself, and remember to run both the train and test sets through the regular data pipeline I've created.
I can't see how the CustomMapping transform can be used to upsample. What would you suggest as the correct way to upsample a data set with ML.NET? A nice, convenient way, that is, not an arduous solution like the one above, which will just drive people back to Python instead.
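For what it's worth, one workaround under these constraints is to round-trip through an IEnumerable: a minimal sketch, assuming an illustrative row class SampleRow, which duplicates minority-class rows only after the split:

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Illustrative row class matching the dataset's schema.
public class SampleRow
{
    public bool Label { get; set; }

    [VectorType(10)]   // feature count is illustrative
    public float[] Features { get; set; }
}

// ...
var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);

// Materialize the (immutable) train IDataView as plain objects.
var trainRows = mlContext.Data
    .CreateEnumerable<SampleRow>(split.TrainSet, reuseRowObject: false)
    .ToList();

// Duplicate positive rows 5x in total, after the split, so no duplicated
// rows can leak into the test set.
var upsampled = trainRows
    .Concat(trainRows.Where(r => r.Label)
                     .SelectMany(r => Enumerable.Repeat(r, 4)))
    .ToList();

// Back to an IDataView for the training pipeline; the test set is untouched.
var upsampledTrain = mlContext.Data.LoadFromEnumerable(upsampled);
```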