aloysius-lim / bigrf

Random forests for R for large data sets, optimized with parallel tree-growing and disk-based memory

Very imbalanced data trained with weights show good OOB errors, but bad predictions. #5

Closed robertthebob closed 10 years ago

robertthebob commented 10 years ago

I have a very imbalanced data set: the minority class has about 80,000 samples and, to get the best information, the majority class has more than 2,500,000. The weights shown below achieve a good balance in TP and TN, about 4% for each. This is very good for this data, and the AUC is 0.987. However, when the prediction method is applied to samples that have not been through the trainer, the errors are far from the OOB errors: perhaps 5% for the minority class, but 60% for the majority.

This does not happen when I use balanced data.

forest25 <- bigrfc(x=BMdatb, y=FACTdatb, ntree=48L, varnlevels=varn, cachepath=NULL, yclasswts=c(17.0,1.0), printerrfreq=1L, trace=1)
save(forest25, file="/mnt/ssd/dp/bigrf2/forest.25.12jan2014.RData")
system("sh /home/bob/R/2013/dp/clearshm.sh")
summary(forest25); rm(forest25)

When this is repeated over and over and the resulting forests are merged, the OOB errors become very small.
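A minimal sketch of that repeat-and-merge loop, reusing the call above; it assumes bigrf's merge() method can combine two forests grown on the same data, which is the merge step Bob describes:

library(bigrf)

# Grow an initial forest with the weights under test.
forest <- bigrfc(x=BMdatb, y=FACTdatb, ntree=48L, varnlevels=varn,
                 cachepath=NULL, yclasswts=c(17.0, 1.0))

# Repeatedly grow further 48-tree forests and merge them in.
for (i in 1:5) {
  batch <- bigrfc(x=BMdatb, y=FACTdatb, ntree=48L, varnlevels=varn,
                  cachepath=NULL, yclasswts=c(17.0, 1.0))
  forest <- merge(forest, batch)  # OOB error shrinks as trees accumulate
}

summary(forest)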

pred <- predict(forest25, testdata, testfactors, trace=1L)

This gives predictions with large errors.

aloysius-lim commented 10 years ago

Thank you for the feedback. I have used bigrf on very imbalanced data as well, and also found that errors on the test set were much higher than the OOB errors on the training set. Prediction accuracy is difficult to achieve for most highly unbalanced data sets, where we are trying to identify the very small minority of records that are different from the rest. Have you tried other algorithms, and have they produced better results on the test data?

robertthebob commented 10 years ago

Aloysius,

I've been running experiments showing that the OOB error estimates are very biased and inaccurate when training on weighted, unbalanced data. They cannot be used as a reliable guide to good predictions. It is necessary to iterate over a few guessed weights, using the results of predict() on samples excluded from the training set, to find the best weights. I'm still iterating, but it appears that I may have good results.

I'm experimenting with other methods now. Thanks so much for your response.

Bob


aloysius-lim commented 10 years ago

Thanks for the feedback, Bob. You are right, very unbalanced data can throw off the random forest algorithm. With such data, it is best to use a typical train-validate-test setup, where the validation data set is used to choose the best weights.
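A minimal sketch of that setup, with hypothetical data-set names (train_x/train_y for training, valid_x/valid_y for validation) and an illustrative weight grid; the exact way to extract predicted classes from a bigrf prediction object may differ by version:

library(bigrf)

candidates <- list(c(10, 1), c(17, 1), c(25, 1))  # guessed weights to try
best_wts <- NULL
best_err <- Inf

for (w in candidates) {
  f <- bigrfc(x=train_x, y=train_y, ntree=48L, yclasswts=w)
  p <- predict(f, valid_x, valid_y)            # held-out validation set, not OOB
  err <- sum(p != valid_y) / length(valid_y)   # class-extraction accessor may vary
  if (err < best_err) { best_err <- err; best_wts <- w }
}

# Report performance once, on a test set untouched by this search.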

Mahendra1980 commented 9 years ago

Late comment :) but I would like to highlight that even if the classes are highly imbalanced, randomForest works very well. The only thing we need to do is importance sampling, using sampsize in the base randomForest package.

I assume yclasswts=c(17.0,1.0) is the same thing; however, I'm not clear on the 17.0 and 1.0 values?

aloysius-lim commented 9 years ago

No, yclasswts in bigrf() is equivalent to classwt in randomForest(). There is no equivalent of sampsize in bigrf(); you have to select the sample before passing the data to bigrf() if you want to balance the class probabilities in the training set.

They are slightly different. sampsize, or pre-sampling, is used to balance the distribution of classes in the training data, so only a subset of data is ever seen by the algorithm. yclasswts or classwt, on the other hand, adjusts the votes of the individual decision trees, to affect the outcome—even when the algorithm is given the full, unbalanced training set. You could use both together, but be careful about overfitting.
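For concreteness, here is the contrast in base randomForest on a toy data set (the sizes and weights are arbitrary):

library(randomForest)

set.seed(1)
x <- data.frame(a = rnorm(3000), b = rnorm(3000))
y <- factor(c(rep(0, 2500), rep(1, 500)))  # 5:1 class imbalance

# (1) Pre-sampling: each tree is grown on a balanced, down-sampled
#     bootstrap, so it never sees most of the majority class.
rf_sampled <- randomForest(x, y, strata = y, sampsize = c(400, 400))

# (2) Class weights: every tree sees the full unbalanced data, but the
#     class priors are tilted toward the minority class.
rf_weighted <- randomForest(x, y, classwt = c(1, 5))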

Mahendra1980 commented 9 years ago

Thanks for the clarification. I was wondering about my data, whose class counts are:

     0      1
 69670 891828

How do I put the class weight in favor of 0? Let's say 70% towards 0 and 30% towards 1: yclasswts=c(70,30). Did I get it right, or else please advise.

Thanks

aloysius-lim commented 9 years ago

Yes, that's correct.
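With hypothetical data names, that call would look like the sketch below; presumably only the ratio between the two weights matters:

forest <- bigrfc(x = train_x, y = train_y, ntree = 48L, yclasswts = c(70, 30))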

Mahendra1980 commented 9 years ago

Thank you very much. Any plans to include sampsize in the near future? It would be of great help.