about the unbalanced data

GoogleCodeExporter commented 9 years ago

Hi, abhirana.

In my case, I have a very unbalanced dataset. For training data, we have 300 
'+1' and 300 '-1' instances. But the testing data is very unbalanced: it has 
141 '+1' and 18000 '-1' instances.

I also note that 'extra_options.classwt' and 'extra_options.cutoff' are 
related to my case. I guess that 'classwt' adjusts the prior probability 
of the class labels and 'cutoff' controls the vote in the final stage to 
make a decision.

So in my case, which parameter do I need to adjust, or both of them?

Kindly guide me, please.

Thanks.

Original issue reported on code.google.com by zhangleu...@gmail.com on 16 Jan 2013 at 8:47

GoogleCodeExporter commented 9 years ago

Hi

i actually have an issue with the data distribution.
the distribution of examples in training and test is assumed to be similar 
which it doesn't seem like in your case. 

classwt is usually useful if you are trying to get misclassification down for 
one class like in classifying a cancer class with a non-cancer class where 
false negative is way worse than false positive. whereas cutoff is useful if 
you want to tweak the probability of the classes.

tweaking via cutoff or classwt may be useful if you have skewed training 
distribution, but in your case you have a skewed test distribution.
i think you should look into cutoff/classwt only if you are getting a bad 
test error rate (not the absolute accuracy, but precision/recall or some 
way to normalize the test error based on how many examples there are in each 
class). maybe the 300/300 split is good enough for a very good test accuracy.

Original comment by abhirana on 17 Jan 2013 at 5:48

GoogleCodeExporter commented 9 years ago

Yes. For test accuracy, it may be good. An extreme case is a classifier that 
predicts all my test samples as '-1'; then the accuracy will be very high. So do 
I need to use some sampling method to get a skewed training dataset?

Original comment by zhangleu...@gmail.com on 17 Jan 2013 at 5:54

GoogleCodeExporter commented 9 years ago

sure a classifier can predict all test samples as -1, but then in cases like 
this instead of using accuracy you can use precision recall or a normalized 
accuracy based on the number of examples in each class and that will reflect 
the skewness in the test data.

are you saying that you want to sample your training examples so that RF sees 
training data distributed like your test examples? you can tweak cutoff to change the 
probabilities but i would strongly advise against that before checking whether 
your existing data is already giving you nice precision/recall values.

Original comment by abhirana on 17 Jan 2013 at 6:00
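
To make the point about accuracy concrete, here is a minimal Python sketch (not part of randomforest-matlab; the helper name `summarize` is made up for illustration). It scores a classifier that predicts '-1' for every sample on the 141/18000 test set described in this thread, using plain accuracy, the normalized accuracy (mean of the per-class accuracies), precision and recall.

```python
# Test set sizes quoted in this thread: 141 '+1' and 18000 '-1' instances.
N_POS, N_NEG = 141, 18000


def summarize(tp, fn, tn, fp):
    """Return (accuracy, normalized accuracy, precision, recall) from counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    recall = tp / (tp + fn)        # accuracy on the '+1' class
    specificity = tn / (tn + fp)   # accuracy on the '-1' class
    normalized = (recall + specificity) / 2
    # Convention used here: precision is reported as 0 when nothing is predicted '+1'.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    return accuracy, normalized, precision, recall


# A classifier that labels every test sample '-1'.
all_negative = summarize(tp=0, fn=N_POS, tn=N_NEG, fp=0)
print("accuracy=%.4f normalized=%.4f precision=%.4f recall=%.4f" % all_negative)
```

Plain accuracy comes out above 99% for this useless classifier, while the normalized accuracy is 0.5 and recall is 0, which is why the class-normalized measures are the ones to watch here.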

GoogleCodeExporter commented 9 years ago

Yes, I mean that. But if the training data is balanced, then there are many 
FP (false positives) on the testing data, which means I will get a poor 
specificity score.

Original comment by zhangleu...@gmail.com on 17 Jan 2013 at 6:04
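
A rough Python sketch of why a balanced training set can look fine and still produce many false positives on the skewed test set. The 0.9 sensitivity and specificity are hypothetical numbers chosen only for illustration, and the sketch assumes the per-class rates carry over unchanged from the balanced data to the test data.

```python
def expected_precision(n_pos, n_neg, sensitivity, specificity):
    """Expected precision when per-class rates stay fixed and only the class mix changes."""
    tp = sensitivity * n_pos            # expected true positives
    fp = (1.0 - specificity) * n_neg    # expected false positives
    return tp / (tp + fp)


# Hypothetical per-class rates, not measured on any real model.
SENS, SPEC = 0.9, 0.9

balanced = expected_precision(300, 300, SENS, SPEC)    # class mix of the training data
skewed = expected_precision(141, 18000, SENS, SPEC)    # class mix of the test data
print("precision on 300/300: %.3f, on 141/18000: %.3f" % (balanced, skewed))
```

With the same per-class behaviour, the false positives from the 18000 negatives swamp the true positives from the 141 positives, so precision collapses even though nothing about the classifier changed.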

GoogleCodeExporter commented 9 years ago

yeh both classwt and cutoff can be used

classwt can be tweaked too, so that the smaller class is classified correctly 
as much as possible: mispredicting the smaller class then has a 
higher penalty

you can also try changing cutoff. i am guessing you will be evaluating the 
normalized accuracy? (accuracy on class 1 + accuracy on class 2) / 2

Original comment by abhirana on 17 Jan 2013 at 6:27

GoogleCodeExporter commented 9 years ago

No, I am evaluating the MCC (Matthews correlation coefficient): 
http://en.wikipedia.org/wiki/Matthews_correlation_coefficient

It is very difficult to improve. The MCC shows that the normal RF is only better 
than a random guess.

Original comment by zhangleu...@gmail.com on 17 Jan 2013 at 6:33
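
For readers who have not met the MCC before, a minimal Python sketch of the standard formula from confusion-matrix counts. The function name `mcc` and the example counts are illustrative only; returning 0 when the denominator is zero is a common convention, not something stated in this thread.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a whole row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


# Everything predicted '-1' on the 141/18000 test set.
print(mcc(tp=0, fp=0, tn=18000, fn=141))

# Hypothetical classifier that gets about 90% of each class right.
print(mcc(tp=127, fp=1800, tn=16200, fn=14))
```

Unlike plain accuracy, the all-negative predictor scores 0 here, and the hypothetical 90%-per-class classifier still gets a modest MCC because of the large number of false positives.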

GoogleCodeExporter commented 9 years ago

hmm, cool. good to know something new :)

do tell me if you make headway using cutoff/classwt; your problem space is 
hard but very relevant.

Original comment by abhirana on 17 Jan 2013 at 6:40

GoogleCodeExporter commented 9 years ago

Hmm. If I modify classwt/cutoff, the MCC improves a little. But what puzzles 
me is that there are fewer positive labels after the 
prediction.

Original comment by zhangleu...@gmail.com on 17 Jan 2013 at 6:44

GoogleCodeExporter commented 9 years ago

that is true, there is no free lunch.

Original comment by abhirana on 17 Jan 2013 at 6:52

GoogleCodeExporter commented 9 years ago

Hi, abhirana.
I still remember there is an RF package which can calculate the proximity 
between the training and testing data, something like:
extra_options.proximity=1;
model=classRF_train(X_trn,Y_trn,ntree,mtry,extra_options,X_tst,Y_tst);

But I cannot find this package on your webpage. Is it still available?

Original comment by zhangleu...@gmail.com on 18 Jan 2013 at 2:18

GoogleCodeExporter commented 9 years ago

it's still there, i think you will have to sync with the source (i might have 
generated a package in one of the issues; if you want a precompiled package 
just tell me)

tutorial file
http://code.google.com/p/randomforest-matlab/source/browse/trunk/RF_Class_C/tutorial_Proximity_training_test.m

i think you asked me about it before here
http://code.google.com/p/randomforest-matlab/issues/detail?id=44#c15

Original comment by abhirana on 21 Jan 2013 at 5:10

GoogleCodeExporter commented 9 years ago

Yes. I want to have a precompiled package for 64bit windows. I need the version 
that can calculate the proximity between the training and testing dataset. 

I also found that the latest package cannot do this. Maybe it was removed 
because it needs lots of memory?

Thanks.

Original comment by zhangleu...@gmail.com on 21 Jan 2013 at 8:02

GoogleCodeExporter commented 9 years ago

if you sync to the SVN source, you will also get the latest compiled mex files 
for both 32 bits and 64 bits

i just synced and the tutorial_Proximity_training_test.m works with that code.

yup, the proximity matrices will require space of about Ntrn^2*sizeof(double) + 
Ntst^2*sizeof(double). i guess you are passing too many examples perhaps?

Original comment by abhirana on 22 Jan 2013 at 9:58
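
A quick back-of-the-envelope check of that estimate in Python, taking the formula above at face value (it is given as approximate) and assuming 8-byte doubles. The sample counts are the ones mentioned at the start of this thread; the helper name is made up for illustration.

```python
BYTES_PER_DOUBLE = 8  # sizeof(double) on common platforms (assumption)


def proximity_bytes(n_trn, n_tst):
    """Approximate memory for the proximity matrices: Ntrn^2 + Ntst^2 doubles."""
    return (n_trn ** 2 + n_tst ** 2) * BYTES_PER_DOUBLE


n_trn = 300 + 300      # training instances quoted in this thread
n_tst = 141 + 18000    # test instances quoted in this thread
total = proximity_bytes(n_trn, n_tst)
print("%d bytes, about %.2f GiB" % (total, total / 2 ** 30))
```

With these counts the estimate is roughly 2.5 GiB, almost all of it from the test-set term, so subsampling the test set is the obvious lever if memory is the problem.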

GoogleCodeExporter commented 9 years ago

Hi, abhirana,

I am still not sure about the effect of 'classwt' and 'cutoff'. I guess 
'cutoff' only takes effect at the final stage of Random Forest, i.e., for the '-1' 
class we have 200 votes and 300 for the '+1' class, so the final result is '+1' 
if 'cutoff' is left at its default. If we set 'cutoff' to [3/4, 1/4], that means 
the first class needs far fewer votes to win. Is that true?

And how about classwt? What effect does it have on the Random Forest?

Original comment by zhangleu...@gmail.com on 31 Jan 2013 at 6:35

GoogleCodeExporter commented 9 years ago

yeh, cutoff will behave as you mentioned.

classwt does things differently: internally during *training*, instead of 
assigning the same misclassification penalty to all classes, the forest 
will try to reduce misclassification of the class whose penalty is higher. i 
used it in cases where getting a true positive about a class was way more 
important than getting a false positive.

Original comment by abhirana on 3 Feb 2013 at 10:31
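
The direction of `cutoff` is easy to get backwards, so it is worth checking against the package source. The Python sketch below is only an illustration and assumes the rule documented for R's randomForest: the winning class is the one with the largest ratio of vote fraction to cutoff. Whether randomforest-matlab follows the same rule is an assumption here, and `predict_with_cutoff` is a made-up helper, not a package function.

```python
def predict_with_cutoff(votes, cutoff, labels):
    """Pick the label whose (vote fraction / cutoff) ratio is largest.

    Assumes the rule documented for R's randomForest; not verified
    against the randomforest-matlab source.
    """
    total = sum(votes)
    ratios = [(v / total) / c for v, c in zip(votes, cutoff)]
    best = max(range(len(labels)), key=lambda i: ratios[i])
    return labels[best]


labels = [-1, 1]
votes = [200, 300]  # 200 trees vote '-1', 300 vote '+1', as in the example above

print(predict_with_cutoff(votes, [0.5, 0.5], labels))    # default-style cutoff
print(predict_with_cutoff(votes, [0.75, 0.25], labels))
print(predict_with_cutoff(votes, [0.25, 0.75], labels))
```

Under that assumed rule, a smaller cutoff entry lowers the share of votes a class needs: with [0.25, 0.75] the 200 votes for '-1' are enough to win, while [0.75, 0.25] leaves '+1' the winner. If the package behaves the other way round, the ratio in the sketch would need to be inverted.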

GoogleCodeExporter commented 9 years ago

You mean that during training, RF uses classwt to assign the misclassification 
penalty?
But as far as I know, RF trains a lot of CARTs. Each CART is fully grown without any pruning, and in each node it tries to find the best split variable using some criterion (such as Gini, info gain, etc.). At which step does this 'penalty' take place?

Original comment by zhangleu...@gmail.com on 3 Feb 2013 at 2:06

GoogleCodeExporter commented 9 years ago

i might be incorrect in saying it. 

classwt is used to influence where the split is made

http://code.google.com/p/randomforest-matlab/source/browse/trunk/RF_Class_C/src/classRF.cpp#371

it changes the number of examples present in each of the classes (by changing 
tclasspop) and that will influence the split later on. 

and even though the CARTs are fully grown without pruning, it may not mean that 
all the training examples are correctly classified. the tree size is 
dictated by nodesize/nrnodes, and though nodesize defaults to 1 for 
classification, nrnodes influences the depth of the forest; i have seen it 
restrict trees from growing to a large depth and prevent them from classifying all 
training data (training_data ~=0). if you want to check whether that's happening 
to your data you can probably look at the examples which were inbag and see the 
labels assigned to them.

Original comment by abhirana on 3 Feb 2013 at 6:29

GoogleCodeExporter commented 9 years ago

Can I understand it as follows?
For example, at node t we have 100 samples (50 '+1' and 50 '-1') to split. 
Say the split rule is Gini impurity, which means 
I(t)=Sigma[p(i|t)*p(j|t)], where t is the node and the sum runs over all 
pairs of classes i~=j. If we do not use classwt in this case, then 
p(-1|t)=p(1|t)=1/2.
But if we set classwt=[1,3], then it is as if we had 50 '-1' and 
150 '+1', so p(-1|t)=1/4 and p(1|t)=3/4?
Is that true?

Original comment by zhangleu...@gmail.com on 4 Feb 2013 at 2:07

GoogleCodeExporter commented 9 years ago

i guess that is how it will work. i dont remember if classwt was direct like 
you said or inverse, but yeah the probabilities will be skewed

Original comment by abhirana on 4 Feb 2013 at 9:45
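
The worked example above can be checked with a few lines of Python. This is a sketch of the arithmetic only: it assumes classwt multiplies the class counts directly, as in the question, whereas the package may apply the weights in the inverse direction as noted in the reply. The function `weighted_gini` is illustrative and not taken from the package.

```python
def weighted_gini(counts, classwt=None):
    """Gini impurity I(t) = sum over i != j of p(i|t) * p(j|t).

    Class counts are scaled by classwt before the probabilities are formed
    (assumption: weights act directly on the counts).
    """
    if classwt is None:
        classwt = [1.0] * len(counts)
    weighted = [n * w for n, w in zip(counts, classwt)]
    total = sum(weighted)
    probs = [w / total for w in weighted]
    k = len(probs)
    impurity = sum(probs[i] * probs[j] for i in range(k) for j in range(k) if i != j)
    return impurity, probs


# Node with 50 '-1' and 50 '+1' samples, classes ordered [-1, +1].
print(weighted_gini([50, 50]))                # no class weights
print(weighted_gini([50, 50], [1.0, 3.0]))    # classwt = [1, 3]
```

Without weights the node looks maximally impure; with classwt=[1,3] the same node is treated as if it were 1/4 versus 3/4, so candidate splits are scored differently even though the samples are unchanged.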

GoogleCodeExporter commented 9 years ago

Ok. Thanks.
By the way, do you know why Random Forest does not need to prune each CART?

I think we do need to prune the tree in the CART algorithm. But when it comes to 
RF, why does it become unnecessary?

Original comment by zhangleu...@gmail.com on 4 Feb 2013 at 1:34

GoogleCodeExporter commented 9 years ago

RF trees are more unstable than bagging trees (due to mtry<<D), and each RF 
tree is different from other RF trees due to bootstrapping (which in turn is what 
makes bagging trees differ from plain CART trees)

i think there is an argument that pruning is required to reduce the overfit (by 
reducing the variance of the trees) but as RF trees primarily have a low bias (due 
to inclusion of mtry<<D) and slightly high variance and RF forest is a low bias 
and low variance classifier (due to the properties of the ensemble trees), 
pruning the tree wouldn't reduce it further and may even increase the 
variance (the probability of two small trees giving very similar answers or 
having the same splits is higher than for larger versions of those trees). 
empirically, pruning RF trees doesn't seem to help either.

Original comment by abhirana on 4 Feb 2013 at 7:59

GoogleCodeExporter commented 9 years ago

That should be the reason. Thanks a lot.

Original comment by zhangleu...@gmail.com on 5 Feb 2013 at 1:26

GoogleCodeExporter commented 9 years ago

Hi Zhang

Do you know if the relationship between classwt and population is direct or inverse? 
I mean, if 10% of the samples are class (-1) and 90% of the samples are class 
(+1), and the misclassification of class (-1) is more expensive, how do you 
create the classwt vector to feed the prior knowledge of the population into 
random forest? Do you create classwt as [0.1 0.9], corresponding to labels 
[-1 1], or do you create it as [0.9 0.1]?

Thanks

Original comment by m.saleh....@gmail.com on 23 Jul 2013 at 6:44

GoogleCodeExporter commented 9 years ago

Hi, it has been a long time since I last worked on RF. I think in your 
case it should be [0.9 0.1]. You can also check the normClassWt() function 
in src/rfutils.cpp.

Original comment by zhangleu...@gmail.com on 24 Jul 2013 at 1:22

GoogleCodeExporter commented 9 years ago

Hi,

I have a balanced training set and the cross-validation error is very low. But 
my test data is heavily skewed, as mentioned by Zhang previously. There's only 1 
+ve among 9216 values. I know this is very, very heavily skewed, but that is the 
nature of the data in my case. So, what should the values of classwt and 
cutoff be? (Note that my training set is balanced!)
As the labels are [-1 1], should classwt be [.9 .1] and cutoff be [.1 
.9]? (From the discussion between Abhirana and Zhang, this is what I 
understood. Please correct me.)

Original comment by sharathc...@gmail.com on 31 Oct 2013 at 6:26

GoogleCodeExporter commented 9 years ago

@sharathchandra92

i am a bit confused. training and test set should ideally have similar class 
probabilities, which is not so in your case. what is your end goal on the test set? 
making sure that the 1 +ve is always classified correctly (i.e. +ve has a higher 
misclassification cost), or is it high accuracy?

anyways,
you should try one of them at a time before trying both of them simultaneously. 
you can look at individual class oob error for the effects. classwt will make 
RF train harder on the class (i'll start with this) whereas cutoff just tunes 
the proportions of votes required to win.

for both classwt and cutoff vectors, higher values mean that it's easier for 
the corresponding class to win compared to the other classes.

Original comment by abhirana on 31 Oct 2013 at 7:47

GoogleCodeExporter commented 9 years ago

Yeah actually they should, but in this case, the problem is slightly tricky in 
its formulation. 

My goal is to minimize any false positives and make sure that +ve is correctly 
classified. Essentially, I have only 1 +ve out of 9216 samples in my test, so I 
cannot miss this +ve and at the same time, some false positives are ok (but not 
more than 5-6)! I am aiming for higher accuracy on the test data, which means 
+ve has higher misclassification cost in training. Am I correct on this? 

I have seen the oob error rates, the figures are attached:
Figure 1
extra_options.classwt=[.05 .95];
extra_options.cutoff=[.01 .99];
model = classRF_train(X_trn,Y_trn,1000,10,extra_options);

Figure 2
extra_options.classwt=[.01 .99];
extra_options.cutoff=[.01 .99];
model = classRF_train(X_trn,Y_trn,2000,7,extra_options);

Figure 4
extra_options.classwt=[.1 .9];
extra_options.cutoff=[.01 .99];
model = classRF_train(X_trn,Y_trn,2500,100,extra_options);

Original comment by sharathc...@gmail.com on 1 Nov 2013 at 12:00

GoogleCodeExporter commented 9 years ago

You should probably try to tune classwt (first and foremost). cutoff tends to 
cause too much variation (let's say you set cutoff to 10 - 90, that means that 
-1 will require 9 times more votes to win compared to +1, whereas 50-50 means 
that -1 will require an equal number of votes to win compared to +1). just want 
to make sure you can decouple the effects of both the factors (classwt-cutoff).

also try to look at the per class ooberr rather than the overall ooberr (the other 
columns of the ooberr give the per class ooberr), and compare that per class 
ooberr with and without classwt. You could plot the change in per class ooberr 
for various classwt proportions

Original comment by abhirana on 1 Nov 2013 at 12:13

GoogleCodeExporter commented 9 years ago

Yes. I have conducted experiments with multiple settings of classwt and cutoff, 
both individually and together.

There are seemingly contradictory observations! 

For Class 2:
Classwts: 
.2 .8 - Error is going down till .0125
.4 .6 - Down Till 0.0125
.05 .95 - Shooting up till 0.9
.7 .3 - Going down till .01

So, it is kind of confusing which value to take! 

CutOff
.7 .3 - Down till 0.005
.3 .7 - Down till 0.025
.1 .9 - Down till 0.08

Coupling both (classwt, then cutoff)
classwt .001 .999, cutoff .999 .001 - Down to 0.
classwt .1 .9, cutoff .1 .9 - Down to .09
classwt .3 .7, cutoff .3 .7 - Down to .01
classwt .05 .95, cutoff .01 .99 - Shooting up to .55
classwt .999 .001, cutoff .001 .999 - Shooting up to .75

It is a little ambiguous which setting to go for! My concern is this: I am 
trying to predict which point in an image is the keypoint. The image has 
9216 pixels and 1 of them is the +ve point in the test data! So I must get 
this one right. At the same time, there is a possibility that certain 
neighbouring points will have similar features and might be tagged as keypoints, 
which is OK. Under these circumstances, should I just see which setting gives 
me the lowest class 2 error?

I have attached the graphs in rar file. Sorry wherever it says class3 - that's 
class 2 actually (typo).

Another question I had was, can this implementation do multi class regression?

Original comment by sharathc...@gmail.com on 1 Nov 2013 at 5:49

Attachments:

RF_Figures.rar

GoogleCodeExporter commented 9 years ago

ok. i would use only classwt. i see too much variation in examples using cutoff 
(either as a single factor or when coupling). For those examples, it's unnatural 
to see a single tree (or <10 trees) having low ooberr, then the ooberr increasing 
between 10-100 trees and then decreasing (ooberr ideally should decrease 
monotonically).

can i ask what the current results are on the test dataset without using either 
cutoff or classwt (can it correctly classify the single +ve example)? and can i 
ask what the final goal is, to get a good ooberr or a good tsterr?

yes, it can do multi class classification.

Original comment by abhirana on 2 Nov 2013 at 12:12

GoogleCodeExporter commented 9 years ago

Yes, that is true, ooberr is fluctuating heavily when cutoff is used. 

Currently, without using either cutoff or classwt, I am getting a lot of false 
positives. The final goal is to get a good testerr. The basic issue is the 
skewed nature of the test set. I am trying to see if I can do something about 
decreasing it by changing the features. 

And I was curious if RF can do multi-output regression. I guess you misread it 
as classification. Can you please clarify?

Original comment by sharathc...@gmail.com on 2 Nov 2013 at 12:34
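
To put the false-positive requirement from this exchange into numbers, a small Python sketch. The figures (9216 pixels, 1 positive, at most about 6 false positives) come from the comments above; the 1% error rate in the last line is a hypothetical value for illustration, and the variable names are made up.

```python
N_PIXELS = 9216   # samples per test image, from this thread
N_POS = 1         # the single keypoint
MAX_FP = 6        # "not more than 5-6" false positives

n_neg = N_PIXELS - N_POS
max_fpr = MAX_FP / n_neg            # largest tolerable false positive rate
min_specificity = 1.0 - max_fpr     # required accuracy on the '-1' class


def expected_false_positives(neg_class_error_rate):
    """Expected number of '-1' pixels flagged as '+1' at a given per-class error rate."""
    return neg_class_error_rate * n_neg


print("max FPR: %.5f, required specificity: %.5f" % (max_fpr, min_specificity))
print("false positives at a 1%% error rate on '-1': %.1f" % expected_false_positives(0.01))
```

Even a 1% error rate on the '-1' class would flag around 90 pixels per image, so the per-class error on '-1' has to be well below 0.1% to stay within the stated budget.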

GoogleCodeExporter commented 9 years ago

this RF package cannot perform multi-output regression.

Original comment by abhirana on 2 Nov 2013 at 6:56

etrigger / randomforest-matlab

about the unbalanced data #55