Closed GoogleCodeExporter closed 8 years ago
I want to use 10-fold cross-validation with your RandomForest, but I am not sure whether your program supports it.
Original comment by crazy...@126.com
on 28 Feb 2010 at 12:02
Comment 1:
This package works the way any MATLAB function/variable does.
Say model_RF = classRF_train().
Then you can save model_RF to a file via save in MATLAB,
load it later, and use it again with classRF_predict().
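The save/load round trip described above can be sketched as follows. This is a minimal illustration, assuming X_trn, Y_trn and X_tst already exist in the workspace; the file name model_RF.mat is a placeholder, not something the package requires.

```matlab
% Train once and keep the model on disk (file name is a placeholder).
model_RF = classRF_train(X_trn, Y_trn);
save('model_RF.mat', 'model_RF');

% Later, possibly in a new MATLAB session: load the model and reuse it.
clear model_RF
S = load('model_RF.mat');            % S is a struct with field model_RF
Y_hat = classRF_predict(X_tst, S.model_RF);
```

Loading into a struct (S) rather than straight into the workspace avoids silently overwriting a variable that happens to have the same name.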
Comment 2:
RF computes an out-of-bag (OOB) error estimate while it trains, which acts as a built-in check against overfitting:
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr
I think the most you will get out of CV is an error rate.
You should be able to do something like CV simply in MATLAB:
% Assume X is your data (one example per row) and Y the labels.
% Note: each experiment draws a fresh random 90/10 split, so this
% approximates 10-fold CV rather than using 10 disjoint folds.
N = size(X,1);                            % number of examples
num_random_exp = 100;                     % number of experiments to do
type_of_CV = 10;                          % for 10-fold
n_trn = floor(N - N/type_of_CV);          % training-set size per split
results_array = zeros(num_random_exp,1);  % error rate of each experiment
for i = 1:num_random_exp
    indices = randperm(N);                % shuffle the indices
    train_indices = indices(1:n_trn);
    test_indices = indices(n_trn+1:end);
    % training set
    X_trn = X(train_indices,:);
    Y_trn = Y(train_indices);
    % test set
    X_tst = X(test_indices,:);
    Y_tst = Y(test_indices);
    model_RF = classRF_train(X_trn,Y_trn);
    Y_hat = classRF_predict(X_tst,model_RF);
    % fraction of misclassified test examples (a rate, not a count)
    results_array(i) = sum(Y_hat(:) ~= Y_tst(:)) / numel(Y_tst);
end
error_rate = mean(results_array)
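The loop above draws a new random split on every pass, so some examples may be tested several times and others never. If a strict 10-fold scheme is wanted, where every example is tested exactly once, it might look like the sketch below. It assumes the same X and Y as above; the names K, fold_id and fold_err are introduced here only for illustration.

```matlab
% Strict K-fold CV sketch: every example lands in exactly one test fold.
K = 10;
N = size(X, 1);
Y = Y(:);                                 % make sure labels are a column
perm = randperm(N);                       % shuffle once, up front
fold_id = zeros(N, 1);
fold_id(perm) = mod(0:N-1, K) + 1;        % fold label 1..K for each example
fold_err = zeros(K, 1);
for k = 1:K
    tst = (fold_id == k);                 % logical mask of the test fold
    trn = ~tst;                           % everything else is training data
    model_RF = classRF_train(X(trn, :), Y(trn));
    Y_hat = classRF_predict(X(tst, :), model_RF);
    fold_err(k) = mean(Y_hat(:) ~= Y(tst));
end
cv_error_rate = mean(fold_err)
```

Fold sizes differ by at most one example, so averaging the per-fold rates is close to the pooled error rate over all N examples.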
Original comment by abhirana
on 28 Feb 2010 at 12:22
About Comment 2: I read about the OOB error, but in my test the same data in Weka 3.7 with 10-fold CV training gives an error rate of only 0.2, while with your program I get an error rate of 0.51 without 10-fold CV.
Original comment by crazy...@126.com
on 28 Feb 2010 at 1:24
Are you sure that the number of trees and mtry are the same? Sometimes 1000 trees is the minimum you need.
The results also depend on whether Weka is treating some of the data as categorical; in its current form this program requires you to declare categorical variables explicitly.
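To make such a comparison like for like, the two settings can be fixed explicitly on this side. The sketch below assumes the classRF_train(X, Y, ntree, mtry) argument order used in the package tutorial, and the values shown are placeholders rather than recommendations. On the Weka side, the corresponding RandomForest options in 3.7 should be -I (number of trees) and -K (number of features per split).

```matlab
% Fix the forest size and mtry explicitly so both tools use the same values.
ntree = 1000;    % number of trees (placeholder; match Weka's -I)
mtry  = 6;       % features tried at each split (placeholder; match Weka's -K)
model_RF = classRF_train(X_trn, Y_trn, ntree, mtry);
Y_hat = classRF_predict(X_tst, model_RF);
error_rate = sum(Y_hat(:) ~= Y_tst(:)) / numel(Y_tst)
```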
Original comment by abhirana
on 28 Feb 2010 at 1:26
Abhirana, thanks for your help. I am testing it and will report back to you.
Original comment by crazy...@126.com
on 28 Feb 2010 at 1:36
I used 10-fold CV with trees = 100, mtry = 6 and got error_rate = 0.238068, a little higher than Weka.
Under Weka: trees = 10, mtry = 6, error_rate = 0.1932.
Original comment by crazy...@126.com
on 28 Feb 2010 at 8:53
Do not draw conclusions from fewer than 1000 trees (or at least 500 if your data is large).
Random forests depend heavily on random numbers, and you often need at least 1000 trees before the forest stabilizes.
Original comment by abhirana
on 28 Feb 2010 at 9:00
My test data has more than 20000 examples, so if I use trees = 1000 it will run very slowly. I will try it, but I am afraid it may produce an "Out of memory" error in MATLAB.
Original comment by crazy...@126.com
on 28 Feb 2010 at 9:09
Oh, it might run slowly, but I have routinely used large files, around 30000+ data points and 50 dimensions, on a reasonable machine.
Try a short example: gradually increase the number of trees and plot the OOB rate.
If the OOB rate still seems to go down as you increase the number of trees, your forest has not yet stabilized. After a while the OOB rate will level off, and that is the minimum number of trees required to get a decent, stable answer.
See example 16 in the tutorial. The dataset is simple, and around 100 trees suffice to reach a steady OOB rate.
% example 16: getting the OOB rate. model will have errtr, whose first
% column is the OOB rate, the second column is for the 1st class, and
% so on
model = classRF_train(X_trn,Y_trn);
Y_hat = classRF_predict(X_tst,model);
fprintf('\nexample 16: error rate %f\n', ...
    length(find(Y_hat~=Y_tst))/length(Y_tst));
figure('Name','OOB error rate');
plot(model.errtr(:,1)); title('OOB error rate');
xlabel('iteration (# trees)'); ylabel('OOB error rate');
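Instead of judging the plot by eye, the point where the OOB curve levels off can be estimated with a simple rule of thumb. The sketch below is only one possible heuristic: it assumes the third argument of classRF_train is the number of trees (as in the package tutorial), and win and tol are hypothetical thresholds chosen here for illustration.

```matlab
% Train one large forest and look for the point where the OOB curve flattens.
ntree = 500;                         % upper bound on trees to inspect
model = classRF_train(X_trn, Y_trn, ntree);
oob = model.errtr(:, 1);             % OOB rate after 1, 2, ..., ntree trees

win = 50;                            % hypothetical window length (in trees)
tol = 0.005;                         % hypothetical allowed spread in the window
stable_at = NaN;                     % stays NaN if the curve never flattens
for t = win:numel(oob)
    w = oob(t-win+1:t);
    if max(w) - min(w) < tol         % curve moved less than tol over win trees
        stable_at = t;
        break;
    end
end
fprintf('OOB rate looks stable after about %d trees\n', stable_at);
```

If stable_at is still NaN at the end, the forest has probably not stabilized within ntree trees and a larger forest is worth trying.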
Original comment by abhirana
on 28 Feb 2010 at 9:23
Hi, abhirana. I trained on my dataset with trees = 500 and got an error rate of 0.003, but when I use this model to test another dataset I get an error rate of 0.240596. I have uploaded an OOB rate plot for you.
When I use trees = 1000, an "Out of memory" error occurs, and I can't solve it. Do you have any good ideas? My training dataset is 25000x42.
Original comment by crazy...@126.com
on 28 Feb 2010 at 3:59
Well, if you are getting worse error rates on a different dataset, it might just mean that the two datasets are a bit different.
Hmm, 500 trees should be good enough, but if you want a larger number of trees, you should look into getting a 64-bit OS with 64-bit MATLAB. I think you are using a 32-bit OS, which limits a process to around 2 GB of memory. http://support.microsoft.com/kb/555223
Original comment by abhirana
on 28 Feb 2010 at 8:21
That's right, I am using a 32-bit OS. I want to use PCA or ICA to reduce the dimensionality of the datasets; maybe that will give a good result.
Original comment by crazy...@126.com
on 1 Mar 2010 at 12:55
Well, I do not think that you have too many dimensions (so PCA and ICA might just remove important dimensions). 40 is a reasonable number, and you have lots and lots of examples. Dimensionality reduction usually helps when you have many dimensions but not that many examples in comparison.
Do try out an SVM (such as the http://www.csie.ntu.edu.tw/~cjlin/libsvm/ or http://svmlight.joachims.org/ toolbox) and see if it helps in your case. That will give you a baseline accuracy to compare against RandomF (try both linear and non-linear SVMs and note the results). Sometimes the data just does not have enough information, either in the variety of examples or in the features, to make prediction better (or the test and training sets are too different). If the SVMs give you approximately the same results, it means that you might have to look into getting more types of data.
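A baseline of that kind could be sketched with LIBSVM's MATLAB interface as shown below. This is an illustration under assumptions: LIBSVM's mex files are compiled and come ahead of any toolbox function named svmtrain on the path, X_trn and X_tst are double matrices, and Y_trn and Y_tst are numeric column vectors. Default C and gamma are used, so the numbers are only a rough baseline; the [0,1] scaling step and its variable names are additions made here.

```matlab
% Scale every feature to [0,1] using the training-set ranges only.
mn = min(X_trn, [], 1);
rg = max(X_trn, [], 1) - mn;
rg(rg == 0) = 1;                               % guard against constant features
scale01 = @(X) (X - repmat(mn, size(X,1), 1)) ./ repmat(rg, size(X,1), 1);
X_trn_s = scale01(X_trn);
X_tst_s = scale01(X_tst);

% Linear SVM (-t 0) and RBF-kernel SVM (-t 2), both in quiet mode (-q).
svm_lin = svmtrain(Y_trn, X_trn_s, '-t 0 -q');
[pred_lin, acc_lin, dec_lin] = svmpredict(Y_tst, X_tst_s, svm_lin);
svm_rbf = svmtrain(Y_trn, X_trn_s, '-t 2 -q');
[pred_rbf, acc_rbf, dec_rbf] = svmpredict(Y_tst, X_tst_s, svm_rbf);

% acc(1) is classification accuracy in percent; convert it to an error rate.
fprintf('linear SVM error rate: %f\n', 1 - acc_lin(1)/100);
fprintf('RBF SVM error rate:    %f\n', 1 - acc_rbf(1)/100);
```

If both error rates land near the random forest's, that supports the reading that the limit lies in the data rather than in the classifier.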
Original comment by abhirana
on 1 Mar 2010 at 1:04
Original comment by abhirana
on 17 Mar 2010 at 6:22
Original issue reported on code.google.com by
crazy...@126.com
on 27 Feb 2010 at 2:29