How to save the model? Help me!

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. After I train a RandomForest, how do I save the model so that I can predict on new data next time?

Original issue reported on code.google.com by crazy...@126.com on 27 Feb 2010 at 2:29

GoogleCodeExporter commented 9 years ago

I want to use 10-fold cross-validation with your RandomForest, but I am not sure whether your program supports it.

Original comment by crazy...@126.com on 28 Feb 2010 at 12:02

GoogleCodeExporter commented 9 years ago

Comment 1:
This package works like any other MATLAB function/variable.

Say model_RF = classRF_train(...)

Then you can save model_RF to a file via save in MATLAB

and then load it later and use it again with classRF_predict()
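
A minimal sketch of that save/load round trip, assuming the package is on the MATLAB path. The synthetic data, the file name model_RF.mat, and the variable X_new are placeholders for illustration only.

```matlab
% Synthetic stand-in data (hypothetical): 200 examples, 5 features, 2 classes.
X_trn = rand(200, 5);
Y_trn = double(X_trn(:, 1) > 0.5) + 1;   % labels 1 and 2

% Train once and store the model struct in a MAT-file.
model_RF = classRF_train(X_trn, Y_trn);
save('model_RF.mat', 'model_RF');

% Later, possibly in a new session: restore the model and predict on new data.
S = load('model_RF.mat');                % S.model_RF holds the saved struct
X_new = rand(10, 5);                     % placeholder for the new data
Y_hat = classRF_predict(X_new, S.model_RF);
```

Loading into a struct (S) rather than straight into the workspace avoids silently overwriting a variable that already has the same name.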

Comment 2:
RF computes something called the out-of-bag (OOB) error, an internal error estimate that helps keep overfitting in check:
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr

So I think the most that CV will give you is an error rate.

You should be able to do something like CV simply in MATLAB:

%assume X is your data and Y the labels
N = size(X,1); %number of examples
num_random_exp = 100; %number of experiments to do
type_of_CV = 10; %for 10 fold: hold out 1/10 of the data each time

results_array = zeros(num_random_exp,1); %per-experiment error rates

for i=1:num_random_exp
   %each iteration draws a fresh random split (repeated hold-out),
   %so the test sets of different iterations may overlap
   indices = randperm(N); %shuffle the indices
   train_indices = indices(1:floor(N-N/type_of_CV));
   test_indices = indices(1+floor(N-N/type_of_CV):end);

   %training set
   X_trn = X(train_indices,:);
   Y_trn = Y(train_indices);

   %test set
   X_tst = X(test_indices,:);
   Y_tst = Y(test_indices);

   model_RF = classRF_train(X_trn,Y_trn);
   Y_hat = classRF_predict(X_tst,model_RF);

   %fraction of misclassified test examples (a rate, not a count)
   results_array(i) = mean(Y_hat(:)~=Y_tst(:));
end

error_rate = mean(results_array)
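
The loop above draws a new random split on every iteration. A strict 10-fold variant, in which every example is held out exactly once, could be sketched as follows. This is only an illustration: the data is a synthetic placeholder, and names such as fold_id and fold_err are hypothetical.

```matlab
% Placeholder data (hypothetical): replace with your own X (N-by-D) and Y (N-by-1).
X = rand(300, 5);
Y = double(X(:, 1) > 0.5) + 1;

K = 10;                                   % number of folds
N = size(X, 1);
shuffled = randperm(N);                   % random order of the examples
fold_id = zeros(N, 1);
fold_id(shuffled) = mod(0:N-1, K) + 1;    % each example gets one fold label 1..K

fold_err = zeros(K, 1);
for k = 1:K
    tst = (fold_id == k);                 % held-out fold
    trn = ~tst;                           % remaining K-1 folds

    model_RF = classRF_train(X(trn, :), Y(trn));
    Y_hat = classRF_predict(X(tst, :), model_RF);

    % fraction of misclassified examples in this fold
    fold_err(k) = mean(Y_hat(:) ~= reshape(Y(tst), [], 1));
end

cv_error_rate = mean(fold_err)
```

Because the folds are disjoint, fold sizes differ by at most one example and no test example is counted twice.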

Original comment by abhirana on 28 Feb 2010 at 12:22

GoogleCodeExporter commented 9 years ago

About Comment 2: I read about the OOB error, but in my test, the same data in Weka 3.7 with 10-fold CV training gives an error_rate of only 0.2, while with your program I get an error_rate of 0.51 without 10-fold CV.

Original comment by crazy...@126.com on 28 Feb 2010 at 1:24

GoogleCodeExporter commented 9 years ago

Are you sure that the number of trees and mtry are the same? Sometimes 1000 trees is the minimum number of trees that you need.

The results are also affected by whether Weka is treating some of the data as categorical; in the current form of this package you have to declare that explicitly in the program.
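
One way to make the comparison like-for-like is to fix the number of trees and mtry explicitly on both sides. The sketch below assumes classRF_train accepts the number of trees and mtry as optional third and fourth arguments; that calling form is not shown anywhere in this thread, so check the help text of classRF_train before relying on it. The data and the values 1000 and 6 are placeholders.

```matlab
% Placeholder data (hypothetical): replace with your own training and test sets.
X_trn = rand(300, 8);  Y_trn = double(X_trn(:, 1) + X_trn(:, 2) > 1) + 1;
X_tst = rand(100, 8);  Y_tst = double(X_tst(:, 1) + X_tst(:, 2) > 1) + 1;

ntree = 1000;   % number of trees; use the same value in Weka
mtry  = 6;      % features tried at each split; use the same value in Weka

% ASSUMPTION: ntree and mtry are the 3rd and 4th positional arguments.
model_RF = classRF_train(X_trn, Y_trn, ntree, mtry);
Y_hat = classRF_predict(X_tst, model_RF);

error_rate = mean(Y_hat(:) ~= Y_tst(:));
fprintf('ntree = %d, mtry = %d, error rate = %f\n', ntree, mtry, error_rate);
```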

Original comment by abhirana on 28 Feb 2010 at 1:26

GoogleCodeExporter commented 9 years ago

Abhirana, thanks for your help. I am testing it and will report back to you.

Original comment by crazy...@126.com on 28 Feb 2010 at 1:36

GoogleCodeExporter commented 9 years ago

With 10-fold CV, trees = 100, mtry = 6, I get error_rate = 0.238068, a little higher than Weka.
Under Weka: trees = 10, mtry = 6, error_rate = 0.1932.

Original comment by crazy...@126.com on 28 Feb 2010 at 8:53

GoogleCodeExporter commented 9 years ago

Do not rely on results from fewer than 1000 trees (or at least 500 if your data is large).

Random forests depend heavily on random numbers, and you often need at least 1000 trees before the forest stabilizes.

Original comment by abhirana on 28 Feb 2010 at 9:00

GoogleCodeExporter commented 9 years ago

My test data has more than 20000 examples; if I use trees = 1000, it will run very slowly. I will try it, but I am afraid it may produce an "Out of memory" error in MATLAB.

Original comment by crazy...@126.com on 28 Feb 2010 at 9:09

GoogleCodeExporter commented 9 years ago

Oh, it might run slowly, but I have routinely used large files, around 30000+ data points and 50 dimensions, on a reasonable machine.

Try a short experiment: gradually increase the number of trees and plot the OOB rate.

If your OOB rate still seems to go lower as you increase the number of trees, it means your forest has not yet stabilized. After a while the OOB rate will stabilize, and at least that many trees are required to get a decent, stable answer.

See example 16 in the tutorial. The dataset is simple and around 100 trees suffice to reach a steady OOB rate.

% example 16: getting the OOB rate. model will have errtr whose first
% column is the OOB rate, and the second column is for the 1st class and
% so on
    model = classRF_train(X_trn,Y_trn);
    Y_hat = classRF_predict(X_tst,model);
    fprintf('\nexample 16: error rate %f\n', length(find(Y_hat~=Y_tst))/length(Y_tst));

    figure('Name','OOB error rate');
    plot(model.errtr(:,1)); title('OOB error rate'); xlabel('iteration (# trees)'); ylabel('OOB error rate');
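
Instead of judging the plot by eye, the point where the OOB curve flattens can also be estimated numerically. The following is only a sketch: the curve is a synthetic stand-in (with a trained model you would use oob = model.errtr(:,1)), and the window length and tolerance are arbitrary choices, not values recommended by the package.

```matlab
% Stand-in OOB curve (hypothetical); with a trained model use: oob = model.errtr(:,1);
ntree = 500;
oob = 0.20 + 0.30 * exp(-(1:ntree)' / 40);

window = 50;     % number of further trees over which the rate must stay flat
tol = 1e-3;      % largest allowed spread of the OOB rate inside the window

stable_at = NaN; % stays NaN if the curve never flattens within ntree trees
for t = 1:(ntree - window)
    seg = oob(t:t + window);
    if max(seg) - min(seg) < tol
        stable_at = t;
        break;
    end
end

fprintf('OOB rate is flat (within %g) from about %d trees on\n', tol, stable_at);
```

If stable_at stays NaN, the forest has not stabilized within the trees that were grown and more trees are needed. On a real OOB curve, which is noisier than this stand-in, the tolerance usually has to be looser.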

Original comment by abhirana on 28 Feb 2010 at 9:23

GoogleCodeExporter commented 9 years ago

Hi, abhirana. I used trees = 500 to train on my dataset and got error_rate 0.003, but when I use this model to test another dataset I get error_rate 0.240596. I have uploaded an OOB rate plot for you.

When I use trees = 1000, an "Out of memory" error occurs, and I can't solve it. Do you have any good ideas? My training dataset is 25000x42.

Original comment by crazy...@126.com on 28 Feb 2010 at 3:59

Attachments:

500t_TestData_oob_error_rate.jpg

GoogleCodeExporter commented 9 years ago

Well, if you are getting worse error rates on a different dataset, it might just mean that these two datasets are a bit different.
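
An error rate near zero on the data the forest was trained on is expected and says little, since each tree has already seen most of those examples. A fairer in-sample number is the final OOB rate, which can be put next to the error on the second dataset. A minimal sketch, assuming model.errtr(:,1) holds the OOB rate as described in example 16; the data is a synthetic placeholder.

```matlab
% Placeholder data (hypothetical): replace with your own two datasets.
X_trn = rand(300, 8);  Y_trn = double(X_trn(:, 1) + X_trn(:, 2) > 1) + 1;
X_tst = rand(100, 8);  Y_tst = double(X_tst(:, 1) + X_tst(:, 2) > 1) + 1;

model = classRF_train(X_trn, Y_trn);

% Error on the training data itself (usually optimistic for a random forest).
Y_hat_trn = classRF_predict(X_trn, model);
train_err = mean(Y_hat_trn(:) ~= Y_trn(:));

% OOB error after the last tree: each example is scored only by trees
% that did not see it during training.
oob_err = model.errtr(end, 1);

% Error on the separate dataset.
Y_hat_tst = classRF_predict(X_tst, model);
test_err = mean(Y_hat_tst(:) ~= Y_tst(:));

fprintf('train %f, OOB %f, test %f\n', train_err, oob_err, test_err);
```

If oob_err is close to test_err, the gap between train_err and test_err is not a concern. If test_err is much larger than oob_err, that points to the two datasets being different.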

Hmm, 500 trees should be good enough, but if you want a larger number of trees, you should look into getting a 64-bit OS with 64-bit MATLAB. I think you are using a 32-bit OS, which limits a process to around 2GB of memory. http://support.microsoft.com/kb/555223

Original comment by abhirana on 28 Feb 2010 at 8:21

GoogleCodeExporter commented 9 years ago

That's right, I am using a 32-bit OS. I want to use PCA or ICA to reduce the dimensionality of the datasets; maybe that could give a good result.

Original comment by crazy...@126.com on 1 Mar 2010 at 12:55

GoogleCodeExporter commented 9 years ago

Well, I do not think that you have too many dimensions (and thus PCA and ICA might just remove important dimensions). 40 is a reasonable number and you have lots and lots of examples. Dimensionality reduction usually helps if you have lots of dimensions but not that many examples in comparison.

Do try out an SVM (with a toolbox like http://www.csie.ntu.edu.tw/~cjlin/libsvm/ or http://svmlight.joachims.org/) and see if it helps in your case. That will allow you to build a baseline accuracy to compare against RF (try both linear and non-linear SVMs and note the results). Sometimes the data just doesn't have enough information, either in terms of variety of examples or features, to make prediction better (or the test and training sets are too different). If SVMs give you approximately the same results, then it means that you might have to look into getting more types of data.
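
A minimal sketch of such a baseline using LIBSVM's MATLAB interface, assuming it has been compiled and sits ahead of any MATLAB toolbox on the path, so that svmtrain and svmpredict resolve to LIBSVM's versions. The data is a synthetic placeholder, the option strings use LIBSVM's defaults apart from the kernel type, and the [0,1] scaling is a common preprocessing choice rather than something prescribed in this thread.

```matlab
% Placeholder data (hypothetical): replace with your own training and test sets.
X_trn = rand(300, 8);  Y_trn = double(X_trn(:, 1) + X_trn(:, 2) > 1) + 1;
X_tst = rand(100, 8);  Y_tst = double(X_tst(:, 1) + X_tst(:, 2) > 1) + 1;

% Scale every feature to [0,1] using the training set's min and range.
mn = min(X_trn, [], 1);
rg = max(X_trn, [], 1) - mn;
rg(rg == 0) = 1;                          % guard against constant features
scale = @(X) (X - repmat(mn, size(X, 1), 1)) ./ repmat(rg, size(X, 1), 1);

% Linear SVM (-t 0) and RBF-kernel SVM (-t 2); -q silences training output.
svm_lin = svmtrain(Y_trn, scale(X_trn), '-t 0 -q');
[~, acc_lin, ~] = svmpredict(Y_tst, scale(X_tst), svm_lin);

svm_rbf = svmtrain(Y_trn, scale(X_trn), '-t 2 -q');
[~, acc_rbf, ~] = svmpredict(Y_tst, scale(X_tst), svm_rbf);

% Random forest on the same split for comparison.
model_RF = classRF_train(X_trn, Y_trn);
Y_hat = classRF_predict(X_tst, model_RF);
acc_rf = 100 * mean(Y_hat(:) == Y_tst(:));

% acc_lin(1) and acc_rbf(1) are classification accuracies in percent.
fprintf('linear SVM %.2f%%, RBF SVM %.2f%%, RF %.2f%%\n', acc_lin(1), acc_rbf(1), acc_rf);
```

The SVM parameters are left at their defaults here; a fair comparison would tune at least the cost parameter (and gamma for the RBF kernel) on the training data.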

Original comment by abhirana on 1 Mar 2010 at 1:04

GoogleCodeExporter commented 9 years ago

Original comment by abhirana on 17 Mar 2010 at 6:22

Changed state: Done

devmax / randomforest-matlab

How to save the model? Help me! #5