Addition of new training examples to an existing RF classifier?

GoogleCodeExporter commented 9 years ago

Hello,

first of all, thank you very much for this code, I've been using it more and 
more for various research projects and will hopefully soon be able to cite it 
in a paper!

I was wondering if there was a way to add additional training examples to a 
previously trained RF classifier (using the same set of features, of course). I 
am interested in creating an interactive classification tool and being able to 
add additional examples without having to re-train the whole classifier would 
be _very_ useful!

I haven't investigated the fundamental aspect of random forests yet so it might 
be obvious that it is impossible but I thought it would be easier to ask before 
trying to figure it out by myself.

Thanks again for this code!

Regards,

Nicolas

Original issue reported on code.google.com by nicolas....@gmail.com on 7 Sep 2012 at 3:59

GoogleCodeExporter commented 9 years ago

I didn't mean to submit that this a defect by the way!

Original comment by nicolas....@gmail.com on 7 Sep 2012 at 4:00

GoogleCodeExporter commented 9 years ago

no issues. its a defect with almost all classifiers i know of :)

the package doesn't allow you to do that.

in theory, it may or maynot be possible with randomforests or decision trees in 
general. 

the trees split the examples based on the tgini value and for new examples one 
could always see if the examples is perfectly classified with the existing tree 
then no change is required to the tree (usually) and if the examples is not 
classified correctly then recreation of some branches of the tree may be done. 
or one could recalculate the tgini value at each node of the tree that the 
example passes through and if the value is > some threshold recreate that part 
of the tree. this will require some saving of the current tgini values at the 
nodes.

but then randomforests are greedy and approximated so you may be able to get 
away with lots of approximation any not require any kind of retraining unless 
you encounter lots of new training examples and the distribution of the data 
changes.

i vaguely remember a paper in the time-series domain where in order to learn to 
the drift, the authors retrain a bunch of trees, discard a bunch of trees from 
the original forest and consider the new tree bunch as a forest trained on the 
current set of data.

Original comment by abhirana on 7 Sep 2012 at 6:05

GoogleCodeExporter commented 9 years ago

Thanks for the reply!

After I posted this, I stumbled upon MATLAB's built-in ensemble classifiers and 
it appears it has method to merge and append to bagged trees 
(http://www.mathworks.co.uk/help/toolbox/stats/treebagger.append.html). I found 
the built-in methods to be much slower than your implementation but I guess it 
might be worth a try. The document is very succinct, any idea how these methods 
work?

Cheers,

Nicolas

Original comment by nicolas....@gmail.com on 7 Sep 2012 at 6:10

GoogleCodeExporter commented 9 years ago

Another quick question: I trained very large classifiers (~10 million examples, 
28 features). The memory footprint is pretty big (easily fills my 16GB of ram). 
I was wondering if there were fields in the structure that I could safely 
discard to reduce the size in memory?

Original comment by nicolas....@gmail.com on 7 Sep 2012 at 6:17

GoogleCodeExporter commented 9 years ago

cool. 

looks like they are doing tree level merging to create a new forest, like the 
paper i mentioned about time-series random forests. i don't think they are 
removing existing trees from the forest, only adding more trees.

i could add the append/remove function if you want? something like 
append/remove(forest1, forest2, forest1_tree_indices, forest2_tree_indices)

the issue then will be how to calculate the oob error (which depends on what 
training example you have used and cannot be propagated if the training 
examples are the same)

about the memory requirement stuff, actually right now there is no 
straightforward fix to that, the issue is that the arrays are stored in a ntree 
x number nodes  order size and most trees are much smaller than number nodes 
size and because of that memory is wasted, if the array are stored as number 
nodes x ntree size then one can dynamically increase the node size if a tree 
gets bigger. actually come to think about it append/remove maynot be possible 
due to this memory arrangement. 

i can give it a try if you cannot find a reasonable solution.

Original comment by abhirana on 7 Sep 2012 at 6:26

GoogleCodeExporter commented 9 years ago

Well if you could give it a try it would be great! I will evaluate the built-in 
ensemble methods but my preliminary tests were not very encouraging sadly.

Original comment by nicolas....@gmail.com on 7 Sep 2012 at 6:43

GoogleCodeExporter commented 9 years ago

ok, let me give that a try, it will take me about a week or so to get it done.

Original comment by abhirana on 7 Sep 2012 at 6:47

GoogleCodeExporter commented 9 years ago

Thanks, much appreciated!

Original comment by nicolas....@gmail.com on 17 Sep 2012 at 5:13

GoogleCodeExporter commented 9 years ago

Abhirana,

I was wondering if you had any chance to try implementing such a feature?

Thanks!

Nicolas

Original comment by nicolas....@gmail.com on 17 Jan 2013 at 1:43

GoogleCodeExporter commented 9 years ago

thanks nicolas for reminding. i apologize for the delay. 

i found out that it will require a major rewriting of the code. i haven't had a 
chance to make major changes and test them yet. (i am about planning to defend 
my phd in a month or two so thats what distracting me for the timebeing)

i'll make it a priority to get it done soon. i am guessing that for the time 
being i can implement it for classification (do tell me if regression is more 
important or not), is doing it for categorical data required for the time 
being? and how about implementation for importance?

Original comment by abhirana on 21 Jan 2013 at 5:19

Changed state: Started

GoogleCodeExporter commented 9 years ago

Thanks for taking the time to work on this. I only use classification, I do not 
use categorical data and importance is not critical for my work.

Good luck with the PhD defense, my turn is only a few months away :)

Original comment by nicolas....@gmail.com on 21 Jan 2013 at 9:43

devmax / randomforest-matlab

Addition of new training examples to an existing RF classifier? #42