Hi Johannes,
I think the package does not tell which examples were in-bag, just, overall, how many times a particular example was in-bag. Even if this information were available, it would be very memory-intensive to record that a particular set of examples was used at a particular node of the tree. Maybe you could write everything to a file as the tree is constructed. Would that help?
Also, there might be a bit of an issue if you are using regression RF: a new dataset is created by bagging, and the original example list is sort of lost to the tree construction. But I know the exact place to do this for classification.
Original comment by abhirana
on 11 May 2012 at 8:08
Hi,
thank you for your answer. I guess I was able to get the bootstrap samples.
To obtain the samples that are in-bag, you simply have to enable the option in the parameters:
rfopt.keep_inbag = true; %keep track if a sample is in-bag for a tree
The classifier then returns a matrix of size number_of_samples x number_of_trees. However, this is a binary matrix; it does not give the number of times a sample was selected.
I dug into the code of classRF.cpp and found that the variable jin holds the in-bag values. As I was too lazy to introduce a new variable, I simply changed the assignment in lines 378 and 414 from
jin[k] = 1;
to
jin[k] = jin[k] + 1; //get number of times selected for the bootstrap sample
This now returns the number of times a sample is selected in each bootstrap sample. At first sight it seems to work and not to interfere with any other calculations, since jin is only checked for zero values.
I will now check if I can reproduce the gini impurity values.
Johannes
Original comment by johannes...@gmail.com
on 14 May 2012 at 1:24
Cool.
Lol, I totally forgot that option.
Yup, changing jin is correct. The logic won't be affected, as all that is checked is whether it's 0 or not 0.
Original comment by abhirana
on 14 May 2012 at 8:03
Hi,
I managed to reconstruct the Gini impurity (GI) of every single node. To do so, I used the bootstrap sample, the best feature, and the threshold. With the GI, the decrease in GI is calculated. However, my results do not match the outputs from the RF.
Do you have any idea how Breiman calculates the decrease? Right now I am
looking into the Fortran code but haven't understood it yet. I guess
findbestsplit is the right method to look at.
Here is how I calculate the GI and the decrease in GI.
The GI is shown in image 1. L is the total number of classes and p_l^w is the frequency of samples of class l in node w (p_l^w = n_l^w / n_w).
The decrease in GI is simply the difference of the GI of node w, with its two
child nodes w1, w2 weighted by the samples of the child nodes.
dec(w) = i(w) - n_w1/n_w * i(w1) - n_w2/n_w * i(w2)
Do you see any errors I missed?
Original comment by johannes...@gmail.com
on 27 May 2012 at 8:51
Attachments:
Hey Johannes,
Below is how Breiman calculates the decrease. I ignored the class weights in the equation, btw.
parentNum =0, parentDen =0
for i=0:(num_class-1)
parentNum += tclasspop[i] * tclasspop[i] (make sure these are doubles else you will have overflow)
parentDen += tclasspop[i]
end
Sort feature k and do the following for the examples according to the sort order
rrn = parentNum;
rrd = parentDen;
rln = 0
rld = 0;
wl[0..(num_class-1)] = 0;
wr[0..(num_class-1)] = tclasspop[0..(num_class-1)]; //put the class population in wr
crit0 = parentNum/parentDen;
//basically crit0 = sum(classpopulation_{i=1:num_class}^2) / sum(classpopulation_{i=1:num_class})
for example_num = start to end
counts = number of times example was oobed
rld = rld + counts
rrd = rrd - counts
rln = rln + counts * (2 * wl[y_label[example_num]] + counts );
rrn = rrn + counts * (-2 * wr[y_label[example_num]] + counts );
crit = (rln / (GLOBAL_DATA_TYPE) rld) + (rrn / (GLOBAL_DATA_TYPE) rrd);
wl[y_label[example_num]] = wl[y_label[example_num]] + counts;
wr[y_label[example_num]] = wr[y_label[example_num]] - counts;
end
critMax = highest crit
pick the feature with highest crit
and then the decision of the split is
decsplit = critMax - crit0;
Decomposing crit, and ignoring the counts for the time being: if class_i is the class of the current example, then the crit value is
crit = rln/rld + rrn/rrd
= (2 * num_nodes_of_class_i_on_left) / num_nodes_on_left - (2 * num_nodes_of_class_i_on_right) / num_nodes_on_right
Maybe expanding dec(w) gives a similar answer to how Breiman calculates it?
Original comment by abhirana
on 28 May 2012 at 4:03
Hey Abhishek,
thank you for your fast answer. It helped me figure out some details of Breiman's code.
However, what I still do not fully understand is this part:
for example_num = start to end
....
end
Especially the value of the variable 'counts' in your code is unclear to me. Could you please explain in more detail what you mean by "number of times example was oobed"? Which example is meant here? And over which examples do I have to iterate?
In Breiman's code there is first the iteration over the mtry features and then the iteration over nsp. Is it correct that the loop from ndstart to ndend-1 is over the possible thresholds for the split?
do mt = 1, mtry
...
do nsp = ndstart, ndend-1
...
end
end
I haven't yet understood the whole impurity criterion as used in Breiman's code. What is the theoretical concept behind it? I always thought the Gini importance was based on the mean decrease in Gini impurity?
Once again thank you for your help.
Johannes
Original comment by johannes...@gmail.com
on 3 Jun 2012 at 10:12
"for example_num = start to end
....
end
Especially the value of the variable 'counts' in your code is unclear to me. Could you please explain in more detail what you mean by "number of times example was oobed"? Which example is meant here? And over which examples do I have to iterate?"
My code and the randomForest-R code use two different ways of creating the dataset for individual trees when it comes to regression vs. classification.
RF uses bagging, so roughly only 63% of the original dataset will be used for training in the new dataset for each tree, and within those 63% of examples, many will be chosen multiple times, because we are sampling with replacement.
If you examine the classification code, for each tree a new dataset is not created; instead we keep track of how many times an example was in the bag. I had a typo there: where I said oobed, I meant inbagged. Whereas for the regression code, a new dataset is created, and which examples in the new dataset correspond to the original dataset is somewhat lost.
Now for the tree example:
function create_tree(all examples dataset)
% this is strictly not a function; it's more about how the function recursively searches the middle point
set initial start_range to 1 and end_range to N (the number of examples)
while some condition (like number of nodes in tree < some value)
middle_point, indx = findbestsplit(start_range, end_range)
% reshape the input array according to the indx sent
% due to the complicated nature of representing the indx array I am not putting it below
in the next iteration:
middle_point_1 = findbestsplit(start_range, middle_point)
middle_point_2 = findbestsplit(middle_point+1, end_range)
in the iteration after that:
middle_point_11 = findbestsplit(start_range, middle_point_1)
middle_point_12 = findbestsplit(middle_point_1+1, middle_point)
middle_point_21 = findbestsplit(middle_point+1, middle_point_2)
middle_point_22 = findbestsplit(middle_point_2+1, end_range)
and so on
function middle_point, indx = findbestsplit(start_range, end_range)
for mt = 1, mtry
% this is probably the most important part:
% sort feature mt for the examples
% so that the indices of the examples and the feature values (in ascending order) are obtained
for nsp = ndstart, ndend-1
calculate the gini index in sorted order
end
end
% the middle point is where the gini index was maximum
% indx is the sort order of the examples for the feature whose gini index was maximum
% note that indx is then used to reshape the original array so that, in subsequent
% iterations, all examples left of the middle point are together and all examples right of the middle point are together
Breiman's classification code has a really nice shortcut of using a presorted array. It helps RF scale as O(n) rather than O(n log n). This, though, has complicated the whole classification code, and you won't see sorting done at all within the code (except initially). A better way would be to look at the regression code, where you will see the sorting being done; it is a bit easier to read. Except for how gini/deviance is calculated, the regression and classification code do the same thing (but do it differently: presorting vs. sorting on the fly).
counts = the number of times a particular example is inbagged. Since the Gini won't change whether the duplicated examples (the same example inbagged multiple times) are listed separately or grouped together, incrementing/decrementing by counts works.
"In Breiman's code there is first the iteration over the mtry features and then the iteration over nsp. Is it correct that the loop from ndstart to ndend-1 is over the possible thresholds for the split?
do mt = 1, mtry
...
do nsp = ndstart, ndend-1
...
end
end
"
ndstart and ndend change dynamically as the tree progresses. Yeah, I guess in conjunction with the sorted order it helps in calculating all possible splits. It may just be the fact that the sorted order of values helps in calculating the maximum (I am not sure it's the max Gini; my Gini-related knowledge is limited)
http://mathworld.wolfram.com/GiniCoefficient.html (the 2i-n-1 formula)
in a nice closed-form iterative way.
I think it is possible to derive what is being done, i.e. finding the middle point for sorted values that has maximum Gini, and relate it to the Gini coefficient, but it will take me a few more days.
Original comment by abhirana
on 6 Jun 2012 at 7:29
Hi to all! I have three input columns as doubles: years, a vector, and precipitation. How can I create one .m script file that runs N=5000 random repetitions as a bootstrap method over the three columns? I have one CSV file called year with these three columns as doubles.
Original comment by emotion....@gmail.com
on 16 Sep 2013 at 10:18
Original issue reported on code.google.com by
johannes...@gmail.com
on 11 May 2012 at 10:57