Get all node values from all trees in rf

sslavian812 commented 6 years ago

Hi! I'm working on a research experiment that requires me to get the "max possible value" that your regression forest can predict at the moment.

I've been trying to get that value through the Python API, using RandomForestWithInstances.
It seems there is no way to get something like "all nodes from all trees" other than serializing the RandomForest to a string or to TeX and reading that.

model.rf.ascii_string_representation() gives me something like this:

```
{
  "value0": {
    "value0": 10,
    "value1": 9,
    "value2": true,
    "value3": false,
    "value4": {
      "value0": 5,
      "value1": 20,
      "value2": 3,
      "value3": 2.0,
      "value4": 3,
      "value5": 1.0,
      "value6": 1048576,
      "value7": 1e-8,
      "value8": 1000.0,
      "value9": false
    }
  },
  "value1": [
    {
      "value0": [
        {
          "value0": [],
          "value1": [],
          "value2": 0,
          "value3": { "value0": 1, "value1": 2 },
          "value4": { "value0": 0.5555555555555556, "value1": 0.4444444444444444 },
          "value5": {
            "value0": 4,
            "value1": 0.41871805715741286,
            "value2": { "type": 0, "data": 0 }
          },
          "value6": {
            "value0": 0.0,
            "value1": 0.0,
            "value2": { "value0": 0, "value1": 0.0, "value2": 0.0 }
          }
        },
        // some more here ...
      ],
      "value1": 2,
      "value2": 1
    }
  ],
  "value2": 6,
  "value3": [],
  "value4": NaN,
  "value5": [ 3, 0, 0, 0, 0, 0 ],
  "value6": [
    { "value0": 3.0, "value1": NaN },
    { "value0": 0.0, "value1": 1.0 },
    { "value0": 0.0, "value1": 1.0 },
    { "value0": 0.0, "value1": 1.0 },
    { "value0": 0.0, "value1": 1.0 },
    { "value0": -Infinity, "value1": Infinity }
  ]
}
```

What are these value0 ... value6 supposed to mean? I'm totally confused.
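For anyone who wants to poke at the structure anyway, here is a minimal sketch. Python's standard json module accepts NaN, Infinity and -Infinity by default, so a string shaped like the excerpt can be loaded into nested dicts and lists and summarised. The sample is a shortened copy of the excerpt, not real output, and `describe` is a hypothetical helper; whether the complete output of ascii_string_representation() parses this way is an assumption.

```python
import json
import math

# Shortened copy of the excerpt in this thread (illustrative, not real pyrfr output).
sample = """
{
  "value0": {"value0": 10, "value1": 9, "value2": true},
  "value2": 6,
  "value4": NaN,
  "value6": [
    {"value0": 3.0, "value1": NaN},
    {"value0": -Infinity, "value1": Infinity}
  ]
}
"""


def describe(obj, depth=0, max_depth=2):
    """Return indented lines listing keys and value types of a nested JSON value."""
    pad = "  " * depth
    lines = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{pad}{key}: {type(value).__name__}")
            if depth < max_depth:
                lines.extend(describe(value, depth + 1, max_depth))
    elif isinstance(obj, list):
        lines.append(f"{pad}[list of {len(obj)} items]")
        # Only the first element is inspected; the rest are assumed to look alike.
        if obj and depth < max_depth:
            lines.extend(describe(obj[0], depth + 1, max_depth))
    return lines


data = json.loads(sample)  # NaN / Infinity / -Infinity are accepted by default
summary = describe(data)
print("\n".join(summary))
```

This only reveals the shape of the data. It does not say what any valueN field means.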

I tried to examine the C++ code in this repository, but InputArchive, JSONInputArchive, and the other template structures seem too complicated and messy for me at the moment.

As far as I understand, I need to get std::vector<node_type> the_nodes; from the k_ary_random_tree somehow, and then get rfr::util::weighted_running_statistics<num_t> response_stat; from each node (k_ary_node, I suppose).

Can you please help me with this issue?
It could become a contribution to the code base once I understand how things work here.

sfalkner commented 6 years ago

Hey, hmm... The largest possible value that the forest can predict is tricky to compute. For a single tree it is trivial, but for the forest things are more complicated. It is quite likely that the highest predictions of the individual trees do not occur for the same input (due to the randomization during training). It is not even guaranteed that the forest's highest prediction occurs at any of these inputs. Maybe I misunderstand what you are trying to do, so could you please elaborate on what you mean by the "max possible value" that the forest can predict?

Regarding your questions:

No, you can't get all the internal nodes with the Python API. The C++ internals are usually not of interest, so I never exposed any of that.
The string representation is only there for pickling the forest. The names valueN come from the serialization library I use and carry no meaning. You can reconstruct the object from the string, but it's hopeless to try to find all the leaves from that representation.
You could get all the nodes and then use the response_stat to compute the largest output for a tree, but as I argued above, that is not necessarily the true global maximum.
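To illustrate the last point, here is a minimal sketch with two toy trees on a single feature. The nested-dict layout and the helpers `leaf`, `split`, `predict` and `max_leaf` are hypothetical and have nothing to do with pyrfr's internal k_ary_random_tree; the forest prediction is assumed to be the plain average of the tree predictions. The mean of the per-tree maxima is only an upper bound, because the trees reach their maxima at different inputs.

```python
# Hypothetical toy trees on one feature x in [0, 1); NOT pyrfr's internal layout.
def leaf(value):
    return {"value": value}


def split(threshold, left, right):
    return {"threshold": threshold, "left": left, "right": right}


tree_a = split(0.5, leaf(1.0), leaf(4.0))  # highest leaf (4.0) is reached for x >= 0.5
tree_b = split(0.3, leaf(5.0), leaf(2.0))  # highest leaf (5.0) is reached for x < 0.3
forest = [tree_a, tree_b]


def predict(tree, x):
    """Walk from the root to a leaf and return that leaf's value."""
    while "value" not in tree:
        tree = tree["left"] if x < tree["threshold"] else tree["right"]
    return tree["value"]


def max_leaf(tree):
    """Largest leaf value of a single tree (the easy per-tree maximum)."""
    if "value" in tree:
        return tree["value"]
    return max(max_leaf(tree["left"]), max_leaf(tree["right"]))


def forest_predict(x):
    """Forest prediction, assumed here to be the average over all trees."""
    return sum(predict(t, x) for t in forest) / len(forest)


# Averaging the per-tree maxima gives an upper bound on the forest prediction.
upper_bound = sum(max_leaf(t) for t in forest) / len(forest)

# Maximum of the forest prediction over a dense grid of inputs.
grid = [i / 100 for i in range(100)]
forest_max = max(forest_predict(x) for x in grid)

print(upper_bound, forest_max)
```

In this toy case the bound is loose: no single input reaches the highest leaf of both trees at once.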

Hope that helps.

sslavian812 commented 6 years ago

@sfalkner Thank you for the quick reaction!

I was trying to get the max possible value for each tree in the forest and then combine those values somehow, maybe by averaging them or in some other way.
I want to be able to compare two random forests in terms of "which one will probably give me the bigger value for the next (random) object".

Here is why: I'm experimenting with https://github.com/automl/SMAC3 hyperparameter optimization for my research work. I have a few running SMAC instances, and at each step I want to choose one of them to run. I want to use the SMAC instance whose underlying random forest will potentially give me the bigger EI value. That is why I'm investigating the random forest internals :)

It would be even better if I could find which objects tend to get the maximum value out of the random forest, but that's another story. I was thinking about this post: https://stats.stackexchange.com/questions/205145/find-max-value-of-random-forest-regressor-output

sfalkner commented 6 years ago

So you are actually looking for the maximum possible EI value? For that you need the mean and variance predictions of the forest, which involve all trees. Furthermore, EI is a non-stationary quantity, meaning the values you can achieve change over time. For a model with a lot of points that is certain about the optimum, the EI values are much smaller than for a model trained with less data. So do be careful when comparing those.
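A minimal sketch of that dependence, using the standard closed form of expected improvement for minimisation under a Gaussian predictive distribution. The mean and spread are taken directly across hypothetical per-tree predictions, which is an assumption; the variance estimate pyrfr or SMAC actually uses may be computed differently. All names and numbers below are illustrative.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, std, best):
    """EI for minimisation: E[max(best - Y, 0)] with Y ~ N(mean, std**2)."""
    if std <= 0.0:
        # No uncertainty left: the improvement is deterministic.
        return max(best - mean, 0.0)
    z = (best - mean) / std
    return (best - mean) * normal_cdf(z) + std * normal_pdf(z)


def forest_mean_std(tree_predictions):
    """Mean and standard deviation across trees (an assumed, simplified estimate)."""
    n = len(tree_predictions)
    mean = sum(tree_predictions) / n
    variance = sum((p - mean) ** 2 for p in tree_predictions) / n
    return mean, math.sqrt(variance)


# Hypothetical per-tree predictions at one candidate point; both have mean 1.0.
uncertain = [0.2, 1.0, 1.8]  # trees disagree, e.g. a model trained on little data
confident = [0.9, 1.0, 1.1]  # trees agree, e.g. a model trained on many points
best_so_far = 0.8            # best (lowest) cost observed so far

ei_uncertain = expected_improvement(*forest_mean_std(uncertain), best_so_far)
ei_confident = expected_improvement(*forest_mean_std(confident), best_so_far)
print(ei_uncertain, ei_confident)
```

With the same mean, the forest whose trees disagree more yields the larger EI, which is why raw EI values from models at different stages of training are hard to compare.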

sslavian812 commented 6 years ago

Thank you for the information.

I added some methods that calculate the largest output for each tree: https://github.com/sslavian812/random_forest_run/commit/e8f8d3caa483cbd338d623a9f0e65e3bfbb89c51

I'm new to Python and totally new to wrapping C++ code in a Python interface with SWIG. I hoped that if I installed pyrfr from the repository, the interfaces would be rebuilt upon installation.

pip install git+git://github.com/sslavian812/random_forest_run.git

gives me an error:

Collecting git+git://github.com/sslavian812/random_forest_run.git
  Cloning git://github.com/sslavian812/random_forest_run.git to /tmp/pip-nqd62n11-build
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "..../lib/python3.4/tokenize.py", line 438, in open
        buffer = _builtin_open(filename, 'rb')
    FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-nqd62n11-build/setup.py'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-nqd62n11-build/

Do you have an idea what I am missing? Is there some quick way to install an experimental version of pyrfr?

sfalkner commented 6 years ago

There is, but it is a bit cumbersome. You will need cmake, doxygen and boost (a C++ library). Then go into the git repo and execute the following commands:

mkdir build
cd build
cmake ..
make pyrfr_docstrings
cd python_package
pip install . --user

You will have to repeat this every time you change something in the C++ code in order to rebuild the Python package. Wrapping your functionality should work out of the box, and you should have access to your new methods without doing anything else. Let me know if you have any more trouble.
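Since the steps have to be repeated after every C++ change, they could be wrapped in a small helper. This is only a sketch that mirrors the commands above through subprocess; the function names `build_steps` and `rebuild` are hypothetical, and it assumes cmake, make and pip are on the PATH.

```python
import os
import subprocess


def build_steps(repo_dir):
    """Return (command, working directory) pairs mirroring the manual steps."""
    build_dir = os.path.join(repo_dir, "build")
    package_dir = os.path.join(build_dir, "python_package")
    return [
        (["cmake", ".."], build_dir),
        (["make", "pyrfr_docstrings"], build_dir),
        (["pip", "install", ".", "--user"], package_dir),
    ]


def rebuild(repo_dir):
    """Create the build directory if needed, then run every step in order."""
    os.makedirs(os.path.join(repo_dir, "build"), exist_ok=True)
    for command, cwd in build_steps(repo_dir):
        # check=True stops at the first failing step instead of continuing.
        subprocess.run(command, cwd=cwd, check=True)


# Example call (the path is a placeholder for your local clone):
# rebuild("/path/to/random_forest_run")
```

Unlike a plain mkdir build, the helper does not fail when the build directory already exists, which makes repeated runs easier.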

sslavian812 commented 6 years ago

Thank you, it did the trick! I'd be more than happy to make a pull request, if you consider this feature useful.

automl / random_forest_run

Get all node values from all trees in rf #45