aloysius-lim / bigrf

Random forests for R for large data sets, optimized with parallel tree-growing and disk-based memory

Error estimates should not include examples that have never been out-of-bag #1

Closed aloysius-lim closed 11 years ago

aloysius-lim commented 11 years ago

Currently, error estimates are computed as a proportion of all examples. However, in the early stages of building the forest, some examples have never been out-of-bag, so they have no out-of-bag prediction, yet they still contribute to the error score. It is unfair to count these examples as "errors".

Instead, error estimates should be computed as a proportion of all examples that have been out-of-bag at least once.
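
To make the difference between the two denominators concrete, here is a minimal sketch in Python. It is not bigrf's code or API: the function `oob_error`, the `oob_votes` structure (one dict of class label to out-of-bag vote count per example, empty if the example has never been out-of-bag), and the toy data are all assumptions made up for illustration.

```python
def oob_error(true_labels, oob_votes):
    """Return (error over all examples, error over examples seen out-of-bag).

    Hypothetical helper: oob_votes[i] maps class label -> number of
    out-of-bag votes for example i, and is empty if example i has never
    been out-of-bag.
    """
    n = len(true_labels)
    evaluated = 0  # examples that have been out-of-bag at least once
    wrong = 0      # evaluated examples whose majority vote is incorrect

    for y, votes in zip(true_labels, oob_votes):
        if not votes:
            continue  # never out-of-bag: no prediction to score
        evaluated += 1
        predicted = max(votes, key=votes.get)  # ties go to the first label inserted
        if predicted != y:
            wrong += 1

    never_oob = n - evaluated
    # Behaviour described in the issue: never-out-of-bag examples count as errors.
    error_all = (wrong + never_oob) / n
    # Proposed behaviour: divide only by the examples that could be scored.
    error_oob_only = wrong / evaluated if evaluated else float("nan")
    return error_all, error_oob_only


# Toy data: four examples, two of which have never been out-of-bag.
labels = ["a", "b", "a", "b"]
votes = [{"a": 2}, {"a": 1}, {}, {}]

current, proposed = oob_error(labels, votes)
print(current, proposed)  # 0.75 0.5
```

In this toy case one of the two scored examples is misclassified, so the proposed estimate is 0.5, while counting the two unscored examples as errors gives 0.75.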

aloysius-lim commented 11 years ago

Fixing this makes the error estimates unstable for the first several trees. The resulting plots of training error are ugly: the curve jumps about quite a bit before settling down. I have therefore decided not to fix this unless there is strong demand for it.