Closed dhanak closed 1 year ago
For an example of the “butterfly effect” mentioned above, have a look at the following heatmap of the EER of a regression forest. The number of genuine training samples is along the horizontal axis, while the number of features used is on the vertical axis. This surface should be fairly smooth, but notice the significant gap/jump as the number of features increases from 110 to 111. It causes a drop of approximately 4% in accuracy, as reflected in the EER values.
It took me a while to figure out the following: for `n_subfeatures = 10`, this drawing uses a different algorithm than for 11. (There is an explicit `if` branch in `utils.jl`, in the implementation of the hypergeometric distribution, inherited from numpy.) With the proposed fix, not only does the wrinkle in the heatmap go away, but all accuracy values also improve noticeably.
Wow. Fine work identifying and diagnosing a hairy issue. This represents a lot of work!
> the crux of which is to use pre-generated pseudo-random seeds to move all the generators into a unique state.
For my part, I'm inclined to support your kind offer for a fix, but could you please describe this in a bit more detail and/or point to the relevant code in your fork (ideally both).
Note to self: breaking change.
@rikhuijzer @bensadeghi
> the crux of which is to use pre-generated pseudo-random seeds to move all the generators into a unique state.

> For my part, I'm inclined to support your kind offer for a fix, but could you please describe this in a bit more detail and/or point to the relevant code in your fork (ideally both).
Sure, here's the diff: https://github.com/JuliaAI/DecisionTree.jl/compare/dev...dhanak:DecisionTree.jl:dev. First I check whether `seed!` is applicable to the specific rng instance, and if so, I generate a vector of random `UInt`s that will act as seeds, one per tree. Then for each tree, after cloning the rng instance, I invoke `seed!` with the clone and the corresponding pre-generated seed. If, on the other hand, `seed!` is not applicable, I fall back to the current “draw a variable number of values from rng” approach, which I'm still unhappy with.

I contemplated drawing not one, but 1000, or even 1,000,000 random values per tree (i.e., `rand(_rng, 1_000_000i)`), which should, in the vast majority of cases, put enough distance between the rng copies, and it's not too expensive for the built-in Xoshiro and MersenneTwister generators, either. But I felt uneasy about it, because the built-in generators do have a `seed!` implementation, so they will use the other branch anyhow. The fallback case, on the other hand, is going to be used for other, third-party, or yet unknown generators, for which burning millions of random values could add significant computational overhead.

This is the main reason why I haven't issued a pull request yet: I'm still uncertain how to handle the fallback case properly. Any ideas welcome!
Impressive! Well spotted!
> The problem is that drawing 1, 2, 3, etc. random numbers from the rng copies does not break the connection between them, and the resulting forest will not be random at all. Many trees, in fact, will be identical or very similar, and not by pure chance.
Ouch.
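To make the risk concrete, here is a minimal sketch (my own illustration, not code from the package) using the built-in `Xoshiro` generator: two copies advanced by 1 and 2 draws respectively produce streams that are merely shifted versions of one another, so any consumer that happens to realign them sees identical values.

```julia
using Random

rng = Xoshiro(42)
a = copy(rng)
b = copy(rng)

# "Decorrelate" the copies the way #174 did: draw i values from copy number i.
rand(a, 1)
rand(b, 2)

# The two streams still overlap almost entirely: b's stream is just
# a's stream shifted by a single draw.
xs = rand(a, 10)
ys = rand(b, 10)
xs[2:end] == ys[1:end-1]  # true: the copies remain tightly coupled
```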
The changes such as

```diff
  if rng isa Random.AbstractRNG
+     seeds = applicable(Random.seed!, rng, 0) ? rand(rng, UInt, n_trees) : nothing
      Threads.@threads for i in 1:n_trees
          # The Mersenne Twister (Julia's default) is not thread-safe.
          _rng = copy(rng)
-         # Take some elements from the ring to have different states for each tree. This
-         # is the only way given that only a `copy` can be expected to exist for RNGs.
-         rand(_rng, i)
+         if seeds !== nothing
+             # Seed the ring for each tree with a pseudo-random seed to put it
+             # into a predictable, but different state from all the others.
+             Random.seed!(_rng, seeds[i])
+         else
+             # Take some elements from the ring to have different states for each tree.
+             # This is the only way given that only a `copy` can be expected to exist for RNGs.
+             rand(_rng, i)
+         end
```

look good to me. Why is the `applicable` needed? Is that for Julia below 1.6?
`Random.seed!(_rng, seeds[i])` could probably be replaced by `Random.seed!(_rng, i)`. That way the `seeds` vector can be omitted. It doesn't matter too much for performance, but it should aid readability.

By the way, I wonder whether we should guarantee that the values in the current `seeds` vector are all distinct. In the current implementation, there may be trees which end up in exactly the same state. It shouldn't matter too much, since the number of trees in a random forest is typically much smaller than the number of possible `UInt` values, but it is still good to double-check.
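As a rough sanity check on that (a back-of-the-envelope birthday-bound estimate of my own, not code from the package): the probability that any two of `n` uniformly drawn 64-bit seeds coincide is about `n(n-1)/2 / 2^64`, which is vanishingly small for realistic forest sizes.

```julia
# Birthday bound: P(collision among n uniform 64-bit seeds) ≈ C(n, 2) / 2^64.
collision_probability(n) = n * (n - 1) / 2 / 2.0^64

collision_probability(1_000)  # ≈ 2.7e-14 even for a 1000-tree forest
```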
> look good to me. Why is the `applicable` needed? Is that for Julia below 1.6?
I added that part because it is not required to implement `Random.seed!` for all rng classes. (From the Julia docs: “Some RNGs don't accept a seed, like RandomDevice”.)
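For illustration (my own sketch; the `0` argument merely stands in for any integer seed when querying method applicability):

```julia
using Random

# `applicable` checks at runtime whether a matching method exists
# for the given argument values.
applicable(Random.seed!, MersenneTwister(), 0)  # true: MersenneTwister is seedable

# On the Julia versions discussed in this thread, RandomDevice has no
# integer-seed `seed!` method, so the same check returns false for it.
applicable(Random.seed!, RandomDevice(), 0)
```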
> `Random.seed!(_rng, seeds[i])` could probably be replaced by `Random.seed!(_rng, i)`. That way the `seeds` vector can be omitted. It doesn't matter too much for performance, but it should aid readability.
I think that would be too deterministic, i.e., the random numbers generated for the trees would not depend on the state of the rng passed to `build_forest`, only on its type. But we can employ the same trick as on the other branch (when the `rng` parameter is a number), i.e.: generate a single initial seed, and then offset that seed with `i`. That would indeed save on memory, and still use a different yet deterministic seed for each tree, one which depends on the state of the passed rng. Something like this, perhaps?
```julia
# Not all rngs are expected to implement `seed!`
spread = if applicable(Random.seed!, rng, 0)
    seed0 = rand(rng, UInt)
    # Seed each ring with a different (but deterministic) seed.
    (rng, i) -> Random.seed!(rng, seed0 + i)
else
    # Take some elements from the ring to have different states for each tree.
    (rng, i) -> rand(rng, i)
end
Threads.@threads for i in 1:n_trees
    # The Mersenne Twister (Julia's default) is not thread-safe.
    _rng = copy(rng)
    spread(_rng, i)
    inds = rand(_rng, 1:t_samples, n_samples)
    forest[i] = build_tree(
        labels[inds],
        features[inds, :],
        n_subfeatures,
        max_depth,
        min_samples_leaf,
        min_samples_split,
        min_purity_increase,
        rng = _rng,
        impurity_importance = impurity_importance)
end
```
> I added that part because it is not required to implement `Random.seed!` for all rng classes. (From the Julia docs: “Some RNGs don't accept a seed, like RandomDevice”.)
:+1:
> I think that would be too deterministic, i.e., the random numbers generated for the trees would not depend on the state of the rng passed to `build_forest`, only on its type.
Better be safe than sorry, I guess :+1:.
As a sidenote,
```julia
# Not all rngs are expected to implement `seed!`
spread = if applicable(Random.seed!, rng, 0)
    seed0 = rand(rng, UInt)
    # Seed each ring with a different (but deterministic) seed.
    (rng, i) -> Random.seed!(rng, seed0 + i)
else
    # Take some elements from the ring to have different states for each tree.
    (rng, i) -> rand(rng, i)
end
```
can be extracted into a separate function for readability. Something like:
```julia
function spread!(rng, i)
    if applicable(Random.seed!, rng, 0)
        seed0 = rand(rng, UInt)
        # Seed each ring with a different (but deterministic) seed.
        return Random.seed!(rng, seed0 + i)
    else
        # Take some elements from the ring to have different states for each tree.
        return rand(rng, i)
    end
end
```
I've added the exclamation mark to the end of the name since that's a convention for mutation of arguments in Julia.
@dhanak Your bug find helped me a lot for a paper that I'm currently working on! Can I mention you in the acknowledgements? I assume yes and let me know if you don't want that. You can respond here or to t.h.huijzer@rug.nl.
@dhanak do you want to open a pull request here? I can also do it if you want and make you co-author.
> can be extracted into a separate function for readability
I deliberately didn't want to make the `if` part of the function, because I didn't want to call `applicable` for every single tree. Also, this way, you mixed up the shared rng instance with the tree-wise copies. (I picked shadowing variable names out of laziness, I'll grant you that.) Here's another alternative, using multiple dispatch and proper functions. WDYT?
In `utils.jl`:

```julia
using Random
...
spread!(rng::Random.AbstractRNG, seed::Nothing, i::Integer) = rand(rng, i)
spread!(rng::Random.AbstractRNG, seed::Integer, i::Integer) = Random.seed!(rng, seed + i)
```
And then:

```julia
seed = applicable(Random.seed!, rng, 0) ? rand(rng, UInt) : nothing
Threads.@threads for i in 1:n_trees
    # The Mersenne Twister (Julia's default) is not thread-safe.
    _rng = copy(rng)
    util.spread!(_rng, seed, i)
    inds = rand(_rng, 1:t_samples, n_samples)
    forest[i] = build_tree(
        labels[inds],
        features[inds, :],
        n_subfeatures,
        max_depth,
        min_samples_leaf,
        min_samples_split,
        min_purity_increase,
        loss = loss,
        rng = _rng,
        impurity_importance = impurity_importance)
end
```
> Your bug find helped me a lot for a paper that I'm currently working on! Can I mention you in the acknowledgements?
Absolutely, I'd be honored!
> @dhanak do you want to open a pull request here? I can also do it if you want and make you co-author.
Yeah, I will, as soon as I find an implementation that we are moderately satisfied with, both syntactically and semantically. That being said, I'm still uneasy about the fallback solution.
Yes, the multiple dispatch looks good, yet the dispatch on `Nothing` vs. `Integer` to distinguish whether `seed!` is applicable also makes me uneasy.

Maybe @ablaom knows whether we even need to handle the absence of `seed!`. Since accuracy will be reduced when using `rand`, maybe we should throw a warning?
```julia
spread! = if applicable(Random.seed!, rng, 0)
    Random.seed!
else
    @warn "The used RNG does not implement `Random.seed!`. Falling back to `rand` which will reduce accuracy of the fitted model."
    rand
end
Threads.@threads for i in 1:n_trees
    # The Mersenne Twister (Julia's default) is not thread-safe.
    _rng = copy(rng)
    seed0 = rand(rng, UInt)
    spread!(_rng, seed0 + i)
    ...
end
```
With this adjustment, we are more or less back to my previous, “inline” suggestion. Please note, however, that you do not want to draw `seed0 + i` random numbers from any rng on the fallback branch; that would be uncomputably many. The warning is a good idea, perhaps with a `maxlog=1` attribute to make it appear only once.
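As an aside (a sketch of my own, not from the package), the standard loggers honor `maxlog`, so the warning inside the tree loop would indeed be emitted only once:

```julia
using Logging

# SimpleLogger honors the `maxlog` keyword, so the warning inside the loop
# is emitted only once even though the loop body runs three times.
buf = IOBuffer()
with_logger(SimpleLogger(buf, Logging.Warn)) do
    for i in 1:3
        @warn "Falling back to `rand`; accuracy may be reduced." maxlog=1
    end
end
output = String(take!(buf))
count("Warning", output)  # 1
```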
The `if` could be extracted as a separate function, e.g., `make_spread!`, which returns a function.
```julia
function make_spread!(rng::R)::Function where {R <: Random.AbstractRNG}
    # Not all rngs are expected to implement `seed!`
    if applicable(Random.seed!, rng, 0)
        seed0 = rand(rng, UInt)
        # Seed each ring with a different (but deterministic) seed.
        return (_rng::R, i::Integer) -> Random.seed!(_rng, seed0 + i)
    else
        @warn "The used RNG does not implement `Random.seed!`. Falling back to `rand` which will reduce accuracy of the fitted model." maxlog=1
        # Take some elements from the ring to have different states for each tree.
        return rand
    end
end
```
What if we tie seeding and copying together, like this?
```julia
function replicate(rng::R, n::Integer)::Vector{R} where {R <: Random.AbstractRNG}
    clones = [deepcopy(rng) for _ in 1:n]
    # not all rngs are expected to implement `seed!`
    if applicable(Random.seed!, rng, 0)
        seed_base = rand(rng, UInt)
        # seed each ring with a different (but deterministic) seed
        for i in 1:n
            Random.seed!(clones[i], seed_base + i)
        end
    else
        @warn "The used RNG does not implement `Random.seed!`. Falling back to `rand` which will reduce accuracy of the fitted model." maxlog=1
        # take some elements from the ring to have different states for each clone
        for i in 1:n
            rand(clones[i], i)
        end
    end
    return clones
end
```
And then call it as:

```julia
if rng isa Random.AbstractRNG
    local_rngs = util.replicate(rng, n_trees)
    Threads.@threads for i in 1:n_trees
        _rng = local_rngs[i]
        inds = rand(_rng, 1:t_samples, n_samples)
        forest[i] = build_tree(
            ...
```
Please also note that I replaced `copy` with `deepcopy`, because that also works for `RandomDevice`, and presumably for other, non-standard RNGs as well.
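A small sketch of the difference (my own illustration): `deepcopy` has a generic fallback that works for any object, whereas `copy` must be defined explicitly per type, and (on the Julia versions discussed here) it is not defined for `RandomDevice`.

```julia
using Random

rd = RandomDevice()

# `deepcopy` falls back to a generic structural copy, so it works here ...
rd2 = deepcopy(rd)
rd2 isa RandomDevice  # true

# ... while `copy` only exists for types that opt in, such as MersenneTwister.
hasmethod(copy, Tuple{MersenneTwister})  # true
```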
I like it because it moves all the complexity that we've been talking about in this whole issue into one place.
`function replicate(rng::R, n::Integer)::Vector{R} where {R <: Random.AbstractRNG}` can probably be `function replicate(rng::Random.AbstractRNG, n::Integer)`, since Julia will likely infer the return type automatically. For example, replacing the loop by broadcasting and taking two RNGs as an example:
```julia
julia> @code_warntype deepcopy.([Random.GLOBAL_RNG, Random.GLOBAL_RNG])
MethodInstance for (::var"##dotfunction#314#3")(::Vector{Random._GLOBAL_RNG})
  from (::var"##dotfunction#314#3")(x1) in Main
Arguments
  #self#::Core.Const(var"##dotfunction#314#3"())
  x1::Vector{Random._GLOBAL_RNG}
Body::Vector{Random._GLOBAL_RNG}
1 ─ %1 = Base.broadcasted(Main.deepcopy, x1)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(deepcopy), Tuple{Vector{Random._GLOBAL_RNG}}}
│   %2 = Base.materialize(%1)::Vector{Random._GLOBAL_RNG}
└── return %2
```

which shows that the inferred return type of the body is `Vector{Random._GLOBAL_RNG}`.
I see some very productive interaction here. Many thanks!
Do either of you know a good use case for non-seeding RNGs? If not, my vote is to remove support, unless you've found a way to reduce complexity.
> I see some very productive interaction here. Many thanks!
>
> Do either of you know a good use case for non-seeding RNGs? If not, my vote is to remove support, unless you've found a way to reduce complexity.
`RandomDevice` doesn't support `seed!`. Even though I don't really see using `RandomDevice` as an rng passed to `build_forest` as a viable option, it is nonetheless an `AbstractRNG` instance. Other than that, I'm only familiar with the two other built-in generators, so the short answer is no.

Of course, it is a perfectly valid API design decision to accept only rngs which support `seed!`, but then, I believe, this information should be included in the docstrings and whatnot.

So, @ablaom, what's your position on supporting `RandomDevice`? Shall I PR the more complicated code which uses `seed!` only if it is usable, or use it without checking, and add a disclaimer in the docstrings (where applicable)?
Assuming an error is thrown if one attempts to use a non-seedable RNG (which I expect would happen), my vote would be for the latter option. @rikhuijzer what do you think?
Merge #174 introduced a change in the initialization of rngs per tree (both for classification and regression forests). Namely, instead of using the same rng for all trees, it creates a copy for each of them, and then pushes them into different states by pulling a different number of random numbers from each of them. (See here for details.) The number of random numbers drawn is equal to the index of the tree being built.

While I fully respect the intent, and appreciate that before this change the code was not thread-safe, I must point out that this logic is fundamentally flawed. The problem is that drawing 1, 2, 3, etc. random numbers from the rng copies does not break the connection between them, and the resulting forest will not be random at all. Many trees, in fact, will be identical or very similar, and not by pure chance. I couldn't yet fully figure out why or where, but there is some implicit mechanism in the tree building which unintentionally resynchronizes the states of the rng copies. I have a hunch that it is related to the hypergeometric sampling in `_split!()` in `tree.jl`, but there could be something else, too.

The negative effect, however, is clearly noticeable, because the classification/regression accuracy of the resulting forests is suboptimal, and subtle changes in certain hyperparameters (such as `n_subfeatures`) end up causing major changes in prediction accuracy. I'm still struggling to find a suitable and small enough example to demonstrate the effect clearly; I'll let you know if I find one.

I have begun working on a fix in a forked repo, the crux of which is to use pre-generated pseudo-random seeds to move all the generators into unique states. Now, I understand that `seed!` is not necessarily implemented for all rng classes, so I added an applicability test, and fall back to the current behavior when `seed!` is not available.

To demonstrate the effect of this change, I ran the unit tests with the current version of `DecisionTree` and with my proposed change, and compared the results. The listed accuracies and confusion matrices improved noticeably in every single case, which sort of proves my point indirectly.

Here's the diff. The left-hand side is produced by the official package, while the right-hand side is the output of my fork.