brianhempel opened this issue 3 years ago
Very impressive performance! I thought I had managed to get decent optimisation on the histogram building and that the remaining speed bottleneck was mainly coming from the handling of the sets of indices flowing through the various nodes, but it seems your approach pushes performance a step above!
I'm definitely interested in looking into how the best features of each library could be brought together. I would need some time to digest how MemoryConstrainedTreeBoosting
is structured. Maybe a quick chat would help figure out the broad lines of how to bring all that to life?
BTW, are you using any subsampling trick on the observations such as based on 1st order gradient magnitude or are all the histograms built from 100% of the obs?
> BTW, are you using any subsampling trick on the observations such as based on 1st order gradient magnitude or are all the histograms built from 100% of the obs?
100%. But always calculating the smaller of two siblings first so the other can be derived.
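The sibling-subtraction trick mentioned here can be sketched as follows (a Python illustration with hypothetical function names, not the actual MCTB code): build the histogram only for the smaller of the two children, then derive the sibling's histogram as the parent's minus the smaller child's, saving a full pass over the larger child's data.

```python
import numpy as np

def build_histogram(bins, grads, idxs, nbins):
    """Accumulate gradients into per-bin sums for one feature."""
    hist = np.zeros(nbins)
    for i in idxs:
        hist[bins[i]] += grads[i]
    return hist

def child_histograms(parent_hist, bins, grads, left_idxs, right_idxs, nbins):
    """Build only the smaller child's histogram; derive the sibling by subtraction."""
    if len(left_idxs) <= len(right_idxs):
        left = build_histogram(bins, grads, left_idxs, nbins)
        right = parent_hist - left   # derived for free, no second data pass
    else:
        right = build_histogram(bins, grads, right_idxs, nbins)
        left = parent_hist - right
    return left, right
```

This works because every observation in the parent falls into exactly one child, so the parent's per-bin sums are the elementwise sum of the two children's.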
There are a couple of further tricks to get performance. The first/easiest to try is loop tiling. It's not just the indices that slow you down, but the δ, δ², and 𝑤 arrays too. Each feature-datapoint is 1 byte, but the inner loop is also consuming 4 or 8 bytes for its index, 4 bytes for δ, 4 bytes for δ², and 4 bytes for 𝑤. For 1M obs and 100 features, that's 1,000,000 × (1+4+4+4+4) × 100 = 1.7GB of memory traffic per full histogram pass, even though the histograms at least will always be in L1 cache.
To keep both the histogram(s) and the indices/δ/δ²/𝑤 in cache as best as possible, you can tile: have each thread process ~10 features at a time, ~2,000 indices at a time. 2,000 × (4+4+4+4) = 32KB of indices/δ/δ²/𝑤, and 10 × (64 bins × 3 × 4 bytes) ≈ 8KB of histograms. That will handily fit in an L2 cache, and effectively reduces the traffic to RAM for indices/δ/δ²/𝑤 by 10x.
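The tiling above can be sketched like this (Python/NumPy for illustration; MCTB itself is Julia, and the function name and tile sizes here are illustrative, not its actual code). Each chunk of indices and its gradient slices is reused across a block of features before moving on, so those slices stay cache-resident and only the 1-byte bin codes stream from RAM per feature:

```python
import numpy as np

def build_hists_tiled(X_bin, idxs, d1, d2, w, nbins,
                      idx_tile=2000, feat_tile=10):
    """Tiled histogram build: reuse each tile of indices/grads across a
    block of features, cutting their RAM traffic by ~feat_tile x."""
    nobs, nfeat = X_bin.shape
    hists = np.zeros((nfeat, nbins, 3))  # per feature: (sum d1, sum d2, sum w) per bin
    for f0 in range(0, nfeat, feat_tile):
        for i0 in range(0, len(idxs), idx_tile):
            chunk = idxs[i0:i0 + idx_tile]
            cd1, cd2, cw = d1[chunk], d2[chunk], w[chunk]  # ~32KB, stays in L2
            for f in range(f0, min(f0 + feat_tile, nfeat)):
                b = X_bin[chunk, f]
                np.add.at(hists[f, :, 0], b, cd1)  # scatter-add into bins
                np.add.at(hists[f, :, 1], b, cd2)
                np.add.at(hists[f, :, 2], b, cw)
    return hists
```

The result is identical to the untiled loop; only the traversal order (and therefore the memory traffic) changes.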
Currently, in MCTB each thread builds 12 histograms at a time, with 20,736 or 320 indices at a time, depending on whether it's the first pass (which can be expressed as a range over all the data) or a sparse pass deeper in the tree.
Other tricks in MCTB:
> Maybe quick chat may help figuring out the broad lines of how to bring all that to life?
Yes, perhaps sometime next week. I think mostly it will just be porting over the above performance hacks.
@brianhempel Thanks a lot for providing those explanations.
I've started making some refactoring in the dev-refactor
branch. The optimizations per se haven't been incorporated yet, but I've looked to adapt the structure to accommodate the performance bottlenecks you've mentioned.
Gradients are now stored in a single [3, obs] matrix: first- and second-order grads plus weight. Histograms are stored per feature (num features of them), each being a vector of length 3 * nbins (grads and weight). Regarding the split of the observation ids into left/right children, I've attempted a non-allocating approach here, but the logic apparently has some flaws, as it breaks when using a row-subsampling < 1.0 or for depth > 3 (when more than a single depth reuses the pre-allocated left/right views). See here for how they're initialized.
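A non-allocating split of the sort described can be sketched like this (a Python illustration; the name split_ids and the buffer layout are hypothetical, not EvoTrees' actual code). The key constraint, which matches the failure mode described above, is that each live tree level needs its own (or disjoint) preallocated buffers, since a parent's index views must survive while its children are being split:

```python
import numpy as np

def split_ids(node_idxs, X_bin, feat, cond_bin, left_buf, right_buf):
    """Partition node_idxs into preallocated left/right buffers without
    allocating. Returns views of the filled prefixes. Buffers must not be
    shared across tree levels that are still in use."""
    nl = nr = 0
    for i in node_idxs:
        if X_bin[i, feat] <= cond_bin:
            left_buf[nl] = i
            nl += 1
        else:
            right_buf[nr] = i
            nr += 1
    return left_buf[:nl], right_buf[:nr]
```

Reusing one buffer pair for every depth works only while a single level is live; once depth > 3 (or subsampling shrinks node sizes unpredictably), deeper levels clobber their ancestors' views, which is consistent with the bug described.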
The above had a very minor impact on speed on smaller datasets, but the size of allocations was cut in half, which appears to have greater benefits on larger datasets (30% speed improvement on 5M rows / 100 features).
Does this restructuring effectively look better adapted to accommodate the tiling and other optimizations? I somehow held the belief that the compiler was able to perform those kinds of optimisations, a bit like it seems able to do on simple loops where @simd
annotations aren't needed. Any feedback or hints on how you'd see the best way to grab the lower-hanging fruit would be much appreciated!
Quick thoughts:

1. I would avoid the @simd macro for update_hists!. There's nothing there that can be trivially SIMD'd because there are gather/scatter load/stores. So if the macro is triggering, it is likely triggering erroneously anyway.
2. You can check for SIMD instructions with InteractiveUtils.@code_native or InteractiveUtils.@code_llvm.
3. There's a @polly macro but it's not documented and might require a special build of Julia/LLVM. Might not hurt to try it though. I suspect Julia may get some macros one day for this.
4. There is a TiledIteration.jl package that will generate the index chunks for you. Hopefully it doesn't allocate.
5. Shouldn't it be X_bin[𝑖[i], feat] <= cond_bin? I like to call these kinds of variables ii for "index index" rather than just i, because an ii is not valid to use directly in the base data.

Thank you for these hints!
I've spent most of my recent efforts getting EvoTrees ready for a release based on the restructuring, to better handle the various optimisations. In particular, refactoring the GPU part to handle the move to the set split. It will be part of the v0.8.0 release.
On Julia 1.6.1 (which resolved the main pain point about loop slowdowns), EvoTrees is now 3X slower on your benchmark (15s vs 5s). Allocations have also been reduced significantly. EvoTrees' initialization is slow, mostly because of the binning process, which takes about 2 sec out of the 15 sec, so that could be another low-hanging fruit, though less impactful for large iteration counts.
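Quantile-based binning of the kind described here usually amounts to computing quantile edges once, then mapping each value to a bin code with a binary search. A Python/NumPy sketch (illustrative only, not EvoTrees' actual binning code):

```python
import numpy as np

def quantile_bins(x, nbins=64):
    """Map a raw feature column to uint8 bin codes via quantile edges.

    Returns (codes, edges); codes are in 0..nbins-1, edges has nbins-1 cut
    points. uint8 is enough for nbins <= 256, keeping the binned matrix at
    1 byte per feature-datapoint."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, nbins + 1)[1:-1])
    codes = np.searchsorted(edges, x, side="left").astype(np.uint8)
    return codes, edges
```

If the real binning is slow, candidates are the quantile computation itself (which sorts) or doing it per-thread; subsampling rows for the quantile estimate is a common shortcut.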
> I would avoid the @simd macro for update_hists!

Benchmark shows a slight speedup for some reason when adding @simd, and with correct results, so I would leave it there at the moment. That being said, given the significant speed gap with your implementation, my understanding is that it's likely the histogram building that I should focus on next with the tiled+SIMD approach. If it's feasible on your side, having a PoC of such an implementation that performs the equivalent of this hist building would be very helpful. Having it benchmarked could confirm whether it's really the hist building that's the remaining bottleneck. Otherwise, I'd assume it's the remaining memory allocations during the loop that hurt the process.
> 5. Shouldn't it be X_bin[𝑖[i], feat] <= cond_bin?
Absolutely! Thanks for pointing this out. A multi-threaded version has also been implemented.
> Benchmark shows slight speedup for some reason when adding the @simd and with correct results, so I would let it there at the moment.
With @code_native I see it is using SIMD instructions, but not in any useful way: it's using SIMD registers to load/add one float at a time, just like it would with normal registers. I'm guessing it's dumb luck or lower register pressure that makes it faster. I think x86 kind of doesn't have enough general purpose registers.
> confirm whether it's really that hist building that's the remaining bottleneck
I'm sure it is. It's the only part of the computation that's O(nobs*nfeatures). Set partitioning is O(nobs) and split selection is O(nbins*nfeatures).
> tiled+simd PoC
On it.
Maybe we can merge projects?
EvoTrees:
MemoryConstrainedTreeBoosting:
MemoryConstrainedTreeBoosting.jl is a library I've been working on for a couple of years that allows me to control the loading and binning of data, so I can (a) do feature engineering in Julia and (b) use all the memory on my machine for the binned data. I've also spent a lot of time on speed, because 10% faster ≈ training done 1 day sooner for my data sets. I think it is quite fast. I didn't bother documenting the library until today, however.
With our powers combined...!
A benchmark below. 4 threads on my 2013 quad-core i7-4960HQ.