brianhempel opened this issue 3 years ago
Very impressive performance! I thought I had managed to get decent optimisation on the histogram building and that the remaining speed bottleneck was mainly coming from the handling of the sets of indices flowing through the various nodes, but it seems your approach pushes performance a step above!
I'm definitely interested in looking into how the best features of each library could be brought together. I would need some time to digest how MemoryConstrainedTreeBoosting
is structured. Maybe a quick chat would help figure out the broad lines of how to bring all that to life?
BTW, are you using any subsampling trick on the observations such as based on 1st order gradient magnitude or are all the histograms built from 100% of the obs?
> BTW, are you using any subsampling trick on the observations such as based on 1st order gradient magnitude or are all the histograms built from 100% of the obs?
100%. But always calculating the smaller of two siblings first so the other can be derived.
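The sibling-subtraction trick mentioned here can be sketched as follows (a Python illustration with hypothetical function names, not the actual MCTB code): build the histogram only for the smaller of the two children, then derive the sibling's histogram as the parent's minus the smaller child's, saving a full pass over the larger child's data.

```python
import numpy as np

def build_histogram(bins, grads, idxs, nbins):
    """Accumulate gradients into per-bin sums for one feature."""
    hist = np.zeros(nbins)
    for i in idxs:
        hist[bins[i]] += grads[i]
    return hist

def child_histograms(parent_hist, bins, grads, left_idxs, right_idxs, nbins):
    """Build only the smaller child's histogram; derive the sibling by subtraction."""
    if len(left_idxs) <= len(right_idxs):
        left = build_histogram(bins, grads, left_idxs, nbins)
        right = parent_hist - left   # derived for free, no second data pass
    else:
        right = build_histogram(bins, grads, right_idxs, nbins)
        left = parent_hist - right
    return left, right
```

This works because every observation in the parent falls into exactly one child, so the parent's per-bin sums are the elementwise sum of the two children's.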
There are a couple of further tricks to get performance. The first/easiest to try is loop tiling. It's not just the indices that slow you down, but the δ, δ², and 𝑤 arrays too. Each feature-datapoint is 1 byte, but the inner loop is also consuming 4 or 8 bytes for its index, 4 bytes for δ, 4 bytes for δ², and 4 bytes for 𝑤. For 1M obs and 100 features, that's 1,000,000 × (1+4+4+4+4) × 100 = 1.7GB of memory traffic per full histogram pass, even though the histograms at least will always be in L1 cache.
To keep both the histogram(s) and the indices/δ/δ²/𝑤 in cache as best as possible, you can tile: have each thread process ~10 features at a time, ~2,000 indices at a time. 2,000 × (4+4+4+4) = 32KB of indices/δ/δ²/𝑤, and 10 × (64 bins × 3 × 4 bytes) ≈ 8KB of histograms. That will handily fit in an L2 cache, and effectively reduces the traffic to RAM for indices/δ/δ²/𝑤 by 10x.
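The tiling above can be sketched like this (Python/NumPy for illustration; MCTB itself is Julia, and the function name and tile sizes here are illustrative, not its actual code). Each chunk of indices and its gradient slices is reused across a block of features before moving on, so those slices stay cache-resident and only the 1-byte bin codes stream from RAM per feature:

```python
import numpy as np

def build_hists_tiled(X_bin, idxs, d1, d2, w, nbins,
                      idx_tile=2000, feat_tile=10):
    """Tiled histogram build: reuse each tile of indices/grads across a
    block of features, cutting their RAM traffic by ~feat_tile x."""
    nobs, nfeat = X_bin.shape
    hists = np.zeros((nfeat, nbins, 3))  # per feature: (sum d1, sum d2, sum w) per bin
    for f0 in range(0, nfeat, feat_tile):
        for i0 in range(0, len(idxs), idx_tile):
            chunk = idxs[i0:i0 + idx_tile]
            cd1, cd2, cw = d1[chunk], d2[chunk], w[chunk]  # ~32KB, stays in L2
            for f in range(f0, min(f0 + feat_tile, nfeat)):
                b = X_bin[chunk, f]
                np.add.at(hists[f, :, 0], b, cd1)  # scatter-add into bins
                np.add.at(hists[f, :, 1], b, cd2)
                np.add.at(hists[f, :, 2], b, cw)
    return hists
```

The result is identical to the untiled loop; only the traversal order (and therefore the memory traffic) changes.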
Currently, in MCTB each thread builds 12 histograms at a time, with 20,736 or 320 indices at a time, depending on whether it's the first pass (which can be expressed as a range over all the data) or a sparse pass deeper in the tree.
Other tricks in MCTB:
> Maybe quick chat may help figuring out the broad lines of how to bring all that to life?
Yes, perhaps sometime next week. I think mostly it will just be porting over the above performance hacks.
@brianhempel Thanks a lot for providing those explanations.
I've started making some refactoring in the dev-refactor
branch. The optimizations per se haven't been incorporated yet, but I've looked to adapt the structure to accommodate the performance bottlenecks you've mentioned.
Gradients are now stored in a single [3, obs] matrix: first- and second-order grads plus weight. Histograms are stored per feature (num features of them), each being a vector of length 3 * nbins (grads and weight). Regarding the split of the observation ids into left/right children, I've attempted a non-allocating approach here, but the logic apparently has some flaws, as it breaks when using a row-subsampling < 1.0 or for depth > 3 (when more than a single depth reuses the pre-allocated left/right views). See here for how they're initialized.
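A non-allocating split of the sort described can be sketched like this (a Python illustration; the name split_ids and the buffer layout are hypothetical, not EvoTrees' actual code). The key constraint, which matches the failure mode described above, is that each live tree level needs its own (or disjoint) preallocated buffers, since a parent's index views must survive while its children are being split:

```python
import numpy as np

def split_ids(node_idxs, X_bin, feat, cond_bin, left_buf, right_buf):
    """Partition node_idxs into preallocated left/right buffers without
    allocating. Returns views of the filled prefixes. Buffers must not be
    shared across tree levels that are still in use."""
    nl = nr = 0
    for i in node_idxs:
        if X_bin[i, feat] <= cond_bin:
            left_buf[nl] = i
            nl += 1
        else:
            right_buf[nr] = i
            nr += 1
    return left_buf[:nl], right_buf[:nr]
```

Reusing one buffer pair for every depth works only while a single level is live; once depth > 3 (or subsampling shrinks node sizes unpredictably), deeper levels clobber their ancestors' views, which is consistent with the bug described.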
The above had a very minor impact on speed on smaller datasets, but the size of allocations was cut in half, which appears to have greater benefits on larger datasets (30% speed improvement on 5M rows / 100 features).
Does this restructuring effectively look better adapted to accommodate the tiling and other optimizations? I somehow held the belief that the compiler was able to perform those kinds of optimisations, a bit like it seems able to do on simple loops where @simd
annotations aren't needed. Any feedback or hints on how you'd see the best way to grab the lower-hanging fruit would be much appreciated!
Quick thoughts:

1. I would avoid the @simd macro for update_hists!. There's nothing there that can be trivially SIMD'd because there are gather/scatter load/stores. So if the macro is triggering, it is likely triggering erroneously anyway.
2. You can check for SIMD instructions with InteractiveUtils.@code_native or InteractiveUtils.@code_llvm.
3. There's a @polly macro but it's not documented and might require a special build of Julia/LLVM. Might not hurt to try it though. I suspect Julia may get some macros one day for this.
4. There is a TiledIteration.jl package that will generate the index chunks for you. Hopefully it doesn't allocate.
5. Shouldn't it be X_bin[𝑖[i], feat] <= cond_bin? I like to call these kinds of variables ii for "index index" rather than just i, because an ii is not valid to use directly in the base data.

Thank you for these hints!
I've spent most of my recent efforts getting EvoTrees ready for a release based on the restructuring, to better handle the various optimisations. In particular, refactoring the GPU part to handle the move to the set split. It will be part of the v0.8.0 release.
On Julia 1.6.1 (which resolved the main pain point about loop slowdowns), EvoTrees is now 3X slower on your benchmark (15s vs 5s). Allocations have also been reduced significantly. EvoTrees' initialization is slow, mostly because of the binning process, which takes about 2 sec out of the 15 sec, so that could be another low-hanging fruit, though less impactful for large iteration counts.
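Quantile-based binning of the kind described here usually amounts to computing quantile edges once, then mapping each value to a bin code with a binary search. A Python/NumPy sketch (illustrative only, not EvoTrees' actual binning code):

```python
import numpy as np

def quantile_bins(x, nbins=64):
    """Map a raw feature column to uint8 bin codes via quantile edges.

    Returns (codes, edges); codes are in 0..nbins-1, edges has nbins-1 cut
    points. uint8 is enough for nbins <= 256, keeping the binned matrix at
    1 byte per feature-datapoint."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, nbins + 1)[1:-1])
    codes = np.searchsorted(edges, x, side="left").astype(np.uint8)
    return codes, edges
```

If the real binning is slow, candidates are the quantile computation itself (which sorts) or doing it per-thread; subsampling rows for the quantile estimate is a common shortcut.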
> I would avoid the @simd macro for update_hists!

Benchmark shows a slight speedup for some reason when adding @simd, and with correct results, so I would leave it there at the moment. That being said, given the significant speed gap with your implementation, my understanding is that it's likely the histogram building that I should focus on next with the tiled+SIMD approach. If it's feasible on your side, having a PoC of such an implementation that performs the equivalent of this hist building would be very helpful. Having it benchmarked could confirm whether it's really the hist building that's the remaining bottleneck. Otherwise, I'd assume it's the remaining memory allocations during the loop that hurt the process.
> 5. Shouldn't it be X_bin[𝑖[i], feat] <= cond_bin?
Absolutely! Thanks for pointing this out. A multi-threaded version has also been implemented.
> Benchmark shows slight speedup for some reason when adding the @simd and with correct results, so I would let it there at the moment.
With @code_native I see it is using SIMD instructions, but not in any useful way: it's using SIMD registers to load/add one float at a time, just like it would with normal registers. I'm guessing it's dumb luck or lower register pressure that makes it faster. I think x86 kind of doesn't have enough general purpose registers.
> confirm whether it's really that hist building that's the remaining bottleneck
I'm sure it is. It's the only part of the computation that's O(nobs*nfeatures). Set partitioning is O(nobs) and split selection is O(nbins*nfeatures).
> tiled+simd PoC
On it.
Maybe we can merge projects?
EvoTrees:
MemoryConstrainedTreeBoosting:
MemoryConstrainedTreeBoosting.jl is a library I've been working on for a couple of years that allows me to control the loading and binning of data, so I can (a) do feature engineering in Julia and (b) use all the memory on my machine for the binned data. I've also spent a lot of time on speed, because 10% faster ≈ training done 1 day sooner for my data sets. I think it is quite fast. I didn't bother documenting the library until today, however.
With our powers combined...!
A benchmark below. 4 threads on my 2013 quad-core i7-4960HQ.