Substantial Hilbert tree improvement: Improve distance threshold calculation & memory access pattern

Avoid loading the full bounding box (bb): compute_forces only needs the node width, not the min and max corners of the bb, so this PR calculates the node width once at tree construction. During compute_forces we then load only the node width. In 3D, each distance threshold calculation now loads sizeof(T) bytes instead of 6 * sizeof(T).
Avoid the divide and std::abs calls in the distance threshold by reformulating the equation.
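
For readers who have not looked at the diff, here is a minimal sketch of the two changes above. It assumes the classic Barnes-Hut acceptance test `width / distance < theta`; the exact formula used in the repository is not quoted in this description, and the names `node_width`, `accept_baseline` and `accept_reformulated` are hypothetical, not identifiers from the code base.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>

// Hypothetical helper: width of a node, computed once at tree construction
// from the bounding box corners and stored per node.
template <typename T>
T node_width(const std::array<T, 3>& bb_min, const std::array<T, 3>& bb_max) {
    T w = T(0);
    for (std::size_t d = 0; d < 3; ++d) w = std::max(w, bb_max[d] - bb_min[d]);
    return w;
}

// Assumed baseline: loads both corners (6 values), uses std::abs, sqrt and a divide.
template <typename T>
bool accept_baseline(const std::array<T, 3>& bb_min, const std::array<T, 3>& bb_max,
                     const std::array<T, 3>& dx, T theta) {
    T w = T(0);
    for (std::size_t d = 0; d < 3; ++d) w = std::max(w, std::abs(bb_max[d] - bb_min[d]));
    T dist = std::sqrt(dx[0] * dx[0] + dx[1] * dx[1] + dx[2] * dx[2]);
    return w / dist < theta;
}

// Reformulated test: loads one value (the stored width) and compares squares.
// width / dist < theta  is equivalent to  width^2 < theta^2 * dist^2
// because width, dist and theta are all non-negative.
template <typename T>
bool accept_reformulated(T width, const std::array<T, 3>& dx, T theta) {
    T dist2 = dx[0] * dx[0] + dx[1] * dx[1] + dx[2] * dx[2];
    return width * width < theta * theta * dist2;
}
```

With theta=0.5, a node of width 1 at distance 3 is accepted by both forms (1/3 < 0.5 and 1 < 2.25), while the same node at distance 1 is rejected by both.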

With this, time per step for 7 million particles drops from 2.8s to 1.8s on my GPU with theta=0.5. (outdated, see EDIT)

Not sure if we need to include this in the paper, as most of our conclusions likely won't change. It may be useful for future work, or for the camera-ready version.

EDIT: I've added an additional improvement. We no longer store monopole mass and position separately; instead, a single 4-component vector holds both (in the 3D case). This not only simplifies the memory access pattern, it also causes our vec objects to become aligned such that compilers can emit vector loads. With this, time per step drops further to 1.3s.
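
To illustrate the packed layout described in the EDIT, here is a rough sketch under stated assumptions: `vec4`, `make_monopole` and `mass` are hypothetical names, and storing the mass in the fourth component is a guess at the layout rather than a quote from the code. The point is only that a 4-component type can be given an alignment equal to its size, which a 3-component type generally cannot.

```cpp
#include <cstddef>

// Hypothetical 4-component vector. Its alignment equals its size, so one
// element can be fetched with a single aligned vector load where the
// target supports loads of that width.
template <typename T>
struct alignas(4 * sizeof(T)) vec4 {
    T x, y, z, w;
};

static_assert(sizeof(vec4<float>) == 16, "four packed floats");
static_assert(alignof(vec4<float>) == 16, "16-byte aligned");
static_assert(sizeof(vec4<double>) == 32, "four packed doubles");
static_assert(alignof(vec4<double>) == 32, "32-byte aligned");

// Assumed packing: x, y, z hold the monopole position, w holds its mass.
template <typename T>
vec4<T> make_monopole(T px, T py, T pz, T m) {
    return vec4<T>{px, py, pz, m};
}

// Reads the mass from the same element that holds the position,
// so position and mass arrive in one memory access.
template <typename T>
T mass(const vec4<T>& monopole) {
    return monopole.w;
}
```

By contrast, a 3-component position plus a separate mass array needs two loads from two different arrays per node visit.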

UoB-HPC / stdpar-nbody, pull request #39