Avoid loading the full bounding box (bb): compute_forces only needs the node width, not the full min and max corners of the bb, so this PR calculates the node width once at tree construction. During compute_forces we then load only the node width. Each distance threshold calculation therefore loads sizeof(T) bytes instead of 6 * sizeof(T) bytes.
Avoid the divide and std::abs calls in the distance threshold by reformulating the equation.
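For readers unfamiliar with the threshold test, here is a minimal sketch of what the two changes above could look like. It assumes the usual Barnes-Hut acceptance criterion width / distance < theta. The function names and signatures are hypothetical and are not taken from this PR, and the actual reformulation in the code may differ.

```cpp
#include <cmath>

// Hypothetical "before" version: derives the node width from the bounding box
// corners on every call and uses std::abs, std::sqrt and a divide.
template <class T>
bool accept_node_reference(T min_x, T max_x, T dx, T dy, T dz, T theta) {
    T width = std::abs(max_x - min_x);
    T dist = std::sqrt(dx * dx + dy * dy + dz * dz);
    return width / dist < theta;
}

// Hypothetical "after" version: takes the node width precomputed at tree
// construction and compares squared quantities, so no divide, std::abs or
// square root is needed. For width >= 0 and theta >= 0 this is algebraically
// the same test as above.
template <class T>
bool accept_node_fast(T width, T dx, T dy, T dz, T theta) {
    T dist2 = dx * dx + dy * dy + dz * dz;
    return width * width < theta * theta * dist2;
}
```

Both versions reject a node when the distance is zero, so the squared form does not need a special case for coincident positions.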
With this, time per step for 7 million particles drops from 2.8s to 1.8s on my GPU with theta=0.5. (outdated, see EDIT)
Not sure if we need to include this in the paper, as most of our conclusions likely won't change. It may be useful for future work, or for the camera-ready version.
EDIT: I've added an additional improvement. We no longer store monopole mass and position separately; instead, a single 4-component vector holds both (in the 3D case). This not only simplifies the memory access pattern, it also aligns our vec objects so that compilers can emit vector loads. With this, time per step drops further to 1.3s.
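As an illustration of the packed layout, here is a minimal sketch assuming a 3D simulation. The type names vec3 and monopole and the helper squared_distance are hypothetical and do not reflect the actual vec class in the code base. Whether a compiler really emits a single vector load depends on the target and the toolchain.

```cpp
#include <cstddef>

// Hypothetical separate layout: a 3-component position only has the alignment
// of its scalar type, and the mass lives in a second array.
template <class T>
struct vec3 {
    T x, y, z;
};

// Hypothetical packed layout: position and mass share one 4-component record.
// Its size is a power of two, so it can be aligned to its own size, which lets
// a compiler fetch the whole record with one aligned vector load.
template <class T>
struct alignas(4 * sizeof(T)) monopole {
    T x, y, z;  // center of mass
    T mass;     // total mass of the node
};

static_assert(alignof(vec3<float>) == alignof(float), "vec3 is only scalar-aligned");
static_assert(sizeof(monopole<float>) == 16, "four floats, no padding");
static_assert(alignof(monopole<float>) == 16, "aligned to the full record");

// Squared distance from a particle at (px, py, pz) to the node's center of
// mass; one monopole record supplies everything the force kernel needs.
template <class T>
T squared_distance(const monopole<T>& m, T px, T py, T pz) {
    T dx = m.x - px;
    T dy = m.y - py;
    T dz = m.z - pz;
    return dx * dx + dy * dy + dz * dz;
}
```

With the separate layout, reading one node touches two arrays and three unaligned scalars; with the packed layout it touches one aligned record.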