FHof opened 2 years ago
Traces for 1D Boole integration; N: 17850621, precision: float32, integrand: sin_prod (anp.prod(anp.sin(x), axis=1))
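For context, composite Boole's rule on a uniform grid needs N % 4 == 1 points (the N above, 17850621, satisfies this). Here is a minimal NumPy sketch with illustrative names; it is not torchquad's implementation:

```python
import numpy as np

def boole_1d(f, a, b, N):
    """Composite Boole's rule on [a, b] with N points (N % 4 == 1)."""
    assert N % 4 == 1, "Boole's rule needs 4k+1 points"
    x = np.linspace(a, b, N)
    h = (b - a) / (N - 1)
    # Weight pattern 7,32,12,32,14,32,12,32,...,7; interior panel
    # endpoints are shared, so their weight 7 doubles to 14.
    w = np.zeros(N)
    w[0::4] = 14.0
    w[1::4] = 32.0
    w[2::4] = 12.0
    w[3::4] = 32.0
    w[0] = w[-1] = 7.0
    return (2.0 * h / 45.0) * np.dot(w, f(x))

# 1D sin_prod reduces to sin(x); its integral over [0, pi] is 2.
print(boole_1d(np.sin, 0.0, np.pi, 101))  # ≈ 2.0
```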
uncompiled tensorflow:
linspace is apparently executed on the CPU and the result is copied to the GPU.
With tf.debugging.set_log_device_placement(True), tensorflow claimed that all operations were executed on the GPU, which contradicts the profiling and benchmarking measurements.
uncompiled jax: It looks like the gather operation for the slices calculates indices on the CPU and then transfers them to the GPU.
I tried to force execution on the GPU with tensorflow and jax and had no success. With tensorflow's with tf.device('/GPU:0'): context manager, linspace was still executed on the CPU. To test whether the context manager works in my code, I also tried with tf.device('/CPU:0'):, which correctly forced all execution onto the CPU.
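A minimal sketch of the two TensorFlow diagnostics described above (placement logging and the device context manager); the GPU branch assumes a visible GPU, and the linspace call is a stand-in for the integration grid construction:

```python
import tensorflow as tf

# Log where each op is (claimed to be) placed, as mentioned above.
tf.debugging.set_log_device_placement(True)

# Pin ops with the context manager; '/CPU:0' always exists,
# '/GPU:0' only on machines with a visible GPU.
device = '/GPU:0' if tf.config.list_physical_devices('GPU') else '/CPU:0'
with tf.device(device):
    grid = tf.linspace(0.0, 1.0, 5)

print(grid.device)  # full device string of the result tensor
```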
For jax I tried jax.device_put(integration_domain, jax.devices()[0]) so that the integration domain and the tensors calculated from it have a committed device; the indices for slicing were still gathered on the CPU.
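A sketch of that attempt; the domain values are illustrative. jax.devices()[0] is the default backend's first device (the GPU when a GPU backend is active, otherwise the CPU), and arrays computed from a committed array follow it to that device:

```python
import jax
import jax.numpy as jnp

# Commit the integration domain to an explicit device.
integration_domain = jnp.array([[0.0, 1.0]])
integration_domain = jax.device_put(integration_domain, jax.devices()[0])

# A grid derived from the committed array inherits its device.
grid = jnp.linspace(integration_domain[0, 0], integration_domain[0, 1], 5)
```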
I also tried float64 precision; tensorflow and jax, respectively, still executed those operations on the CPU.
With compilation these parts are executed on the GPU and the integration is more than 10 times faster.
Corresponding code and measured data: measurement_data.zip
Plots with points per dimension for 1D and 2D: In 2D there are fewer points per dimension, so the linspace operation in tensorflow is not a bottleneck.
Here's a picture of a Snakeviz visualisation of cProfile output for the current VEGAS integrator on GPU with N=50000, dim=4, the sin_prod integrand and float32 precision. VEGASMap.accumulate_weight, VEGASMap.update_map and VEGASStratification.accumulate_weight seem to be the performance bottlenecks. It looks similar on CPU with different numbers of points, except with a very high number of points (N=500000) on CPU, where VEGASMap.accumulate_weight requires significantly more time than the other two slow functions.
Snakeviz visualisations of cProfile output for VEGAS integration with torch, CUDA, dim=4, N=300000, no CUDA_LAUNCH_BLOCKING, and no gradients. I compared the measurements before and after the changes of #28. I couldn't use the pytorch profiler because, at least before the changes, it significantly slowed down the code and produced very large output files.
VEGAS before code changes:
VEGAS after the changes:
Integrand from: https://vegas.readthedocs.io/en/latest/tutorial.html#basic-integrals Code and measurement data used for the plots: vegas_peak measurements.zip I used different parameters for the implementation:
The measurements indicate that all these VEGAS implementations perform better than (torchquad's) MonteCarlo with this integrand:
Like the accuracies, the required times depend on the implementation-specific configuration. I did not write code to measure times reliably, e.g. by synchronizing CUDA with torch. Nonetheless, here's a plot for the times: vegas_time_vegas_peak
Here's an accuracy comparison with the gaussian_peaks integrand and the same setup as before. raw measurements: tmp_vegas_measurements.csv.zip accuracy plot: vegas_accuracy_gaussian_peaks time plot: vegas_time_gaussian_peaks
The gradient is over the integration domain.
"Parts compiled" excludes the compilation of the backward step. In comparison to JAX and Tensorflow, with PyTorch the backward step is also uncompiled in the "all compiled" case.
Results look similar with the simple sin_prod and complicated gaussian_peaks integrands, and somewhat similar to measurements without gradient calculation.
With compilation, torch and the gaussian_peaks integrand, the first measurement was very slow, so the benchmarking script aborted.
Measurements and code: gradients benchmarked.zip
Code used for profiling: benchmarking_and_profiling.zip, commit e2cd8b0b0357ab8
When I try the profiling on CPU, it hangs unless I disable the profile_memory argument; this doesn't happen with CUDA.
The tensorboard trace does not graphically visualize the Python3 functions but only the low-level aten operations, although I set the with_stack=True argument in torch.profiler.profile; clicking on an operation does show the related file and line number in the Python3 VEGAS implementation files.
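A sketch of a profiler configuration along the lines described above; the workload is a stand-in, not the VEGAS integrator, and profile_memory is only enabled with CUDA because of the CPU hang mentioned earlier:

```python
import torch
from torch.profiler import profile, ProfilerActivity

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

# with_stack=True records Python call stacks for the trace;
# profile_memory tracks tensor allocations.
with profile(activities=activities, with_stack=True,
             profile_memory=torch.cuda.is_available()) as prof:
    x = torch.linspace(0.0, 1.0, 1000)
    y = torch.sin(x).prod()

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```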
The memory units "MB" are inaccurate; they are MiB.
Operator view:
Trace:
Memory usage (all and zoomed-in):
Trace (all and zoomed-in):
Memory usage:
The benchmarking script by default blocks execution before and after the integrand evaluation with torch.cuda.synchronize() (torch) and .block_until_ready() (jax) for the parts compiled case, but not for the all compiled and uncompiled cases.
I added this blocking so that the median times depend less on the integrand complexity.
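The blocking scheme can be sketched as follows; median_time and the no-op default are illustrative names, and in the real script sync would be torch.cuda.synchronize or a wrapper around jax's .block_until_ready():

```python
import time

def median_time(fn, sync=lambda: None, repeats=5):
    """Median wall-clock time of fn() with optional device blocking."""
    times = []
    for _ in range(repeats):
        sync()                       # drain queued work before timing
        start = time.perf_counter()
        fn()
        sync()                       # wait until fn's work has finished
        times.append(time.perf_counter() - start)
    times.sort()
    return times[len(times) // 2]

# CPU-only stand-in for the integrand evaluation:
t = median_time(lambda: sum(i * i for i in range(10000)))
```

Without the sync calls, asynchronously queued GPU work from one measurement can leak into the next, which is why the median times depend on integrand complexity less with blocking.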
In the plots only two curves (cases) are affected: JAX parts compiled and PyTorch parts compiled; the other curves are shown for easier comparison. For big N there is no noticeable difference. For small N, PyTorch without blocking is slightly faster. Around N=10^6 the times rise earlier with blocking.
with blocking:
without blocking:
Changes from https://github.com/FHof/torchquad/commit/b1bf122a32afdc39912eb7b10b4bf7baa1d348ee (previous changes in PR #35) are included. Commit which halves the memory usage: https://github.com/FHof/torchquad/commit/b1bf122a32afdc39912eb7b10b4bf7baa1d348ee (this is used for the before-after comparison)
When resetting the results, the memory usage does not reach zero, probably because of the VEGASStratification. This is not a big problem because the tensors in VEGASStratification are reset in each iteration.
With 50 (instead of just 10) iterations the plot shows that before the change it may have kept only the last five deleted tensors instead of all of them (which we suspected), which leads to twice the memory usage. Perhaps the scope of chi2 and res_abs in Python3 covers the whole while body and thus the Python3 interpreter cannot delete these variables before the next five iterations.
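The suspected mechanism can be demonstrated in plain Python; BigTensor and the loop are illustrative stand-ins, not torchquad code. A name bound in a while body stays bound after the iteration ends, so the object it references is only freed when the name is rebound or explicitly deleted:

```python
import weakref

class BigTensor:
    """Stand-in for a large tensor allocated each iteration."""

refs = []
i = 0
while i < 3:
    res_abs = BigTensor()            # rebinding frees the previous object
    refs.append(weakref.ref(res_abs))
    i += 1

# The object from the last iteration is still reachable through res_abs;
# earlier ones were freed as soon as the name was rebound.
alive = [r() is not None for r in refs]
print(alive)  # [False, False, True]
```

With several such variables (e.g. chi2 and res_abs) each holding one stale tensor past the end of an iteration, peak memory can roughly double compared to deleting them eagerly.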
50 iterations (by setting a very large max_iterations), N=100000000 before:
after:
The data may be noisy and inaccurate because it was measured on a multi-user system, I neither enabled nor disabled hyperthreading, I did not fix the clock frequency, and the measurements for each dot were taken in their own short time intervals. The CPU is a 24-core AMD EPYC 7402.
This is an issue to show trace visualisations, benchmarking plots and other images which may be interesting.
runtime_comparison 2021 11 16.zip