FHof opened 2 years ago
Traces for 1D Boole integration; N: 17850621, precision: float32, integrand: sin_prod (anp.prod(anp.sin(x), axis=1))
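For context, composite Boole's rule on a uniform grid needs N % 4 == 1 points (the N above, 17850621, satisfies this). Here is a minimal NumPy sketch with illustrative names; it is not torchquad's implementation:

```python
import numpy as np

def boole_1d(f, a, b, N):
    """Composite Boole's rule on [a, b] with N points (N % 4 == 1)."""
    assert N % 4 == 1, "Boole's rule needs 4k+1 points"
    x = np.linspace(a, b, N)
    h = (b - a) / (N - 1)
    # Weight pattern 7,32,12,32,14,32,12,32,...,7; interior panel
    # endpoints are shared, so their weight 7 doubles to 14.
    w = np.zeros(N)
    w[0::4] = 14.0
    w[1::4] = 32.0
    w[2::4] = 12.0
    w[3::4] = 32.0
    w[0] = w[-1] = 7.0
    return (2.0 * h / 45.0) * np.dot(w, f(x))

# 1D sin_prod reduces to sin(x); its integral over [0, pi] is 2.
print(boole_1d(np.sin, 0.0, np.pi, 101))  # ≈ 2.0
```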
uncompiled tensorflow:
linspace is apparently executed on the CPU and the result is copied to the GPU.
With tf.debugging.set_log_device_placement(True), tensorflow claimed that all operations were executed on the GPU, which contradicts the profiling and benchmarking measurements.
uncompiled jax: It looks like the gather operation for the slices calculates indices on the CPU and then transfers them to the GPU.
I tried to force execution on the GPU with tensorflow and jax and had no success. With tensorflow's with tf.device('/GPU:0'): context manager, linspace was still executed on the CPU. To test whether the context manager works in my code, I also tried with tf.device('/CPU:0'):, which correctly forced all execution onto the CPU.
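A minimal sketch of the two TensorFlow diagnostics described above (placement logging and the device context manager); the GPU branch assumes a visible GPU, and the linspace call is a stand-in for the integration grid construction:

```python
import tensorflow as tf

# Log where each op is (claimed to be) placed, as mentioned above.
tf.debugging.set_log_device_placement(True)

# Pin ops with the context manager; '/CPU:0' always exists,
# '/GPU:0' only on machines with a visible GPU.
device = '/GPU:0' if tf.config.list_physical_devices('GPU') else '/CPU:0'
with tf.device(device):
    grid = tf.linspace(0.0, 1.0, 5)

print(grid.device)  # full device string of the result tensor
```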
For jax I tried jax.device_put(integration_domain, jax.devices()[0]) so that the integration domain and the tensors calculated from it have a committed device; the indices for slicing were still gathered on the CPU.
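A sketch of that attempt; the domain values are illustrative. jax.devices()[0] is the default backend's first device (the GPU when a GPU backend is active, otherwise the CPU), and arrays computed from a committed array follow it to that device:

```python
import jax
import jax.numpy as jnp

# Commit the integration domain to an explicit device.
integration_domain = jnp.array([[0.0, 1.0]])
integration_domain = jax.device_put(integration_domain, jax.devices()[0])

# A grid derived from the committed array inherits its device.
grid = jnp.linspace(integration_domain[0, 0], integration_domain[0, 1], 5)
```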
I also tried float64 precision; tensorflow and jax, respectively, still executed those operations on the CPU.
With compilation these parts are executed on the GPU and the integration is more than 10 times faster.
Corresponding code and measured data: measurement_data.zip
Plots with points per dimension for 1D and 2D: In 2D there are fewer points per dimension, so the linspace operation in tensorflow is not a bottleneck.
Here's a picture of a Snakeviz visualisation of cProfile output for the current VEGAS integrator on GPU with N=50000, dim=4, the sin_prod integrand and float32 precision. VEGASMap.accumulate_weight, VEGASMap.update_map and VEGASStratification.accumulate_weight seem to be the performance bottlenecks. It looks similar on CPU with different numbers of points, except with a very high number of points (N=500000) on CPU, where VEGASMap.accumulate_weight requires significantly more time than the other two slow functions.
Snakeviz visualisations of cProfile output for VEGAS integration with torch, CUDA, dim=4, N=300000, no CUDA_LAUNCH_BLOCKING, and no gradients. I compared the measurements before and after the changes of #28. I couldn't use the pytorch profiler because, at least before the changes, it significantly slowed down the code and produced very large output files.
VEGAS before code changes:
VEGAS after the changes:
Integrand from: https://vegas.readthedocs.io/en/latest/tutorial.html#basic-integrals Code and measurement data used for the plots: vegas_peak measurements.zip I used different parameters for the implementation:
The measurements indicate that all these VEGAS implementations perform better than (torchquad's) MonteCarlo with this integrand:
Like the accuracies, the required times depend on the implementation-specific configuration. I did not write code to measure times reliably, e.g. by synchronizing CUDA with torch. Nonetheless, here's a plot for the times: vegas_time_vegas_peak
Here's an accuracy comparison with the gaussian_peaks integrand and the same setup as before. raw measurements: tmp_vegas_measurements.csv.zip accuracy plot: vegas_accuracy_gaussian_peaks time plot: vegas_time_gaussian_peaks
The gradient is over the integration domain.
"Parts compiled" excludes the compilation of the backward step. In comparison to JAX and Tensorflow, with PyTorch the backward step is also uncompiled in the "all compiled" case.
Results look similar with the simple sin_prod and complicated gaussian_peaks integrands, and somewhat similar to measurements without gradient calculation.
With compilation, torch and the gaussian_peaks integrand, the first measurement was very slow, so the benchmarking script aborted.
Measurements and code: gradients benchmarked.zip
Code used for profiling: benchmarking_and_profiling.zip, commit e2cd8b0b0357ab8
When I try the profiling on CPU, it hangs unless I disable the profile_memory argument; this doesn't happen with CUDA.
The tensorboard trace does not graphically visualize the Python3 functions but only the low-level aten operations, although I set the with_stack=True argument in torch.profiler.profile; clicking on an operation does show the related file and line number in the Python3 VEGAS implementation files.
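A sketch of a profiler configuration along the lines described above; the workload is a stand-in, not the VEGAS integrator, and profile_memory is only enabled with CUDA because of the CPU hang mentioned earlier:

```python
import torch
from torch.profiler import profile, ProfilerActivity

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

# with_stack=True records Python call stacks for the trace;
# profile_memory tracks tensor allocations.
with profile(activities=activities, with_stack=True,
             profile_memory=torch.cuda.is_available()) as prof:
    x = torch.linspace(0.0, 1.0, 1000)
    y = torch.sin(x).prod()

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```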
The memory units "MB" are inaccurate; they are MiB.
Operator view:
Trace:
Memory usage (all and zoomed-in):
Trace (all and zoomed-in):
Memory usage:
The benchmarking script by default blocks execution before and after the integrand evaluation with torch.cuda.synchronize() (torch) and .block_until_ready() (jax) for the parts compiled case, but not for the all compiled and uncompiled cases.
I added this blocking so that the median times depend less on the integrand complexity.
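The blocking scheme can be sketched as follows; median_time and the no-op default are illustrative names, and in the real script sync would be torch.cuda.synchronize or a wrapper around jax's .block_until_ready():

```python
import time

def median_time(fn, sync=lambda: None, repeats=5):
    """Median wall-clock time of fn() with optional device blocking."""
    times = []
    for _ in range(repeats):
        sync()                       # drain queued work before timing
        start = time.perf_counter()
        fn()
        sync()                       # wait until fn's work has finished
        times.append(time.perf_counter() - start)
    times.sort()
    return times[len(times) // 2]

# CPU-only stand-in for the integrand evaluation:
t = median_time(lambda: sum(i * i for i in range(10000)))
```

Without the sync calls, asynchronously queued GPU work from one measurement can leak into the next, which is why the median times depend on integrand complexity less with blocking.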
In the plots only two curves (cases) are affected: JAX parts compiled and PyTorch parts compiled; the other curves are shown for easier comparison. For big N there is no noticeable difference. For small N, PyTorch without blocking is slightly faster. Around N=10^6 the times rise earlier with blocking.
with blocking:
without blocking:
Changes from https://github.com/FHof/torchquad/commit/b1bf122a32afdc39912eb7b10b4bf7baa1d348ee (previous changes in PR #35) are included. Commit which halves the memory usage: https://github.com/FHof/torchquad/commit/b1bf122a32afdc39912eb7b10b4bf7baa1d348ee (this is used for the before-after comparison)
When resetting the results, the memory usage does not reach zero, probably because of the VEGASStratification. This is not a big problem because the tensors in VEGASStratification are reset in each iteration.
With 50 (instead of just 10) iterations the plot shows that before the change it may have kept only the last five deleted tensors instead of all of them (which we suspected), which leads to twice the memory usage. Perhaps the scope of chi2 and res_abs in Python3 covers the whole while body and thus the Python3 interpreter cannot delete these variables before the next five iterations.
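The suspected mechanism can be demonstrated in plain Python; BigTensor and the loop are illustrative stand-ins, not torchquad code. A name bound in a while body stays bound after the iteration ends, so the object it references is only freed when the name is rebound or explicitly deleted:

```python
import weakref

class BigTensor:
    """Stand-in for a large tensor allocated each iteration."""

refs = []
i = 0
while i < 3:
    res_abs = BigTensor()            # rebinding frees the previous object
    refs.append(weakref.ref(res_abs))
    i += 1

# The object from the last iteration is still reachable through res_abs;
# earlier ones were freed as soon as the name was rebound.
alive = [r() is not None for r in refs]
print(alive)  # [False, False, True]
```

With several such variables (e.g. chi2 and res_abs) each holding one stale tensor past the end of an iteration, peak memory can roughly double compared to deleting them eagerly.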
50 iterations (by setting a very large max_iterations), N=100000000 before:
after:
The data may be noisy and inaccurate because it was measured on a multi-user system, I neither enabled nor disabled hyperthreading, I did not fix the clock frequency, and the measurements for each dot were taken in their own short time intervals. The CPU is a 24-core AMD EPYC 7402.
This is an issue to show trace visualisations, benchmarking plots and other images which may be interesting.
runtime_comparison 2021 11 16.zip