QITI / pySLM2


Installation / GPU acceleration #13

Closed brandondube closed 2 months ago

brandondube commented 7 months ago

This issue is part of my JOSS review at https://github.com/openjournals/joss-reviews/issues/6315

I created an environment to test your package on Windows 11, on a system with an Nvidia 4080 GPU:

$ conda create -n pyslm2rev numpy scipy matplotlib tensorflow pytest
$ conda activate pyslm2rev
$ cd pySLM2
$ pip install -e .

Your tests passed:

pytest . Output
================================================= test session starts =================================================
platform win32 -- Python 3.10.13, pytest-7.4.0, pluggy-1.0.0
rootdir: C:\Users\brand\src\pySLM2
collected 41 items

test\test_analysis.py ..                                                                                         [  4%]
test\test_backend.py s                                                                                           [  7%]
test\test_profile.py ..................                                                                          [ 51%]
test\test_simulation.py ...s.                                                                                    [ 63%]
test\test_slm.py ..ss.....                                                                                       [ 85%]
test\test_util.py ssssss                                                                                         [100%]

================================================== warnings summary ===================================================
..\..\miniconda3\envs\pyslm2rev\lib\site-packages\tensorflow\python\framework\dtypes.py:246
  C:\Users\brand\miniconda3\envs\pyslm2rev\lib\site-packages\tensorflow\python\framework\dtypes.py:246: DeprecationWarning: `np.bool8` is a deprecated alias for `np.bool_`.  (Deprecated NumPy 1.24)
    np.bool8: (False, True),

..\..\miniconda3\envs\pyslm2rev\lib\site-packages\flatbuffers\compat.py:19
  C:\Users\brand\miniconda3\envs\pyslm2rev\lib\site-packages\flatbuffers\compat.py:19: DeprecationWarning: the imp module is deprecated in favour of importlib and slated for removal in Python 3.12; see the module's documentation for alternative uses
    import imp

..\..\miniconda3\envs\pyslm2rev\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:326
  C:\Users\brand\miniconda3\envs\pyslm2rev\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:326: DeprecationWarning: `np.bool8` is a deprecated alias for `np.bool_`.  (Deprecated NumPy 1.24)
    np.bool8: (False, True),

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
===================================== 31 passed, 10 skipped, 3 warnings in 18.83s =====================================

TensorFlow did not recognize my GPU, even though CuPy does. The version of tf that conda picked up was tensorflow-2.10.0-mkl_py310hd99672f_0. I uninstalled tf with conda and installed it through pip; same issue:

import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
>>> Num GPUs Available:  0

The version of tf that pip picked up was tensorflow-intel-2.15.0. To show that it is not an issue with my system, I installed CuPy into the same environment, and it worked just fine:

$ conda install -c conda-forge cupy
$ python
Python 3.10.13 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:24:38) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import cupy as cp
>>> cp.ones(2)
array([1., 1.])

As a Hail Mary, I tried more-or-less the same thing in WSL2:

conda create -n pyslm2rev -c conda-forge numpy scipy matplotlib tensorflow pytest cupy

This grabs tf 2.12-mkl on python 3.9. Same problem.

WSL2 tf failure to see GPU, CuPy working:
$ python
Python 3.9.18 (main, Sep 11 2023, 13:21:18)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))2024-02-17 10:27:46.993035: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
Num GPUs Available:  0
>>> import cupy as cp
>>> cp.ones(2)
array([1., 1.])

In your paper, you wrote:

Under the hood, the package uses TensorFlow for numerical computations. By leveraging TensorFlow, the package harnesses the power of GPUs for faster computation without the need for code modification. This results in a significant speed-up for algorithms that are computationally expensive but benefit from parallelization, such as many hologram generation algorithms relying on iterative Fourier transformations.

While I understand that in theory you can use the same lines of code to work with either a CPU or GPU using tensorflow or other libraries of its ilk (pytorch, say), this has not borne out in my case here, despite putting in considerable effort. I do not have a personal bare metal Linux box with a GPU to test with.

Have you demonstrated the claim in the paper? If so, could you share the result?

ldes89150 commented 7 months ago

Hey @brandondube

To install the GPU-accelerated version, one can do pip install tensorflow[and-cuda]. Though we link to the official TensorFlow installation page, I will add more documentation on TensorFlow installation to make it clearer. Thank you very much for the feedback!
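For example, on Linux or WSL2 something like this should pull in the CUDA libraries via pip and let you check that the GPU is visible (the quotes just keep the shell from globbing the brackets):

$ pip install "tensorflow[and-cuda]"
$ python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"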

Colab provides environments with the necessary dependencies installed (make sure to choose the GPU runtime). Here is an example where I run the tests with the GPU-accelerated tensorflow: https://colab.research.google.com/gist/ldes89150/db64b42aaa0edbef9eb583fd3e36a935/pyslm2.ipynb

brandondube commented 7 months ago

Google's provided installation documentation is here:

https://www.tensorflow.org/install

It dictates doing as I did:

# Current stable release for CPU and GPU
pip install tensorflow

I added tf.debugging.set_log_device_placement(True) to the Colab notebook and ran examples/create_multiple_gaussian_beam.py there. It took ~16 sec, and tf indicated everything ran on the GPU. It takes about two seconds on my machine, on CPU. Understanding that the Colab environment is quite different from my machine, which is bare metal, I disabled the GPU in Colab with tf.config.set_visible_devices([], 'GPU'). The debug printing I added showed everything was being done on the CPU, and the same example took 7 seconds. These timings are repeatable.
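For reference, the instrumentation was roughly this, placed before any tensorflow ops run:

import tensorflow as tf

# log which device (CPU:0 or GPU:0) each op is placed on
tf.debugging.set_log_device_placement(True)

# for the CPU-only comparison, hide the GPU before any ops execute:
# tf.config.set_visible_devices([], 'GPU')

# ... then run examples/create_multiple_gaussian_beam.py as usual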

I am not sure how to confirm the claim that, by using tensorflow, the calculations are faster on a GPU.

In general, you are satisfying the important criteria; your arrays are 32-bit floats, and large (1920 x 1920). At these array sizes you should see a big speedup. I ran !nvidia-smi and !lscpu to probe the Colab environment, and it has an Nvidia T4 GPU (~8 Tflops in fp32). The CPU comes up as model 6, family 79, which is Broadwell. From the core clock and cache, it has to be an E5-2699 v4. Microway lists this CPU as having, let's say, about 0.35 Tflops of compute for all cores. Since the VM has only one core / two threads, it has 1/22nd of that, about 0.016 Tflops. The theoretical performance of the GPU is about 500x higher.
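The back-of-envelope arithmetic, treating the throughput figures above as rough vendor numbers:

t4_fp32_tflops = 8.0                         # Nvidia T4 peak fp32, ~8 Tflops
cpu_tflops_all_cores = 0.35                  # E5-2699 v4, all 22 cores
one_core_tflops = cpu_tflops_all_cores / 22  # the VM exposes one core / two threads
print(one_core_tflops)                       # ~0.016 Tflops
print(t4_fp32_tflops / one_core_tflops)      # ~500x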

I installed CuPy into the same Colab notebook (or tried to; it was already installed): !pip install cupy-cuda12x. I then ran a test:

import numpy as np
import cupy as cp

# old style numpy random api does not have a dtype= kwarg
acpu = np.random.rand(1024,1024).astype(np.float32)
bcpu = np.random.rand(1024,1024).astype(np.float32)

agpu = cp.random.rand(1024,1024).astype(cp.float32)
bgpu = cp.random.rand(1024,1024).astype(cp.float32)

%timeit acpu * np.exp(1j*bcpu)
37.1 ms ± 478 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit agpu * cp.exp(1j*bgpu)
72.6 µs ± 38.5 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

The ratio of these times is 511x, almost exactly in line with the difference in their theoretical floating point performance. So Colab's hardware is very much in line with my expectations.


As-is, I could not check the box for the review.

I can give you some pointers for how to optimize code for both CPU and GPU, as well as a small amount of code (~12 lines) that lets you hot swap your computational backend so you are not tied to numpy or tensorflow or cupy or pytorch or any other; provided they all use the same API, or same-enough; which they all do.

ldes89150 commented 6 months ago

Currently, we run tensorflow either on native Windows or on bare metal Linux in the lab. For native Windows, you will need to install version 2.10 or below: https://www.tensorflow.org/install/pip#windows-native
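If I recall that page correctly, the pin is something along the lines of:

# native Windows: 2.10 was the last release with GPU support, so pin below 2.11
pip install "tensorflow<2.11"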

As for the GPU speed-up, we observed orders of magnitude speed-up on algorithms heavily relying on iterative FFTs. We reported that the IFTA algorithm took ~5 secs with an Nvidia RTX 2070 Super in our 2021 paper.

Here is an example of running the Gerchberg-Saxton algorithm with 200 iterations: https://colab.research.google.com/gist/ldes89150/1f8ca95b29324f82de437737b77982c0/pyslm2.ipynb

It took about 90 secs on CPU and about 2.5 secs on GPU. We are working on adding a section to the documentation to discuss the speed-up one can expect with a GPU and to document some benchmarking results.

I can give you some pointers for how to optimize code for both CPU and GPU, as well as a small amount of code (~12 lines) that lets you hot swap your computational backend so you are not tied to numpy or tensorflow or cupy or pytorch or any other; provided they all use the same API, or same-enough; which they all do.

Yes, please! We are considering making some adjustments to the computational backend for a future major release. One option we are thinking of is using jax and its XLA compiler to further improve the performance.

brandondube commented 6 months ago

It is a bit tricky, then, to assess the paper's statement for accuracy as-is. If your iterative transform algorithms do show a speedup but other code runs slower, it is in the weeds whether the statement is true because of the "such as" clause, or not true in general because some code is several times slower on GPU as-is.

Looking at your benchmark with 200 iterations, I get significant non-determinism in the GPU timing; sometimes about 2.2 seconds, other times as long as 3.8 seconds. In any event, both of these are significantly shorter than the CPU timing, and it is a shared environment.


To have a highly interchangeable backend, you could borrow my BackendShim, which places a tiny bit of indirection in front of "numpy and scipy." Any library that has the same API can be placed behind the curtain, and the code will work with no rewrite. You can/should combine this with a configuration object that controls the numerical precision used.
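The core of it is only a few lines; a stripped-down sketch of the idea (illustrative, not prysm's exact code; the BackendShim and set_backend names here are just for the example) looks like this:

import numpy


class BackendShim:
    """Forward attribute access to whichever array library is active."""
    def __init__(self, src):
        self._srcmodule = src

    def __getattr__(self, key):
        # only called for names not found on the shim itself, so it
        # proxies np.exp, np.fft, etc. to the active backend module
        return getattr(self._srcmodule, key)


np = BackendShim(numpy)  # the rest of the library imports this np


def set_backend(module):
    """Hot swap the backend, e.g. set_backend(cupy)."""
    np._srcmodule = module

With that in place, the library writes np.exp, np.fft.fft2, and so on, and whether that executes on CPU (numpy) or GPU (cupy) is decided once, at configuration time, with no rewrite of the numerical code.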

In our "Exascale" paper we reported comparative benchmarks of the same code on 2x Intel 6248R CPUs vs one Titan XP GPU. Each 6248R has 768 Gflops per Intel, so best case we would expect (12.15/(2*.768)) ~= 8x speedup. We do a bit better, a bit over 10x, because not all of the algorithms use all CPU cores, while all of them use all GPU cores. This is the regime you should be in, for code well written to use either CPU or GPU.

I have experimented with most of the machine learning libraries, and found none of them to compete well with CuPy for optics-y code. Jax can be very picky about all array sizes being known a priori, and both pytorch and tensorflow can get angry if you do certain operations such as array insertion/deletion (which is often used e.g. to zero-pad and crop). Jax's issues have to do with its JIT compiler being critically dependent on pre-allocating all of the memory for the entire program, and for torch and tf it has to do with their gradient backpropagation logic forbidding some operations such as in-place array modification.

If you do try jax, you will find that it runs most of its functions on about 2 CPU cores (regardless of machine size, from 8 to 96 logical CPUs it is always two). This makes it the fastest on CPU, in general. The JIT compiler is mostly useless for optics code, which consists of a sequence of large array operations that cannot be done any more optimally than just writing the code out. The things that make optics code slow (re-generating the same grid over and over, say) are not something the JIT can elide.


For your problems that go slower on GPU than CPU, it is most likely the performance semantics differing between CPU and GPU, and "bad" operations being run on the GPU. The time for the CPU to do anything is more or less one clock cycle, which is <= 1 ns. The time for the GPU to do anything is one communication latency over the PCIe bus, which is about 10 microseconds. Scalar-ish calculations are faster on CPU, but large array calculations are faster on GPU. In somewhat rare pockets, you will see my code do

from prysm.mathops import np  # resolves to numpy, cupy, etc., depending on the configured backend

import numpy as truenp        # always plain CPU numpy

# some operations use truenp

This is to run the few things (like spectral weights, say) that would be slower on GPU, on the CPU.

Due to the MIT license, you can copy BackendShim directly into your library if you like, as long as you copy its LICENSE with it. If I were rewriting it today, I would make config.precision take a numpy-esque dtype directly, instead of an integer number of bits. Newer GPUs can use the tensorfloat32 datatype and be ~10x faster with it than with IEEE float32. When the worse rounding is not detrimental to the answer, it is a great thing to have; as-is, my config type precludes using tf32. You can always sniff the dtype when it is set to find its complex complement, with Numpy's more recent improvements to the dtype system. Likewise for float16, although there are very few optics calculations that can really use that dtype without catastrophic rounding.
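A sketch of that dtype sniffing for the IEEE float types (tensorfloat32 is a hardware compute mode rather than a numpy dtype, so it would need separate handling; the Precision name is just illustrative):

import numpy as np


def complex_complement(real_dtype):
    """float32 -> complex64, float64 -> complex128, float16 -> complex64 (there is no complex32)."""
    return np.result_type(real_dtype, np.complex64)


class Precision:
    """Hold the working dtypes directly, instead of an integer number of bits."""
    def __init__(self, dtype=np.float64):
        self.real = np.dtype(dtype)
        self.complex = complex_complement(self.real)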

brandondube commented 5 months ago

I wanted to let you know that while you have closed this issue, you have not resolved the whole of the problem. The documentation's and paper's performance claims must accurately reflect the software's performance. As-is, broad claims are made about the code running faster on a GPU, while only a minority of the library's scope does.

ldes89150 commented 5 months ago

GitHub automatically closed the issue after merging the PR linked to it. Let me reopen it.

ldes89150 commented 5 months ago

I added tf.debugging.set_log_device_placement(True) to the Colab notebook and ran examples/create_multiple_gaussian_beam.py there. It took ~16 sec, and tf indicated everything ran on the GPU. It takes about two seconds on my machine, on CPU. Understanding that the Colab environment is quite different from my machine, which is bare metal, I disabled the GPU in Colab with tf.config.set_visible_devices([], 'GPU'). The debug printing I added showed everything was being done on the CPU, and the same example took 7 seconds. These timings are repeatable.

In this example, the majority of the time is actually spent in plotting. (To make it faster, I should trim the array before feeding the data into pcolormesh.) The computation for both the CPU and GPU cases takes <1 second. Compared to the I/O overhead, the computational speed-up is not significant for this algorithm. The algorithms that benefit the most are those relying on iterative Fourier transformations; they are also the ones with much higher computational cost. I will note in the documentation which algorithms benefit from the GPU, for better clarity.

As for JIT, I agree that JIT doesn't always help, but I do observe a speed-up with iterative Fourier transformation type algorithms. I think what JIT helps with here is reducing the memory allocation/deallocation for intermediate results. (JIT was an experimental feature in some earlier tensorflow versions; its switch was in the experimental configs.)
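The relevant switches, roughly (the notebook has the full timing harness):

import time

import tensorflow as tf

tf.config.optimizer.set_jit(True)        # enable XLA JIT ("Enable JIT" case)
# tf.config.run_functions_eagerly(True)  # disable tracing ("No Tracing" case)

start = time.time()
# ... run the 200-iteration Gerchberg-Saxton calculation here ...
print("time used:", time.time() - start)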

Here I put some code to show the performance difference with/without JIT (set_jit) and with/without tracing (run_functions_eagerly):

https://colab.research.google.com/gist/ldes89150/73bc961960c79a730660e966a035c2aa/pyslm2.ipynb

This is the result running on a T4 instance (GPU):

No JIT With Tracing (Default)
time used: 3.390634536743164
time used: 3.3433678150177
time used: 3.1825449466705322
time used: 3.186124324798584
time used: 3.351879596710205
Enable JIT
time used: 2.9088780879974365
time used: 2.7242681980133057
time used: 2.7510483264923096
time used: 2.86055850982666
time used: 2.881774425506592
No JIT No Tracing (Without Optimization)
time used: 3.533553123474121
time used: 3.539891481399536
time used: 3.7536377906799316
time used: 3.537489652633667
time used: 3.535156726837158

brandondube commented 5 months ago

It is difficult for me to accept at face value your claim that the computation takes less than one second, when running the same code on CPU and GPU takes 7 and 16 seconds respectively, and those timings are repeatable.

In any event, the claim in the paper and documentation is that

pySLM2 primarily relies on tensorflow for most of its numerical computations. For machines with compatible hardware, tensorflow can seamlessly utilize GPU acceleration to enhance performance, provided it is installed correctly.

Some algorithms running twice as slowly on the GPU does not match this claim.

ldes89150 commented 5 months ago

Here is the comparison. I removed the matplotlib-related code from examples/create_multiple_gaussian_beam.py and ran the code twice. Due to tensorflow's tracing, the first run time is mostly dominated by the tracing (a few seconds), so it makes more sense to compare the second run time.

CPU: https://colab.research.google.com/gist/ldes89150/b4f966ca0da7d4f368ed265fb17fc495/pyslm2.ipynb
GPU: https://colab.research.google.com/gist/ldes89150/2926b1ff232a4358e6b6c32d2bf62d7e/pyslm2.ipynb

CPU takes ~0.8 seconds and GPU takes ~0.3 seconds for the hologram calculation.

ldes89150 commented 5 months ago

To address your concern, I will update the claims in both the documentation and the paper for better clarity, to emphasize that the GPU acceleration is for computationally heavy tasks.

brandondube commented 2 months ago

While I still think your depiction of the achieved GPU acceleration with tensorflow is overly optimistic, and that a significantly greater acceleration is possible with a bit more effort, I am satisfied that you have made a significant effort to improve the clarity of communication in this area.