[NEW FEATURE] GPU Versions of Optimization Algorithms

chaithyagr commented 5 years ago

Describe the solution you'd like We would want to implement a GPU version of the algorithms, which would carry out complete descent on GPU without any switch to and fro from GPU to CPU.

Describe alternatives you've considered Something similar to cdfit in https://github.com/rapidsai/cuml/blob/branch-0.10/cpp/src/solver/cd.h This is a simple stochastic Gradient descent.

My ultimate goal is that finally, we should have Linear operators, Gradient Operators and Fourier Operators working on GPU memory efficiently, and Modopt can then just work completely on GPU giving a large speedup.

Are you planning to submit a Pull Request?

[X] Yes, but as a low priority.
[] No

sfarrens commented 5 years ago

@chaithyagr Sounds good to me. I would be happy to work on PR with you for this.

chaithyagr commented 5 years ago

Just noting options at this time. We could go with pyCUDA, pyOpenCL (as it works on any GPU), NUMBA.

However, one recent project that caught my eye which is very helpful and will help us here is cuPY https://cupy.chainer.org/

CuPY is basically a numpy like a library with all the compute occurring in the GPU than on CPU

The reason I feel this is the best because:

1) Easy movement from GPU to CPU We could basically have the same codes with different data type based on the parameter, like here:

import numpy as np
import cupy as cp
if use_gpu:
    a = cp.array(1,2,3)
else:
    a = np.array(1,2,3)

[Same computations, hopefully, cuPY supports for most important functions that we use]

2) Ease of code maintenance and management: From what I have seen, other methods would essentially create a separate code base for GPU implementations which would make maintenance double the time.

3) Support for custom kernels: One important aspect of cupy is that it provides the option to define a custom kernel which we could define for some special functions, or some speeded up versions.

But downfalls of this method primarily is that we would be limited by cuPY and its limitations (not all numpy functions are implemented yet)

zaccharieramzi commented 5 years ago

How about pytorch?

chaithyagr commented 5 years ago

Yup that also is good, just that torch has a different API structure, so we could not ideally do same as numpy stuff. I'll take a deeper look into torch and try to get an idea of pros and cons just like with cupy.

Also and edit to my earlier post

if use_gpu:
     import cupy as np
else
     import numpy as np

Would this work? I mean we surely will have corner cases which we need to handle.

grlee77 commented 5 years ago

I would second the recommendation for CuPy. It closely mirrors the numpy API, albeit with some holes in implementation here and there. There is also a smaller subset of SciPy's API available via the cupy.cupyx.scipy module. I think this makes it much easier to write code for NumPy first and then have it be fairly trivial to add CuPy support.

I don't have as much experience with pytorch, but I seem to recall that it provides GPU arrays, but not in a way that is as closely matched to the NumPy API.

chaithyagr commented 5 years ago

@sfarrens and I had a short discussion on this. Here is a brief summary:

We decided to move towards implementations with CuPy due to all the above advantages.

Implementations:

1) We divided the work into phases:

a) **Phase 1 :** 
  * Implement GPU version for Forward Backward optimization with Identity linear operator. Also, develop tests to test the speedup seen by using different profilers. 
  * This phase involves setting up the flow and ensuring a complete end to end implementation of one algorithm 
  * Further details in Next Steps below.

b) Phase 2a :

Move to generalize the above implementation for all other algorithms.

c) Phase 2b :
- This can happen in parallel to Phase 2a, although based on priorities and resources, we could parallelize / sequence in the right order
- Explore to efficient GPU implementations of different Operators (Linear: Wavelets, Fourier: NFFT, FFT etc)

Logistics:

1) We have started a branch gpu to hold this implementation. 2) Started a GPU Dev project with the first milestone as 31st October for Phase 1

Next Steps:

1) We plan to hold on to this issue to track what features are currently implemented in GPU branch. This issue would have the up to date status of this project. 2) For now, we will instance 2 issues for Phase 1 for:

Implementing Forward Backward on GPU based on a metaclass that switches between GPU or CPU implementations based on a user input : @chaithyagr will handle this
For working towards obtaining base test, profiler and setting up a right travis to run on GPU. @sfarrens will be handling this.

chaithyagr commented 4 years ago

@zaccharieramzi I think we can continue our conversations here as they will be documented.

We were still discussing on which is best way to move from CPU to GPU versions of codes. @zaccharieramzi made a right point that moving to tf or torch would ensure that we need to validate the codes only on CPU on travis while the GPU versions are guranteed to run. My opinion is that the amount of codes that needs to be written to do this is large, but I will give this a try, and mostly will use torch to implement basics that I implemented in #76 . @zaccharieramzi any comments?

zaccharieramzi commented 4 years ago

Well the first thing is I think we need to understand which parts of the code exactly need to be modified. To me it's not all clear.

By that I mean we need to spot:

initialization of numpy variables (be it from numpy or from data). Normally this should be easy to migrate to either cupy, torch, tf
operations on numpy variables:
- assignments on slices (tricky on tf, don't know for cupy or torch)
- classical vector operations (reduce operations like sum, norm, or vectorized operations like exp, log etc)
- comparison of results to scalars (in torch you need to put the element back on cpu if in gpu, in tf you need to get the value in all cases, don't know for cupy). It might mean sometimes having scalar values as tensors in order to avoid gpu-cpu overhead.
- complex scipy functions: e.g. svd (might be the only one)
- others?

Then, I don't understand the order of the implementation. Basically if we first modify the algorithms code to have them working on GPU, there will be a lot of cpu-gpu overhead because we have to take the results of the operators which will be on CPU and put them on GPU. And the most costly operations are the operators, not the ones run by the algorithms if I am correct (think NFFT, FFT, Wavelets). You surely have thought of this before, but it's not clear in the conversation.

Finally, if we chose to go for a framework that allows transparent change from CPU to GPU for the user (torch or tf), I don't know how to chose.

grlee77 commented 4 years ago

assignments on slices (tricky on tf, don't know for cupy or torch)

Most advanced indexing works in CuPy, although there are a few corner cases that do not behave exactly as in NumPy. This is documented at: https://docs-cupy.chainer.org/en/stable/reference/difference.html

comparison of results to scalars (in torch you need to put the element back on cpu if in gpu, in tf you need to get the value in all cases, don't know for cupy). It might mean sometimes having scalar values as tensors in order to avoid gpu-cpu overhead.

This is also covered in the link above for CuPy. Basically things like x.mean() that would return a scalar instead return a 0-dimensional cupy.ndarray to avoid device/host transfer overhead. This can be transferred back to the hose by the user if needed.

For example cupy.asarray(3) == 3 would return (a boolean GPU array with one element equal to True instead of just True). Such simple scalar operations are relatively expensive compared to NumPy, but things like reduction, FFT, etc are MUCH faster on the GPU as long as the number of array elements is large enough.

zaccharieramzi commented 4 years ago

although there are a few corner cases that do not behave exactly as in NumPy

Yes in particular out-of-bounds indices which strongly suggests that we will need 2 different CIs one for CPU and one for GPU, which I feel is a huge burden.

Your second point is definitely in favor of cupy.

chaithyagr commented 4 years ago

I think at this point we just need to debate about tradeoff of having 2 CIs vs amount of extra codes / code changes needed. .......................................... It looks like my earlier replies from my phone didnt make through for some reason, but @grlee77 did answer some parts.

Then, I don't understand the order of the implementation. Basically if we first modify the algorithms code to have them working on GPU, there will be a lot of cpu-gpu overhead because we have to take the results of the operators which will be on CPU and put them on GPU. And the most costly operations are the operators, not the ones run by the algorithms if I am correct (think NFFT, FFT, Wavelets).

I agree with you that the operators surely are the costly operations. In my opinion, both algorithm and operator code support for GPU is equally important. The reason for this is that the changes to Modopt is more universal and once done can be used directly at other places (I already have some other codes that I could move to GPU and get a really good speed up). In my opinion, we need to equally spend efforts on moving Wavelet operations to occur on GPU and at the same time have Modopt to run with GPU.

operations on numpy variables:

Here is all the supported functions from Numpy and Scipy on GPU : https://docs-cupy.chainer.org/en/stable/reference/comparison.html

.................................. Further, I wanted to point out an extra advantage of CuPy, that it has good interoperability: https://docs-cupy.chainer.org/en/stable/reference/interoperability.html . We can pretty much import data from tensors and carry out computations without any overheads of moving data back to CPU. This way, even if we have any torch implementations of some operators, we can still work with it if the core Module runs on CuPY. Although, I think a similar advantage might exist for tf or torch which I may not be aware about.

zaccharieramzi commented 4 years ago

In my opinion, we need to equally spend efforts on moving Wavelet operations to occur on GPU and at the same time have Modopt to run with GPU.

Of course, but my main point is about the order in which to tackle these problems.

chaithyagr commented 3 years ago

@sfarrens same goes here.. We can open separate issues if needed... I just realized I can close it. Feel free to reopen it if you disagree with me

CEA-COSMIC / ModOpt