NVIDIA / cuQuantum

Home for cuQuantum Python & NVIDIA cuQuantum SDK C++ samples
https://docs.nvidia.com/cuda/cuquantum/
BSD 3-Clause "New" or "Revised" License

PyTorch and cuQuantum #60

Closed wcqc closed 11 months ago

wcqc commented 1 year ago

Hi,

This lib looks exciting!

Just wondering if it would be possible to integrate cuQuantum into a standard PyTorch program to build hybrid classical and quantum models which are able to use both CUDA and cuQuantum accelerations in one program seamlessly, e.g., using the Python cuStateVec API?

Looking at the Python API samples, it seems they demonstrate mostly standalone usage. Any pointers or comments would be much appreciated.

Thanks!

leofang commented 11 months ago

Sorry for the late reply @wcqc; as you might have noticed, we've been busy with the 23.06 release that just came out today.

Given the flexibility of PyTorch and cuQuantum Python APIs, it is certainly possible to make them interoperate (we have internal code that does this), although care should be taken.

The low-level Python bindings for cuStateVec often require raw CPU/GPU pointers and the CUDA stream pointer. The former is accessible through PyTorch's Tensor.data_ptr(), and the latter through torch.cuda.current_stream().cuda_stream. So, by following our rich Python sample set and replacing the CuPy usage with PyTorch counterparts (assuming you don't use CuPy, don't care about CuPy-PyTorch interoperability, and just want PyTorch tensors in your code), you should get off the ground pretty easily. Though I'd note that CuPy is one of our required dependencies 😄
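To illustrate what that replacement might look like, here is a minimal sketch that applies a Pauli-X gate to a PyTorch-owned state vector through the low-level cuStateVec bindings. It mirrors the structure of the apply_matrix sample in this repo with the CuPy arrays swapped for PyTorch tensors; binding names such as apply_matrix_get_workspace_size may differ across cuQuantum versions, so treat this as illustrative rather than definitive:

```python
import numpy as np
import torch
from cuquantum import custatevec as cusv
from cuquantum import cudaDataType, ComputeType

n_index_bits = 3
targets = np.asarray([2], dtype=np.int32)

# The state vector lives in a PyTorch GPU tensor instead of a CuPy array.
sv = torch.zeros(2**n_index_bits, dtype=torch.complex64, device="cuda")
sv[0] = 1.0  # |000>

# Gate matrix (Pauli-X, row-major) on the host; a device pointer works too.
matrix = np.asarray([0, 1, 1, 0], dtype=np.complex64)

handle = cusv.create()
# Make cuStateVec submit work on PyTorch's current CUDA stream.
cusv.set_stream(handle, torch.cuda.current_stream().cuda_stream)

workspace_size = cusv.apply_matrix_get_workspace_size(
    handle, cudaDataType.CUDA_C_32F, n_index_bits,
    matrix.ctypes.data, cudaDataType.CUDA_C_32F, cusv.MatrixLayout.ROW,
    0, len(targets), 0, ComputeType.COMPUTE_32F)
# Scratch memory is also just a PyTorch tensor.
workspace = torch.empty(max(workspace_size, 1), dtype=torch.uint8, device="cuda")

cusv.apply_matrix(
    handle, sv.data_ptr(), cudaDataType.CUDA_C_32F, n_index_bits,
    matrix.ctypes.data, cudaDataType.CUDA_C_32F, cusv.MatrixLayout.ROW,
    0, targets.ctypes.data, len(targets), 0, 0, 0,
    ComputeType.COMPUTE_32F, workspace.data_ptr(), workspace_size)

cusv.destroy(handle)
print(sv)  # amplitude moved from index 0 (|000>) to index 4 (|100>)
```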

Next, you should be careful to do this in a way that's meaningful to PyTorch. Specifically, if you use PyTorch for writing QML code, you almost certainly need PyTorch's compute graph and autograd machinery to kick in so you can use it with a PyTorch optimizer. You then need to wrap your custom code that involves cuStateVec in, say, torch.autograd.Function (see this PyTorch page for details). You also need to make sure the pointers passed to cuStateVec APIs are valid and take care of the data lifetime (as we noted here). This is no different from C++- or CuPy-based code.
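To make the wrapping pattern concrete, here is a self-contained sketch of a torch.autograd.Function whose forward computes the single-qubit expectation value ⟨Z⟩ for Rx(θ)|0⟩ and whose backward implements the parameter-shift rule. The _expectation helper is a hypothetical stand-in built from plain tensor ops on the CPU; in a real integration its body would issue cuStateVec calls on a GPU-resident state vector as in the earlier sketch:

```python
import math
import torch

def _expectation(theta: torch.Tensor) -> torch.Tensor:
    """<psi|Z|psi> for |psi> = Rx(theta)|0>. Hypothetical stand-in:
    a real version would build |psi> via cuStateVec calls instead."""
    c, s = torch.cos(theta / 2), torch.sin(theta / 2)
    psi = torch.stack([c.to(torch.complex64), -1j * s.to(torch.complex64)])
    z_diag = torch.tensor([1.0, -1.0])
    return (psi.abs() ** 2 * z_diag).sum()

class QuantumExpectation(torch.autograd.Function):
    @staticmethod
    def forward(ctx, theta):
        ctx.save_for_backward(theta)
        return _expectation(theta)  # autograd does not trace inside forward()

    @staticmethod
    def backward(ctx, grad_output):
        # Parameter-shift rule: dE/dtheta = (E(theta + pi/2) - E(theta - pi/2)) / 2
        (theta,) = ctx.saved_tensors
        shift = math.pi / 2
        grad = (_expectation(theta + shift) - _expectation(theta - shift)) / 2
        return grad_output * grad

theta = torch.tensor(0.3, requires_grad=True)
energy = QuantumExpectation.apply(theta)
energy.backward()
print(energy.item(), theta.grad.item())  # ~cos(0.3) and ~-sin(0.3)
```

With this in place, theta can be handed to any standard PyTorch optimizer even though the forward pass is opaque to autograd.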

Let us know if you have any specific questions!

wcqc commented 11 months ago

@leofang Thanks for the very detailed comments and pointers. May I first confirm two high-level questions:

  1. From your description above, can we take cuStateVec as is (with some additional housekeeping) and integrate it into PyTorch such that PyTorch's dynamic computation graph, autograd, and everything else still work as expected?

  2. From the second part of your description above, if we want to do 1. (i.e., PyTorch autograd with a dynamic computation graph), is CuPy/C++ necessarily needed, such that simple replacements with PyTorch tensors won't work?

From reading this related issue: https://github.com/NVIDIA/cuQuantum/issues/40, it seems there are still unresolved caveats to making 1. above work? But from your description above this is not the case?

Another question relates to the NCCL backend. As I understand it, NCCL currently does not support ops involving complex data types (e.g., complex64); how is NCCL related to cuQuantum, and to using cuQuantum with PyTorch autograd?

leofang commented 11 months ago
  1. Yes
  2. No. What I meant is whatever you have to do to make it right with C/C++ or CuPy, you can also do it with PyTorch tensors.

From the reading of this related issue: #40, it seems there are still unresolved caveats for making 1. above work? But from your descriptions above this is not the case?

It's not the case; there's no blocker now. #40 is different in that it's for cuTensorNet, not cuStateVec, but the idea is the same: you need to follow the PyTorch autograd documentation (that I linked above) and write your own forward/backward implementations using torch.nn.Module, torch.autograd.Function, etc., in which you call our low-level APIs.

The extra work we plan for cuTensorNet (#40) is to do all of this ourselves in a (near-)future release, so that PyTorch users can have a contract() API that supports backprop out of the box, without writing their own autograd. There's no such plan for cuStateVec yet, since for now we don't have pythonic APIs for cuStateVec.

Another question is related to the NCCL backend. As I understand NCCL currently does not support ops involving Complex data types (e.g., complex64), how is NCCL related to cuQuantum and using cuQuantum with PyTorch autograd?

It's a very broad question and I'm not sure it's in scope for this issue 😅 You can do manual parallelism, as you would with MPI or NCCL, for APIs that are designed to work with communication frameworks. For global sums or p2p with NCCL, you just need to treat a complex array as if it were real but of twice the length. (Putting my CuPy hat back on, this is what we do.) We do have sample code in this repo, for both cuTensorNet and cuStateVec, that uses MPI and can be easily tweaked to use NCCL, but I assume this is not what you're after?
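For reference, here is a minimal sketch of the complex-as-real trick with PyTorch's NCCL backend. It assumes torch.distributed has already been initialized with the "nccl" backend and that each rank is pinned to one GPU; torch.view_as_real() exposes the complex buffer as an interleaved real view of twice the length, which NCCL can reduce:

```python
import torch
import torch.distributed as dist

# Assumption: dist.init_process_group("nccl") has already been called
# and this process owns one GPU.
sv = torch.randn(8, dtype=torch.complex64, device="cuda")

# NCCL has no complex reductions, so view the buffer as (re, im) pairs,
# i.e., a real tensor with twice as many elements, and reduce that.
dist.all_reduce(torch.view_as_real(sv), op=dist.ReduceOp.SUM)
# sv now holds the element-wise sum across all ranks.
```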

wcqc commented 11 months ago

If I'm understanding everything correctly, everything could be done with just PyTorch tensors and the cuQuantum low-level API, without resorting to CuPy or C/C++. However, it was also mentioned above that CuPy is a required dependency of cuQuantum, which is slightly confusing: when should I be using CuPy instead of plain PyTorch tensors? Are there any rules of thumb? Am I losing something (maybe not immediately, but later on) by not using CuPy?

leofang commented 11 months ago

Hi @wcqc, at least for the low-level bindings I don't think it matters which array library you pick. All we need is a library providing containers that are GPU-accessible. It can be CuPy, PyTorch, or even just CUDA Python if you prefer to do all the low-level management yourself. Just like when coding in C/C++ and calling our C APIs, it does not matter whether you manage raw GPU memory explicitly with cudaMalloc/cudaMallocAsync, or allocate it using thrust::device_vector or even a custom container. All of them will get the job done, assuming you code it right.
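As a sketch of that interchangeability (not from the original thread, and assuming the cupy and cuda-python packages are installed), here are three ways to produce the raw device pointer the low-level bindings consume:

```python
import cupy as cp
import torch
from cuda import cudart  # CUDA Python

n_bytes = 8 * 8  # eight complex64 values

# 1. CuPy array
buf_cp = cp.zeros(8, dtype=cp.complex64)
ptr_cp = buf_cp.data.ptr

# 2. PyTorch tensor
buf_pt = torch.zeros(8, dtype=torch.complex64, device="cuda")
ptr_pt = buf_pt.data_ptr()

# 3. Manual allocation via CUDA Python (you own the lifetime)
err, ptr_raw = cudart.cudaMalloc(n_bytes)
# ... pass any of ptr_cp / ptr_pt / ptr_raw to the bindings ...
(err,) = cudart.cudaFree(ptr_raw)
```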

CuPy was chosen as a core dependency because it's lightweight (wheel < 100 MB), much easier to install (pip/conda installable), and pythonic (NumPy-like). It also exposes its own CUDA bindings plus a large part of the CUDA programming model. But for your use case I doubt the choice matters.

leofang commented 11 months ago

Since the primary question was answered, and this issue is not really a bug report but a Q&A, let me move it to our discussion board and mark it as answered. Do feel free to reach out if you have any specific questions!