PennyLaneAI / pennylane-lightning-gpu

GPU enabled Lightning simulator for accelerated circuit simulation. See https://github.com/PennyLaneAI/pennylane-lightning for all future development of this project.
https://docs.pennylane.ai/projects/lightning/en/stable/
Apache License 2.0

cuQuantum multi-node example #91

Open osbama opened 1 year ago

osbama commented 1 year ago

Issue description

Clarification, and preferably an example, regarding whether lightning.gpu has cuQuantum multi-node capability. Can I access multiple GPUs spanning multiple nodes on an HPC system? How?

Additional information

cuQuantum appears to have a multi-node implementation, and the scaling looks quite good. Can I use PennyLane like this?

https://developer.nvidia.com/blog/best-in-class-quantum-circuit-simulation-at-scale-with-nvidia-cuquantum-appliance/

mlxd commented 1 year ago

Hi @osbama. We do not currently support cuStateVec's multi-node capabilities for a single state-vector computation.

lightning.gpu supports batched gradient evaluation for multiple observables across the GPUs on a given node (see the "Parallel adjoint differentiation support" section of the docs).

We have run distributed multi-GPU computations as part of circuit-cutting workloads, where a given high-qubit state-vector problem is partitioned and run over many GPUs (in this case, 128). See our paper for more information on how we did this for QAOA optimization problems, or this talk for how it was run on NERSC's Perlmutter supercomputer.

For a single distributed state-vector, we do plan to add this support natively to PennyLane in future quarters. In addition, for hybrid classical-quantum distributed work, we have a demonstration that ran on Amazon Braket.

If you have a specific workload in mind that is not a single distributed state-vector, we may be able to offer some suggestions on how to approach it with the existing tooling.

osbama commented 1 year ago

Thank you very much for the detailed answer. The references will be extremely useful.

We are working towards implementing an extended-Hubbard-like correction to standard density functional theory kernels using QPUs and machine learning (at the moment we are exploring classical shadows).

A distributed state-vector would be great in the future (especially if it is CV), but there are well-known strategies for distributing aspects of this task (e.g. k-point parallelization). I would very much appreciate some examples of how to pass (frozen) segments of the full state-vector, or partially contracted observables, to other instances of PennyLane efficiently in an HPC environment, just to estimate how much difference a more "detailed" model Hamiltonian running on a QPU would make to the overall DFT calculation.

At the moment I am using mpi4py; however, I am not an expert in optimizing communications or Python in HPC environments. If PennyLane or an existing module already has an efficient implementation of this, it would save us considerable resources.

mlxd commented 1 year ago

Hi @osbama. We have had good success using both Ray and Dask Distributed/Dask-CUDA for these task-based workloads. For example, we used circuit cutting (a tensor network + quantum circuit hybrid) with parameter-shift gradients to partition a large problem space into smaller qubit chunks that fit on an A100 GPU, and ran these chunks concurrently --- in our case across 128 GPUs on NERSC's Perlmutter supercomputer.

The paper is here and the example code is at https://github.com/XanaduAI/randomized-measurements-circuit-cutting. This may not be a perfect match for your intentions, but it should help define what needs to be distributed.

It is possible to use mpi4py for this, with a little less overhead but a little more code. However, letting Ray/Dask handle the runtime and distribution of the components allowed us to concentrate on the problem itself, without much concern for the environment it ran in.
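The submit-and-gather pattern that Ray (`ray.remote`/`ray.get`) and Dask (`client.submit`/`client.gather`) provide mirrors the standard library's `concurrent.futures` API, so the task structure can be sketched without either framework installed. In this stand-in, `run_fragment` is a hypothetical placeholder for executing one cut sub-circuit on one GPU:

```python
from concurrent.futures import ThreadPoolExecutor

def run_fragment(params):
    # Placeholder for one fragment evaluation; in a real circuit-cutting
    # workload this would build and run a PennyLane QNode on one GPU
    # and return its expectation value.
    return sum(p * p for p in params)

# Hypothetical per-fragment parameter chunks.
fragments = [[0.1 * i, 0.2 * i] for i in range(8)]

# Run fragments concurrently; with Ray or Dask the pool would span
# many GPUs across nodes instead of local threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_fragment, fragments))
```

Swapping the executor for a Dask `Client` (or Ray tasks) keeps the same structure while letting the scheduler place each fragment on a free GPU worker.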