chaeyeunpark opened this issue 1 year ago
Yea, this was always a tough problem. It depends on the CPU model, the observable type, and even the circuit type. We could try updating the OpenMP scheduling in this section and see if it works nicely.

For defaults though, at least until we better understand the path we need, I'd say favouring large-scale CPUs (HPC systems, AWS Braket servers) would be the better default.
When computing the gradient of a circuit with expectation values of a Hamiltonian object, Lightning uses an OpenMP-parallelized function that distributes the Hamiltonian terms across threads:
https://github.com/PennyLaneAI/pennylane-lightning/blob/58c9e1c66e2d781f7f6547ef19c2a27f6ecc3f03/pennylane_lightning/src/simulator/Observables.hpp#L309-L347
However, the `Util::scaleAndAdd` function calls OpenBLAS' `cblas_caxpy` or `cblas_zaxpy` when compiled with OpenBLAS, which is the case for the PyPI-provided wheels. As these functions are parallelized internally by OpenBLAS, turning off OpenBLAS' internal parallelization (or, conversely, the OpenMP term loop) might be necessary to prevent thread oversubscription.

Edit: Indeed, it's subtle. Locally, I found that turning off OpenBLAS parallelism performs better, but the opposite holds on Perlmutter.
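One way to compare the two configurations on a given machine is to pin the thread counts with environment variables before launching the workload; `OPENBLAS_NUM_THREADS` and `OMP_NUM_THREADS` are the standard knobs for OpenBLAS and OpenMP, respectively (a sketch for local benchmarking, not a proposed default; the thread count 8 is arbitrary):

```shell
# Give all threads to the OpenMP term loop; keep OpenBLAS serial.
export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=8

# Or the reverse: let OpenBLAS parallelize the axpy calls internally.
# export OPENBLAS_NUM_THREADS=8
# export OMP_NUM_THREADS=1
```

Since which split wins appears to be machine-dependent (local vs. Perlmutter above), benchmarking both on the target system seems unavoidable.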