PennyLaneAI / pennylane-lightning

The PennyLane-Lightning plugin provides a fast state-vector simulator written in C++ for use with PennyLane
https://docs.pennylane.ai/projects/lightning
Apache License 2.0
78 stars 33 forks source link

Hamiltonian observables: Lightning's OMP parallelziation vs OpenBLAS #455

Open chaeyeunpark opened 1 year ago

chaeyeunpark commented 1 year ago

For computing gradient for a circuit with expectation values of a Hamiltonian object, Lightning implements OpenMP parallelized function that distributes Hamiltonian terms to threads:

https://github.com/PennyLaneAI/pennylane-lightning/blob/58c9e1c66e2d781f7f6547ef19c2a27f6ecc3f03/pennylane_lightning/src/simulator/Observables.hpp#L309-L347

However, the Util::scaleAndAdd function calls OpenBLAS's cblas_caxpy or cblas_zaxpy when compiled with OpenBLAS, which is the case for the PyPI-provided wheels. As these functions are parallelized internally by OpenBLAS, turning off the internal parallelization of OpenBLAS might be necessary to prevent threads oversubscription (or vice versa).

Edit: Indeed, it's subtle. Locally, I found that turning off OpenBLAS parallelism is better performing, but the opposite in Perlmutter.

mlxd commented 1 year ago

Yea, this was always a tough problem. It depends on CPU model, observable type, and even circuit type. We could try updating the OpenMP scheduling here in the section, and seeing if it works nicely.

for defaults though, at least until we better understand the path we need, I'd say favouring large-scale CPUs (HPC systems, AWS Braket servers) would be better as the default.