chaeyeunpark opened this issue 1 year ago
Yea, this was always a tough problem. It depends on the CPU model, the observable type, and even the circuit type. We could try updating the OpenMP scheduling in this section and see if it works nicely.

For defaults though, at least until we better understand the path we need, I'd say favouring large-scale CPUs (HPC systems, AWS Braket servers) would be the better default.
When computing the gradient of a circuit with expectation values of a Hamiltonian object, Lightning uses an OpenMP-parallelized function that distributes the Hamiltonian terms across threads:
https://github.com/PennyLaneAI/pennylane-lightning/blob/58c9e1c66e2d781f7f6547ef19c2a27f6ecc3f03/pennylane_lightning/src/simulator/Observables.hpp#L309-L347
However, the `Util::scaleAndAdd` function calls OpenBLAS' `cblas_caxpy` or `cblas_zaxpy` when compiled with OpenBLAS, which is the case for the PyPI-provided wheels. As these functions are parallelized internally by OpenBLAS, turning off OpenBLAS' internal parallelization (or, conversely, the OpenMP term loop) might be necessary to prevent thread oversubscription.

Edit: Indeed, it's subtle. Locally, I found that turning off OpenBLAS parallelism performs better, but the opposite holds on Perlmutter.
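One way to compare the two configurations on a given machine is to pin the thread counts with environment variables before launching the workload; `OPENBLAS_NUM_THREADS` and `OMP_NUM_THREADS` are the standard knobs for OpenBLAS and OpenMP, respectively (a sketch for local benchmarking, not a proposed default; the thread count 8 is arbitrary):

```shell
# Give all threads to the OpenMP term loop; keep OpenBLAS serial.
export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=8

# Or the reverse: let OpenBLAS parallelize the axpy calls internally.
# export OPENBLAS_NUM_THREADS=8
# export OMP_NUM_THREADS=1
```

Since which split wins appears to be machine-dependent (local vs. Perlmutter above), benchmarking both on the target system seems unavoidable.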