PennyLaneAI / pennylane-lightning

The PennyLane-Lightning plugin provides a fast state-vector simulator written in C++ for use with PennyLane
https://docs.pennylane.ai/projects/lightning
Apache License 2.0

Chunk Hamiltonian, PauliSentence, LinearCombination [sc-65680] #873

Closed vincentmr closed 2 months ago

vincentmr commented 2 months ago



Context: Parallelizing over observables can accelerate the backward pass of adjoint Jacobian calculations. This PR revisits our implementation for L-Qubit and L-GPU, the two devices that support it. Certain observables, such as Hamiltonian, PauliSentence, and LinearCombination, can be split into many observables, enabling the cost of the expectation value computation to be distributed. This strategy is initiated by the serializer, which partitions the observables if split_obs is not False. The serializer performs a complete partitioning, meaning a 1000-PauliWord PauliSentence is partitioned into 1000 individual PauliWords. We note in passing that L-Qubit does not split observables at all, since it does not pass a split_obs value to _process_jacobian_tape. Full partitioning is wasteful because we end up in either of two situations:

We explore chunking instead of full partitioning for LinearCombination-like objects, meaning a 1000-PauliWord PauliSentence is partitioned into four 250-PauliWord PauliSentences if we parallelize over 4 processes.
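As an illustrative sketch (plain Python, not the actual serializer code; `chunk_terms` is a hypothetical helper), the difference between full partitioning and chunking looks like this:

```python
def chunk_terms(terms, num_chunks):
    """Split `terms` into `num_chunks` contiguous, near-equal chunks."""
    base, rem = divmod(len(terms), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < rem else 0)  # spread the remainder over the first chunks
        chunks.append(terms[start:start + size])
        start += size
    return chunks

terms = [f"P{i}" for i in range(1000)]   # stand-ins for PauliWords
full_partition = [[t] for t in terms]    # old behaviour: 1000 single-term observables
chunks = chunk_terms(terms, 4)           # new behaviour: 4 observables of 250 terms each
assert [len(c) for c in chunks] == [250, 250, 250, 250]
```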

Description of the Change: Modify the serializer to chunk LinearCombination-like objects if self.split_obs is truthy. Correctly route _batch_obs such that L-Qubit splits observables. Enhance/adapt tests.

Analysis: Lightning-Qubit

applyObservable is a bottleneck for somewhat large linear combinations (say 100s or 1000s of terms). Chunking isn't helpful for a circuit like

    @qml.qnode(dev, diff_method="adjoint")
    def c(weights):
        qml.templates.AllSinglesDoubles(weights, wires, hf_state, singles, doubles)
        return qml.expval(ham)

because L-Qubit's applyObservable method is parallelized over terms for a single Hamiltonian observable. Chunking in this case is counter-productive because it requires extra state vectors, extra backward passes, etc.

However, for a circuit like

    @qml.qnode(dev, diff_method="adjoint")
    def c(weights):
        qml.templates.AllSinglesDoubles(weights, wires, hf_state, singles, doubles)
        return np.array([qml.expval(ham), qml.expval(qml.PauliZ(0))])

applyObservable is parallelized over observables, which scales only up to 2 threads and suffers from poor load balance. In this case, it is better to split the observable, which is what the current changes do.
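A toy cost model (an assumption for illustration, not Lightning's actual scheduler) makes the load-balance argument concrete: with one 1000-term Hamiltonian and one PauliZ, observable-level parallelism leaves one worker doing essentially all the work, while chunking the Hamiltonian balances the load:

```python
import heapq

def makespan(task_costs, num_workers):
    """Greedy longest-task-first schedule; returns the busiest worker's total load."""
    loads = [0] * num_workers
    heapq.heapify(loads)
    for cost in sorted(task_costs, reverse=True):
        lightest = heapq.heappop(loads)   # assign each task to the least-loaded worker
        heapq.heappush(loads, lightest + cost)
    return max(loads)

# Cost model: one "unit" per Hamiltonian term.
no_split = makespan([1000, 1], num_workers=4)            # one worker bears ~all the cost
chunked = makespan([250, 250, 250, 250, 1], num_workers=4)  # near-ideal balance
assert no_split == 1000 and chunked == 251
```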

| mol | master-serial | master-batched | chunk-serial | chunk-batched |
|-----|---------------|----------------|--------------|---------------|
| CH4 | 1.793e+01 | 1.330e+01 | 1.819e+01 | 8.040e+00 |
| Li2 | 5.333e+01 | 3.354e+01 | 5.289e+01 | 1.839e+01 |
| CO  | 9.817e+01 | 5.945e+01 | 9.619e+01 | 2.559e+01 |
| H10 | 1.220e+02 | 7.317e+01 | 1.182e+02 | 3.305e+01 |

So for this circuit the current PR yields speed-ups ranging from 1.5x to over 2x by using obs-batching + chunking (compared with the previous obs-batching).
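The quoted speed-ups can be checked directly against the table (master-batched time divided by chunk-batched time):

```python
# Timings (seconds) copied from the benchmark table above.
master_batched = {"CH4": 13.30, "Li2": 33.54, "CO": 59.45, "H10": 73.17}
chunk_batched = {"CH4": 8.040, "Li2": 18.39, "CO": 25.59, "H10": 33.05}

speedups = {m: master_batched[m] / chunk_batched[m] for m in master_batched}
# CH4 ~1.65x, Li2 ~1.82x, CO ~2.32x, H10 ~2.21x
assert min(speedups.values()) > 1.5 and max(speedups.values()) > 2.0
```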

Lightning-GPU

Lightning-GPU splits the observables as soon as batch_obs is true. The current code splits a Hamiltonian into all its individual terms, which is quite inefficient and induces a lot of redundant backward passes. This is visible when benchmarking the circuit

    @qml.qnode(dev, diff_method="adjoint")
    def c(weights):
        qml.templates.AllSinglesDoubles(weights, wires, hf_state, singles, doubles)
        return qml.expval(ham)

| mol | master-serial | master-batched | chunk-serial | chunk-batched |
|-----|---------------|----------------|--------------|---------------|
| CH4 | 1.463e+01 | forever | 5.583e+00 | 3.405e+00 |
| Li2 | 1.201e+01 | forever | 5.284e+00 | 2.658e+00 |
| CO  | 2.357e+01 | forever | 4.716e+00 | 4.577e+00 |
| H10 | 2.992e+01 | forever | 5.476e+00 | 5.469e+00 |
| HCN | 8.622e+01 | forever | 3.144e+01 | 2.452e+01 |

The batched L-GPU runs use 2 x A100 GPUs on ISAIC. The speed-ups of batched versus serial are OK, but most important is the optimization of Hamiltonian::applyInPlace, which brings nice speed-ups between master and this PR.
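Likewise, the master-to-PR serial improvements follow directly from the L-GPU table (master-serial time divided by chunk-serial time):

```python
# Timings (seconds) copied from the L-GPU benchmark table above.
master_serial = {"CH4": 14.63, "Li2": 12.01, "CO": 23.57, "H10": 29.92, "HCN": 86.22}
chunk_serial = {"CH4": 5.583, "Li2": 5.284, "CO": 4.716, "H10": 5.476, "HCN": 31.44}

speedups = {m: master_serial[m] / chunk_serial[m] for m in master_serial}
# CH4 ~2.6x, Li2 ~2.3x, CO ~5.0x, H10 ~5.5x, HCN ~2.7x
assert all(s > 2.0 for s in speedups.values())
```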

Related GitHub Issues:

github-actions[bot] commented 2 months ago

Hello. You may have forgotten to update the changelog! Please edit .github/CHANGELOG.md with:

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 85.91549% with 10 lines in your changes missing coverage. Please review.

Project coverage is 97.40%. Comparing base (00ebcdf) to head (f4b8425). Report is 1 commit behind head on master.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| pennylane_lightning/core/_serialize.py | 66.66% | 9 Missing :warning: |
| ...ne_lightning/lightning_kokkos/_adjoint_jacobian.py | 0.00% | 1 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #873      +/-   ##
==========================================
+ Coverage   88.10%   97.40%   +9.30%
==========================================
  Files          92      222     +130
  Lines       11764    30715   +18951
==========================================
+ Hits        10365    29919   +19554
+ Misses       1399      796     -603
```


AmintorDusko commented 2 months ago

I will re-trigger your CIs as there was some problem with the test pypi.