PennyLaneAI / pennylane-lightning

The PennyLane-Lightning plugin provides a fast state-vector simulator written in C++ for use with PennyLane
https://docs.pennylane.ai/projects/lightning
Apache License 2.0

Chunk Hamiltonian, PauliSentence, LinearCombination [sc-65680] #873

Closed vincentmr closed 2 months ago

vincentmr commented 2 months ago



Context: Parallelizing over observables can accelerate the backward pass of adjoint Jacobian calculations. This PR revisits our implementation for L-Qubit and L-GPU, the two devices that support it. Certain observables, such as Hamiltonian, PauliSentence, and LinearCombination, can be split into many observables, enabling the cost of the expectation value computation to be distributed. This strategy is initiated by the serializer, which partitions the observables if split_obs is not False. The serializer performs a complete partitioning, meaning a 1000-PauliWord PauliSentence is partitioned into 1000 individual PauliWords. We note in passing that L-Qubit does not split observables at all, since it does not pass a split_obs value to _process_jacobian_tape. Full partitioning is wasteful because we end up in either of two situations:

We explore chunking instead of full partitioning for LinearCombination-like objects, meaning a 1000-PauliWord PauliSentence is partitioned into four 250-PauliWord PauliSentences if we parallelize over 4 processes.
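As an illustrative sketch (plain Python, not the actual serializer code; `chunk_terms` is a hypothetical helper), the difference between full partitioning and chunking looks like this:

```python
def chunk_terms(terms, num_chunks):
    """Split `terms` into `num_chunks` contiguous, near-equal chunks."""
    base, rem = divmod(len(terms), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < rem else 0)  # spread the remainder over the first chunks
        chunks.append(terms[start:start + size])
        start += size
    return chunks

terms = [f"P{i}" for i in range(1000)]   # stand-ins for PauliWords
full_partition = [[t] for t in terms]    # old behaviour: 1000 single-term observables
chunks = chunk_terms(terms, 4)           # new behaviour: 4 observables of 250 terms each
assert [len(c) for c in chunks] == [250, 250, 250, 250]
```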

Description of the Change: Modify the serializer to chunk LinearCombination-like objects if self.split_obs is truthy. Correctly route _batch_obs such that L-Qubit splits observables. Enhance/adapt tests.

Analysis: Lightning-Qubit

applyObservable is a bottleneck for somewhat large linear combinations (say 100s or 1000s of terms). Chunking isn't helpful for a circuit like

    @qml.qnode(dev, diff_method="adjoint")
    def c(weights):
        qml.templates.AllSinglesDoubles(weights, wires, hf_state, singles, doubles)
        return qml.expval(ham)

because L-Qubit's applyObservable method is parallelized over terms for a single Hamiltonian observable. Chunking in this case is counter-productive because it requires extra state vectors, extra backward passes, etc.

However, for a circuit like

    @qml.qnode(dev, diff_method="adjoint")
    def c(weights):
        qml.templates.AllSinglesDoubles(weights, wires, hf_state, singles, doubles)
        return np.array([qml.expval(ham), qml.expval(qml.PauliZ(0))])

applyObservable is parallelized over observables, which scales only up to 2 threads and suffers from poor load balance. In this case, it is better to split the observable, which is what the current changes do.
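A toy cost model (an assumption for illustration, not Lightning's actual scheduler) makes the load-balance argument concrete: with one 1000-term Hamiltonian and one PauliZ, observable-level parallelism leaves one worker doing essentially all the work, while chunking the Hamiltonian balances the load:

```python
import heapq

def makespan(task_costs, num_workers):
    """Greedy longest-task-first schedule; returns the busiest worker's total load."""
    loads = [0] * num_workers
    heapq.heapify(loads)
    for cost in sorted(task_costs, reverse=True):
        lightest = heapq.heappop(loads)   # assign each task to the least-loaded worker
        heapq.heappush(loads, lightest + cost)
    return max(loads)

# Cost model: one "unit" per Hamiltonian term.
no_split = makespan([1000, 1], num_workers=4)            # one worker bears ~all the cost
chunked = makespan([250, 250, 250, 250, 1], num_workers=4)  # near-ideal balance
assert no_split == 1000 and chunked == 251
```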

| mol | master-serial | master-batched | chunk-serial | chunk-batched |
|-----|---------------|----------------|--------------|---------------|
| CH4 | 1.793e+01 | 1.330e+01 | 1.819e+01 | 8.040e+00 |
| Li2 | 5.333e+01 | 3.354e+01 | 5.289e+01 | 1.839e+01 |
| CO  | 9.817e+01 | 5.945e+01 | 9.619e+01 | 2.559e+01 |
| H10 | 1.220e+02 | 7.317e+01 | 1.182e+02 | 3.305e+01 |

So for this circuit the current PR yields speed-ups ranging from 1.5x to over 2x by using obs-batching + chunking (compared with the previous obs-batching).
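The quoted speed-ups can be checked directly against the table (master-batched time divided by chunk-batched time):

```python
# Timings (seconds) copied from the benchmark table above.
master_batched = {"CH4": 13.30, "Li2": 33.54, "CO": 59.45, "H10": 73.17}
chunk_batched = {"CH4": 8.040, "Li2": 18.39, "CO": 25.59, "H10": 33.05}

speedups = {m: master_batched[m] / chunk_batched[m] for m in master_batched}
# CH4 ~1.65x, Li2 ~1.82x, CO ~2.32x, H10 ~2.21x
assert min(speedups.values()) > 1.5 and max(speedups.values()) > 2.0
```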

Lightning-GPU

Lightning-GPU splits the observables as soon as batch_obs is true. The current code splits a Hamiltonian into all its individual terms, which is quite inefficient and induces a lot of redundant backward passes. This is visible when benchmarking the circuit

    @qml.qnode(dev, diff_method="adjoint")
    def c(weights):
        qml.templates.AllSinglesDoubles(weights, wires, hf_state, singles, doubles)
        return qml.expval(ham)

| mol | master-serial | master-batched | chunk-serial | chunk-batched |
|-----|---------------|----------------|--------------|---------------|
| CH4 | 1.463e+01 | forever | 5.583e+00 | 3.405e+00 |
| Li2 | 1.201e+01 | forever | 5.284e+00 | 2.658e+00 |
| CO  | 2.357e+01 | forever | 4.716e+00 | 4.577e+00 |
| H10 | 2.992e+01 | forever | 5.476e+00 | 5.469e+00 |
| HCN | 8.622e+01 | forever | 3.144e+01 | 2.452e+01 |

The batched L-GPU runs use 2 x A100 GPUs on ISAIC. The speed-ups of batched versus serial are OK, but most important is the optimization of Hamiltonian::applyInPlace, which brings nice speed-ups between master and this PR.
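Likewise, the master-to-PR serial improvements follow directly from the L-GPU table (master-serial time divided by chunk-serial time):

```python
# Timings (seconds) copied from the L-GPU benchmark table above.
master_serial = {"CH4": 14.63, "Li2": 12.01, "CO": 23.57, "H10": 29.92, "HCN": 86.22}
chunk_serial = {"CH4": 5.583, "Li2": 5.284, "CO": 4.716, "H10": 5.476, "HCN": 31.44}

speedups = {m: master_serial[m] / chunk_serial[m] for m in master_serial}
# CH4 ~2.6x, Li2 ~2.3x, CO ~5.0x, H10 ~5.5x, HCN ~2.7x
assert all(s > 2.0 for s in speedups.values())
```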

Related GitHub Issues:

github-actions[bot] commented 2 months ago

Hello. You may have forgotten to update the changelog! Please edit .github/CHANGELOG.md with:

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 85.91549% with 10 lines in your changes missing coverage. Please review.

Project coverage is 97.40%. Comparing base (00ebcdf) to head (f4b8425). Report is 1 commit behind head on master.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| pennylane_lightning/core/_serialize.py | 66.66% | 9 Missing :warning: |
| ...ne_lightning/lightning_kokkos/_adjoint_jacobian.py | 0.00% | 1 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #873      +/-   ##
==========================================
+ Coverage   88.10%   97.40%   +9.30%
==========================================
  Files          92      222     +130
  Lines       11764    30715   +18951
==========================================
+ Hits        10365    29919   +19554
+ Misses       1399      796     -603
```


AmintorDusko commented 2 months ago

I will re-trigger your CIs as there was some problem with the test pypi.