**Closed**: vincentmr closed this pull request 2 months ago.
Hello. You may have forgotten to update the changelog! Please edit `.github/CHANGELOG.md` with an entry summarizing this change.
Attention: Patch coverage is 85.91549% with 10 lines in your changes missing coverage. Please review.
Project coverage is 97.40%. Comparing base (00ebcdf) to head (f4b8425). Report is 1 commit behind head on master.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| pennylane_lightning/core/_serialize.py | 66.66% | 9 Missing :warning: |
| ...ne_lightning/lightning_kokkos/_adjoint_jacobian.py | 0.00% | 1 Missing :warning: |
I will re-trigger your CIs, as there was a problem with the test PyPI.
Before submitting

Please complete the following checklist when submitting a PR:

- [x] All new features must include a unit test. If you've fixed a bug or added code that should be tested, add a test to the `tests` directory!
- [x] All new functions and code must be clearly commented and documented. If you do make documentation changes, make sure that the docs build and render correctly by running `make docs`.
- [x] Ensure that the test suite passes, by running `make test`.
- [x] Add a new entry to the `.github/CHANGELOG.md` file, summarizing the change and including a link back to the PR.
- [x] Ensure that code is properly formatted by running `make format`.

When all the above are checked, delete everything above the dashed line and fill in the pull request template.
**Context:** Parallelizing over observables can accelerate the backward pass of adjoint Jacobian calculations. This PR revisits our implementation for L-Qubit and L-GPU, which are the two devices that support it. Certain observables, such as Hamiltonian, PauliSentence, and LinearCombination, can be split into many observables, enabling the cost of expectation value computation to be distributed. This strategy is initiated by the serializer, which partitions the observables if `split_obs` is not `False`. The serializer proceeds to a complete partitioning, meaning a 1000-PauliWord PauliSentence is partitioned into 1000 PauliWords. We note in passing that L-Qubit does not split observables, since it does not pass a `split_obs` value to `_process_jacobian_tape`. This is wasteful because we end up in either of two situations.

We explore chunking instead of full partitioning for LinearCombination-like objects, meaning a 1000-PauliWord PauliSentence is partitioned into four 250-PauliWord PauliSentences if we parallelize over 4 processes.
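To make the difference concrete, here is a minimal, hypothetical sketch of chunking versus full partitioning; `chunk_observables` and `num_chunks` are illustrative names introduced for this example, not the serializer's actual API.

```python
import pennylane as qml

def chunk_observables(obs, num_chunks):
    """Split a LinearCombination-like observable into at most ``num_chunks``
    smaller observables of roughly equal size (hypothetical helper)."""
    coeffs, ops = obs.terms()
    size = -(-len(ops) // num_chunks)  # ceiling division
    return [
        qml.Hamiltonian(coeffs[i : i + size], ops[i : i + size])
        for i in range(0, len(ops), size)
    ]

# Full partitioning: a 1000-term observable becomes 1000 single-term observables.
# Chunking over 4 processes: the same observable becomes 4 observables of 250 terms each.
H = qml.Hamiltonian([1.0] * 1000, [qml.PauliZ(i % 4) for i in range(1000)])
chunks = chunk_observables(H, num_chunks=4)
assert len(chunks) == 4
```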
**Description of the Change:** Modify the serializer to chunk LinearCombination-like objects if `self.split_obs` is truthy. Correctly route `_batch_obs` such that L-Qubit splits observables. Enhance/adapt tests.

**Analysis:**

**Lightning-Qubit**
`applyObservable` is a bottleneck for somewhat large linear combinations (say, hundreds or thousands of terms). Chunking isn't helpful for a circuit that returns a single `Hamiltonian` expectation value, because L-Qubit's `applyObservable` method is parallelized over terms for a single `Hamiltonian` observable. Chunking in this case is counter-productive because it requires extra state vectors, extra backward passes, etc.

For a circuit that returns a couple of observables, however, `applyObservable` is parallelized over observables, which only scales up to 2 threads, and with poor load balance. In this case, it is better to split the observable, which is what the current changes do. For this kind of circuit, the current PR yields speed-ups ranging from 1.5x to >2x by using obs-batching + chunking (compared with the previous obs-batching).
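The benchmark circuits themselves are not reproduced in this description. The sketch below is a hypothetical illustration of the two cases discussed above, with made-up wire counts, terms, and parameters; only `batch_obs=True` and `diff_method="adjoint"` are the actual options being exercised.

```python
import pennylane as qml
from pennylane import numpy as np

n_wires = 10
# A large LinearCombination-like observable (1000 Pauli words, arbitrary terms).
terms = [qml.PauliZ(i % n_wires) @ qml.PauliX((i + 1) % n_wires) for i in range(1000)]
H = qml.Hamiltonian([1.0] * len(terms), terms)

dev = qml.device("lightning.qubit", wires=n_wires, batch_obs=True)

# Case 1: a single Hamiltonian expectation value. applyObservable is already
# parallelized over the terms of H, so chunking only adds state vectors and
# extra backward passes.
@qml.qnode(dev, diff_method="adjoint")
def single_obs(params):
    for i in range(n_wires):
        qml.RX(params[i], wires=i)
    return qml.expval(H)

# Case 2: two returned observables. Parallelizing over observables alone scales
# to at most 2 threads with poor load balance; splitting/chunking H restores it.
@qml.qnode(dev, diff_method="adjoint")
def two_obs(params):
    for i in range(n_wires):
        qml.RX(params[i], wires=i)
    return qml.expval(H), qml.expval(qml.PauliY(0))

params = np.random.rand(n_wires, requires_grad=True)
jac = qml.jacobian(single_obs)(params)
res = two_obs(params)  # forward pass; the adjoint Jacobian of this shape is where chunking pays off
```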
**Lightning-GPU**
Lightning-GPU splits the observables as soon as `batch_obs` is true. The current code splits a Hamiltonian into all its individual terms, which is quite inefficient and induces a lot of redundant backward passes. This is visible when benchmarking a circuit with a large Hamiltonian observable. The batched L-GPU runs use 2 x A100 GPUs on ISAIC. The speed-ups for batched versus serial are OK, but most important is the optimization of `Hamiltonian::applyInPlace`, which brings about nice speed-ups between master and this PR.
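For reference, here is a hypothetical snippet showing how observable batching is enabled on Lightning-GPU; the circuit, sizes, and parameters are illustrative, while `batch_obs=True` and the adjoint differentiation method are the actual options discussed above.

```python
import pennylane as qml
from pennylane import numpy as np

n_wires = 20
terms = [qml.PauliZ(i) @ qml.PauliZ((i + 1) % n_wires) for i in range(n_wires)]
H = qml.Hamiltonian([1.0] * len(terms), terms)

# batch_obs=True asks Lightning-GPU to distribute the observables (and hence the
# adjoint-Jacobian backward passes) over the available GPUs.
dev = qml.device("lightning.gpu", wires=n_wires, batch_obs=True)

@qml.qnode(dev, diff_method="adjoint")
def circuit(params):
    for i in range(n_wires):
        qml.RY(params[i], wires=i)
    return qml.expval(H)

params = np.random.rand(n_wires, requires_grad=True)
jac = qml.jacobian(circuit)(params)
```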
**Related GitHub Issues:**