Add native PauliRot implementation in LightningKokkos [sc-71642]

Before submitting

Please complete the following checklist when submitting a PR:

[x] All new features must include a unit test. If you've fixed a bug or added code that should be tested, add a test to the tests directory!
[x] All new functions and code must be clearly commented and documented. If you do make documentation changes, make sure that the docs build and render correctly by running make docs.
[x] Ensure that the test suite passes, by running make test.
[x] Add a new entry to the .github/CHANGELOG.md file, summarizing the change, and including a link back to the PR.
[x] Ensure that code is properly formatted by running make format.

When all the above are checked, delete everything above the dashed line and fill in the pull request template.

Context: Pauli rotations appear in many places, most importantly in the time evolution of quantum chemistry (qchem) Hamiltonians. It is therefore worth considering ways to accelerate their execution.

Description of the Change: Implement applyPauliRot. Invoke applyPauliRot directly from the SV class and add bindings to the Python layer.
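For readers unfamiliar with the operation, here is a minimal pure-Python sketch of what a native Pauli rotation does to a state vector. It is not the Kokkos kernel from this PR: the function name `apply_pauli_rot_sketch`, its argument order, and the convention that wire 0 is the most significant bit are assumptions made for illustration only. It relies on the identity exp(-i θ/2 P) = cos(θ/2) I - i sin(θ/2) P, and on the fact that a Pauli word maps each basis state to exactly one other basis state up to a phase.

```python
import math


def apply_pauli_rot_sketch(state, theta, word, wires, num_qubits):
    """Return exp(-i * theta / 2 * P) applied to `state`, with P the Pauli word.

    Hypothetical helper for illustration; assumes wire 0 is the most
    significant bit of the basis-state index.
    """
    x_mask = 0  # bits flipped by P (X and Y factors)
    z_mask = 0  # bits that contribute a sign (Y and Z factors)
    num_y = 0   # each Y factor also contributes a factor of i
    for pauli, wire in zip(word, wires):
        bit = 1 << (num_qubits - 1 - wire)
        if pauli in ("X", "Y"):
            x_mask |= bit
        if pauli in ("Y", "Z"):
            z_mask |= bit
        if pauli == "Y":
            num_y += 1

    c = math.cos(theta / 2)
    s = math.sin(theta / 2)
    y_phase = 1j ** num_y

    out = [0j] * len(state)
    for j in range(len(state)):
        k = j ^ x_mask  # the only basis state that P maps onto |j>
        sign = -1 if bin(k & z_mask).count("1") % 2 else 1
        out[j] = c * state[j] - 1j * s * y_phase * sign * state[k]
    return out


# Usage: rotate |000> about the word "XYZ" acting on wires 0, 1, 2.
initial = [1 + 0j] + [0j] * 7
rotated = apply_pauli_rot_sketch(initial, 0.7, "XYZ", [0, 1, 2], 3)
```

Each output amplitude reads only two input amplitudes, `state[j]` and `state[j ^ x_mask]`, so no dense 2^n x 2^n matrix is ever built.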

Benefits: Faster Pauli rotations. I benchmarked random PauliRot gates through the Python layer (total runtime > 1.0 sec and at least 5 samples per data point). The data remain noisy with 5 samples because performance depends on the specific "XYZ" sequence (which translates into more or less predictable memory access patterns). Overall, we see an advantage for 3 or more qubits.

[Figure: speedup_vs_ntargets_lk_omp16]

I ran the same benchmark on an A100 card with the Kokkos-CUDA backend, but with at least 500 samples since the absolute timings are quite small, and obtained the following speed-ups.

[Figure: speedup_vs_ntargets_lk_cuda]

Using a full workflow such as

    import pennylane as qml

    # `dev` (the device) and `ham` (the Hamiltonian) are defined elsewhere.
    @qml.qnode(dev, diff_method=None)
    def circuit():
        qml.TrotterProduct(ham, time=1.0, n=1, order=2)
        return qml.state()

to benchmark, we obtain the following timings:

[Figure: time_vs_mol]

For large enough molecules (>= 20 qubits, >= 1000 terms), the new PauliRot kernels have a clear advantage, which grows with molecular size. Note that with L-Kokkos-CUDA, even at the (24 qubits/10k terms) scale, evaluating the circuit is not the main bottleneck, which is why simulating HCN (2.64 sec. apply_lightning vs 32.5 sec. QNode) and N2N2 (7.51 sec. apply_lightning vs 36.4 sec. QNode) takes about the same total time.
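To make the bottleneck remark concrete, the share of the QNode wall time spent in apply_lightning can be computed directly from the four timings quoted above. The variable and function names below are illustrative only.

```python
# (apply_lightning seconds, full QNode seconds), as quoted in the PR description.
timings_sec = {
    "HCN": (2.64, 32.5),
    "N2N2": (7.51, 36.4),
}


def kernel_share(apply_time, qnode_time):
    """Fraction of the QNode wall time spent applying the circuit."""
    return apply_time / qnode_time


shares = {name: kernel_share(*times) for name, times in timings_sec.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%} of QNode time in apply_lightning")
```

In both cases the circuit evaluation accounts for well under half of the QNode time, so the remaining overhead dominates the total.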

Possible Drawbacks:

Related GitHub Issues: [sc-69801]

Codecov Report

Attention: Patch coverage is 88.63636% with 10 lines in your changes missing coverage. Please review.

Project coverage is 97.29%. Comparing base (d5ffb0c) to head (391d0da). Report is 1 commit behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...nylane_lightning/lightning_kokkos/_state_vector.py | 0.00% | 6 Missing :warning: |
| ...ane_lightning/lightning_kokkos/lightning_kokkos.py | 20.00% | 4 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #855      +/-   ##
==========================================
+ Coverage   96.24%   97.29%   +1.04%
==========================================
  Files         212      168      -44
  Lines       28109    21118    -6991
==========================================
- Hits        27054    20547    -6507
+ Misses       1055      571     -484
```


PennyLaneAI / pennylane-lightning
