PennyLaneAI / pennylane-lightning

The PennyLane-Lightning plugin provides a fast state-vector simulator written in C++ for use with PennyLane
https://docs.pennylane.ai/projects/lightning
Apache License 2.0

[BUG] Synchronization needed in `lightning.gpu` with cuStateVec 1.6.0 #793

Closed dmclark17 closed 1 month ago

dmclark17 commented 1 month ago

Issue description

I am observing bus errors and segfaults when running the lightning.gpu backend with the latest version of cuStateVec (1.6.0). The issue does not reproduce with older versions of cuStateVec and goes away when CUDA_LAUNCH_BLOCKING=1 is set.

The issue seems to be happening in the custatevecComputeExpectation library call here:

https://github.com/PennyLaneAI/pennylane-lightning/blob/be07bbeeaf931b4841098cdf9248d213249d9bb7/pennylane_lightning/core/src/simulators/lightning_gpu/measurements/MeasurementsGPU.hpp#L815-L823

Because the cuStateVec APIs are asynchronous (more info here), the custatevecComputeExpectation call can return before the expectation value has been written back to the host. The errors are caused by cuStateVec writing to the address of the local expect variable after it has gone out of scope.

The issue is resolved by adding a stream synchronization after the custatevecComputeExpectation call. Should I submit a merge request with this change? Thanks!
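
For illustration, here is a minimal plain-CUDA sketch of the same hazard, separate from the cuStateVec code path (the kernel, buffer names, and sizes below are made up for this example): an asynchronous device-to-host copy whose destination is read, and then freed, by the host.

    // Minimal sketch of the hazard (illustrative only, not the cuStateVec call):
    // the result is written back asynchronously, so reading or freeing the
    // destination before synchronizing the stream races with the device.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void compute_answer(double *out) { *out = 42.0; }

    int main() {
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        double *d_result = nullptr;
        double *h_result = nullptr; // pinned host memory, so the copy below is truly asynchronous
        cudaMalloc(reinterpret_cast<void **>(&d_result), sizeof(double));
        cudaMallocHost(reinterpret_cast<void **>(&h_result), sizeof(double));

        compute_answer<<<1, 1, 0, stream>>>(d_result);
        cudaMemcpyAsync(h_result, d_result, sizeof(double),
                        cudaMemcpyDeviceToHost, stream);

        // Without this synchronization, the read and the frees below can run
        // before the copy has landed; this mirrors `expect` being read (or going
        // out of scope) too early in MeasurementsGPU.hpp.
        cudaStreamSynchronize(stream);
        printf("result = %f\n", *h_result);

        cudaFreeHost(h_result);
        cudaFree(d_result);
        cudaStreamDestroy(stream);
        return 0;
    }

In the Lightning case the destination is the stack variable expect rather than a pinned buffer, but the ordering requirement is the same: the write must complete before the variable is read or its storage is released.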

mlxd commented 1 month ago

Thanks @dmclark17 for the report on this. It would be great if you could post a patch for the cuStateVec calls here (assuming it involves more than just calls to cudaDeviceSynchronize); we can immediately push out a bugfix for the latest version. Also, if you have a representative minimal example that reproduces the failure mode, that would be really helpful for validating the issue. We can then ensure this is tested in our CI pipeline (which doesn't seem to be hitting the issue so far).

dmclark17 commented 1 month ago

A cudaDeviceSynchronize call after custatevecComputeExpectation should be sufficient. We could also synchronize just the stream being used by cuStateVec; since custatevecSetStream is not used, cuStateVec is running on the default stream, so the two approaches should have a similar effect:

    // compute expectation
    ...

    // wait for expectation value to be ready
    cudaStream_t stream_id;
    PL_CUSTATEVEC_IS_SUCCESS(
        custatevecGetStream(this->_statevector.getCusvHandle(), &stream_id));
    PL_CUDA_IS_SUCCESS(cudaStreamSynchronize(stream_id));

    // Free workspace
    ...
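
For completeness, the cudaDeviceSynchronize alternative would look something like the sketch below (illustrative only; inside Lightning this would go through the PL_CUDA_IS_SUCCESS macro rather than a hand-rolled check):

    #include <cuda_runtime.h>
    #include <stdexcept>

    // Block the host until every pending operation on the current device has
    // finished. Coarser than synchronizing only cuStateVec's stream, but
    // equivalent here since the default stream is being used.
    inline void wait_for_device() {
        const cudaError_t err = cudaDeviceSynchronize();
        if (err != cudaSuccess) {
            throw std::runtime_error(cudaGetErrorString(err));
        }
    }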

I encounter the error when computing an adjoint Jacobian; however, it seems to only impact Grace Hopper systems, as I wasn't able to reproduce it on an x86-hosted A100:

    import pennylane as qml

    # Example problem size and device (values here are illustrative)
    wires = 10
    layers = 3
    dev = qml.device("lightning.gpu", wires=wires)

    # Create QNode of device and circuit
    @qml.qnode(dev, diff_method="adjoint")
    def circuit(parameters):
        qml.StronglyEntanglingLayers(weights=parameters, wires=range(wires))
        return qml.math.hstack([qml.expval(qml.PauliZ(i)) for i in range(wires)])

    # Set trainable parameters for calculating circuit Jacobian
    shape = qml.StronglyEntanglingLayers.shape(n_layers=layers, n_wires=wires)
    weights = qml.numpy.random.random(size=shape)

    # Warm up
    jac = qml.jacobian(circuit)(weights)

tomlqc commented 1 month ago

Thanks @dmclark17 for the hint and the example.

dmclark17 commented 1 month ago

No problem! Let me know if I can provide any additional information, or if I should submit a pull request!

CatalinaAlbornoz commented 1 month ago

We'll let you know @dmclark17 !

mlxd commented 1 month ago

Just an FYI @dmclark17: the fixes should now be available with PR #823.