Open Kenny-Heitritter opened 3 months ago
Thank you very much for this bug report, @Kenny-Heitritter. We just released version 0.7.0, can you please tell us if the problem is more likely or less likely to occur on 0.7.0? There are no direct fixes for this issue in 0.7.0, but the timing likely changed, so it would be good to know if we should focus our debug efforts on a specific version or not.
A few items to note:
The Docker image isn't quite on our main channel yet, so please use nvcr.io/nvidia/nightly/cuda-quantum:0.7.0 (which includes nightly in the path). This will only be necessary for the next week or so and then you will see it on the main channel.
The Python UCCSD API changed slightly in 0.7.0, so you'll need to apply this change to your test script.
--- test_0.6.0.py 2024-03-20 13:21:03.138949476 +0000
+++ test_0.7.0.py 2024-03-20 13:31:09.739183293 +0000
@@ -128,7 +128,7 @@
for i in range(nelec):
kernel.x(qubits[i])
-cudaq.kernels.uccsd(kernel, qubits, thetas, nelec, qubits_num)
+kernel.apply_call(cudaq.kernels.uccsd, qubits, thetas, nelec, qubits_num)
parameter_count = cudaq.kernels.uccsd_num_parameters(nelec,qubits_num)
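If the same test script needs to run against both 0.6.0 and 0.7.0 while comparing versions, a small wrapper along these lines might help. This is only a sketch: add_uccsd and its use_apply_call flag are hypothetical names introduced here, and the two call forms are taken directly from the diff above. The cudaq module is passed in as an argument so the helper itself does not import anything.

```python
def add_uccsd(cudaq_module, kernel, qubits, thetas, nelec, qubits_num,
              use_apply_call=True):
    """Append the UCCSD ansatz to `kernel` using one of the two call forms.

    Hypothetical helper, not part of CUDA Quantum.
    use_apply_call=True  -> 0.7.0 form from the diff (kernel.apply_call)
    use_apply_call=False -> 0.6.0 form from the diff (direct call)
    """
    uccsd = cudaq_module.kernels.uccsd
    if use_apply_call:
        # 0.7.0: the kernel applies the UCCSD callable to the arguments.
        kernel.apply_call(uccsd, qubits, thetas, nelec, qubits_num)
    else:
        # 0.6.0: the UCCSD function receives the kernel as its first argument.
        uccsd(kernel, qubits, thetas, nelec, qubits_num)
```

In the test script this would replace the changed line, e.g. add_uccsd(cudaq, kernel, qubits, thetas, nelec, qubits_num, use_apply_call=False) when running in the 0.6.0 container.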
Thanks @bmhowe23! Just tested the same test script, modulo the new UCCSD API shown above, and it does appear the issue is present to the same degree in 0.7.0. Please do let me know if there are any other tests I can run which would be helpful.
@Kenny-Heitritter I am still trying to reproduce the issue on servers that I have access to (unsuccessfully thus far), but if you would like to try https://github.com/NVIDIA/cuda-quantum/pkgs/container/cuda-quantum-dev/196615788?tag=pr-1444-base [edit: see new link in comment below] on your system, please feel free. This is a "cuda-quantum-dev" image, so it will be slightly different from a "cuda-quantum" image, but I think you should be able to run any C++/Python examples that you place in the container, just like normal. One notable difference is that the binaries are installed in /usr/local/cudaq instead of /opt/nvidia/cudaq. Hopefully that doesn't matter to you.
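In case the install location does matter to a test harness, here is a minimal sketch of a lookup that checks both locations mentioned above. The function name find_cudaq_prefix and the order in which the paths are tried are assumptions made for illustration; only the two paths themselves come from this thread.

```python
import os

# The two install locations mentioned in this thread:
# the regular "cuda-quantum" image and the "cuda-quantum-dev" image.
CANDIDATE_PREFIXES = ("/opt/nvidia/cudaq", "/usr/local/cudaq")


def find_cudaq_prefix(candidates=CANDIDATE_PREFIXES):
    """Return the first candidate directory that exists, or None.

    Hypothetical helper, not part of CUDA Quantum.
    """
    for prefix in candidates:
        if os.path.isdir(prefix):
            return prefix
    return None
```

A script could call find_cudaq_prefix() once at startup and build any binary paths from the result, instead of hard-coding either location.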
The old link expired, so here is a new one: https://github.com/NVIDIA/cuda-quantum/pkgs/container/cuda-quantum-dev/200241787?tag=pr-1444-base
@Kenny-Heitritter We've seen some positive results from this image and will likely include its change in our next release. Feel free to test it out if you'd like: https://github.com/NVIDIA/cuda-quantum/pkgs/container/cuda-quantum-dev/235623747?tag=pr-1444-base
(Thanks @jfriel-oqc!)
Required prerequisites
Describe the bug
When running VQEs requiring larger amounts of memory from within the CUDA Quantum docker container (v0.6.0) on NVIDIA GH200, there is an increasing chance of getting the following error:
Steps to reproduce the bug
Expected behavior
The code should run without producing an error.
Is this a regression? If it is, put the last known working version (or commit) here.
Not a regression
Environment
Suggestions
No response