NVIDIA / cuda-quantum

C++ and Python support for the CUDA Quantum programming model for heterogeneous quantum-classical workflows
https://nvidia.github.io/cuda-quantum/

anyon target is not stable in CI #2249

Open schweitzpgi opened 1 month ago

schweitzpgi commented 1 month ago

It appears that the anyon target is not stable in the CI. Various tests will fail to verify when this target is selected and report that the standard output of the process is empty. We believe this is an undiscovered bug somewhere in this target's implementation.

As a workaround, we're removing this target from target tests on an ad hoc basis to keep the CI working in a deterministic way.

bmhowe23 commented 1 month ago

Just leaving this here for when we tackle the underlying root cause. #2253 removed some of the tests, so refer to that PR when adding the tests back in.

bmhowe23 commented 1 month ago

@schweitzpgi - is it possible this is related to https://github.com/NVIDIA/cuda-quantum/issues/1712?

schweitzpgi commented 1 month ago

Might be. These failures were popping up somewhat randomly in the CI, about 50% of the time, for @khalatepradnya, @sacpis, and myself. The failure mode was that nothing appeared on the stdout stream. As it was hard to reproduce locally (smaller machine, fewer threads?), the target was backed out rather than forcing everyone to run the CI twice per rebase on average.
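
For anyone trying to reproduce this locally, here is a minimal sketch of a repeat-runner that counts how often a command produces empty stdout. It is not part of the CUDA-Q test infrastructure. The helper names (`run_once`, `count_empty_stdout`) and the stand-in command are hypothetical, and the `jobs` parameter is only a guess at imitating the more heavily threaded CI machines.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_once(cmd, timeout):
    """Run `cmd` once; return True if it wrote nothing (or only whitespace) to stdout."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return result.stdout.strip() == ""


def count_empty_stdout(cmd, runs=20, jobs=1, timeout=120):
    """Run `cmd` `runs` times, up to `jobs` at once, and count the empty-stdout runs."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return sum(pool.map(lambda _: run_once(cmd, timeout), range(runs)))


if __name__ == "__main__":
    # Hypothetical stand-in: replace with the compiled test binary that fails in CI.
    cmd = [sys.executable, "-c", "print('ok')"]
    runs = 20
    empty = count_empty_stdout(cmd, runs=runs, jobs=4)
    print(f"{empty}/{runs} runs produced empty stdout")
```

Raising `jobs` increases the number of concurrent processes, which may help if the failure depends on machine load. That is an assumption, not something established in this thread.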

Lambdauv commented 1 month ago

Just want to note that the CI tests ran multiple times without failures during the several days spent trying to merge the initial Anyon PR. Is it possible that PRs after Sept 10 added something that increased the parallel workload when building the installer? Simultaneously building multiple components that each rebuild the same libraries might cause problems for the OS on smaller machines [inspired by discussions on sporadic build errors seen here].

bmhowe23 commented 1 month ago

It's true that it ran multiple times in the CI while it was still a PR, so something that came in afterwards could've contributed to the problem. For what it's worth, I don't think the problem is unique to the Anyon target ... this example shows a failure in Remote-Sim/state_amplitude.cpp.