NVIDIA / cuda-quantum

C++ and Python support for the CUDA Quantum programming model for heterogeneous quantum-classical workflows
https://nvidia.github.io/cuda-quantum/

Misleading error (JIT compilation issue) for remote-mqpu backend when MPI plugin is not activated #1281

Open bettinaheim opened 4 months ago

bettinaheim commented 4 months ago

Describe the bug

In some cases, the execution on the remote-mqpu backend fails with a JIT error along the lines of JIT session error: Symbols not found: [ _Unwind_Resume, _ZNSaIcED2Ev, ...]

The error is caused by the invokeWrappedKernel logic in /runtime/common/JIT.cpp. Specifically, I think we are running into something like this: https://stackoverflow.com/questions/57612173/llvm-jit-symbols-not-found. The _Unwind_Resume symbol comes from the GNU C++ runtime, specifically from libsupc++.a. I double-checked that the produced executable (a.out) itself contains that symbol, so I suspect it is indeed these lines that are not working as expected:

  // Resolve symbols that are statically linked in the current process.
  llvm::orc::JITDylib &mainJD = jit->getMainJITDylib();
  mainJD.addGenerator(llvm::cantFail(
      llvm::orc::DynamicLibrarySearchGenerator::GetForCurrentProcess(
          dataLayout.getGlobalPrefix())));

Steps to reproduce the bug

Minimal repro: Download the latest version of the CUDA Quantum installer for C++, or build it from source. Then run

container=`docker run -itd --rm ubuntu:22.04`
docker cp install_cuda_quantum.$(uname -m) $container:/tmp/
docker cp docs/sphinx/examples/cpp/algorithms/amplitude_estimation.cpp $container:/tmp
docker attach $container
apt-get update && apt-get install -y --no-install-recommends \
            wget ca-certificates libstdc++-11-dev libopenmpi-dev
chmod +x /tmp/install_cuda_quantum.$(uname -m)
/tmp/install_cuda_quantum.$(uname -m) --accept && . /etc/profile

# Fails with the reported JIT exception during execution:
nvq++ --target remote-mqpu /tmp/amplitude_estimation.cpp && ./a.out
objdump -T a.out | grep -i unwind # shows _Unwind_Resume exists

# Works fine:
nvq++ --target remote-mqpu /tmp/amplitude_estimation.cpp --enable-mlir && ./a.out

The installer can be built from source by first building the cuda-quantum-assets image and then building the installer itself:

docker build -t cuda-quantum-assets:latest -f docker/build/assets.Dockerfile .
DOCKER_BUILDKIT=1 docker build -f docker/release/installer.Dockerfile --build-arg base_image=cuda-quantum-assets:latest . --output out

Expected behavior

The example should compile and run without error.

Is this a regression? If it is, put the last known working version (or commit) here.

Not a regression


bettinaheim commented 4 months ago

Of course, the moment I actually write down the exact repro, it occurs to me that what is ultimately causing the issue is the missing MPI plugin:

Proceed as above, but then

 export MPI_PATH=/usr/lib/x86_64-linux-gnu/openmpi
 bash $CUDA_QUANTUM_PATH/distributed_interfaces/activate_custom_mpi.sh
 nvq++ --target remote-mqpu /tmp/amplitude_estimation.cpp && ./a.out # now works (though why is it printing the llvm::dbgs() messages? - that's not nice...)

bettinaheim commented 4 months ago

Edit nr 2: I quickly checked whether I at least get a decent, comprehensible error when MPI is not installed at all. Unfortunately, the compilation succeeds and execution fails with pretty much the same unhelpful JIT error as above.

Options for resolution:

1) Require MPI to be installed to use the remote-mqpu backend. In that case, we need to document this requirement and add a compilation check that gives a clear error along the lines of "This target requires MPI. Please install MPI and try again." when MPI is missing.

2) Do not require MPI, and do the same as we do for the nvidia-mqpu target. I think this is in principle what we already do, and I think it is the better option.
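If option 1 is chosen, the compile-time check could be as simple as the following sketch; the function name and detection logic are illustrative, not nvq++'s actual code:

```shell
# Hypothetical pre-flight check for the remote-mqpu target: accept either an
# explicitly configured MPI_PATH or an MPI compiler wrapper on the PATH.
check_mpi_available() {
  if [ -n "${MPI_PATH:-}" ] || command -v mpicc >/dev/null 2>&1; then
    return 0
  fi
  echo "error: This target requires MPI. Please install MPI and try again." >&2
  return 1
}
```

nvq++ could run such a check when --target remote-mqpu is selected, failing at compile time with an actionable message instead of at JIT time with missing-symbol errors.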