C++ segfault when passing callable kernel to another kernel from library

amccaskey commented 3 weeks ago

Take the following files

// lib.h 
#include "cudaq.h"

void kernel(cudaq::qvector<>& q);

// lib.cpp 
#include "lib.h"

__qpu__ void kernel(cudaq::qvector<>& q) {
    x(q[0]);
}

and

// user.cpp
#include "lib.h"

__qpu__ void userKernel(const std::function<void(cudaq::qvector<> &)> &init) {
  cudaq::qvector q(2);
  init(q);
}

int main() { userKernel(kernel); }

Compile and link with the following

nvq++ --enable-mlir -fPIC -c lib.cpp -o lib.o
nvq++ --enable-mlir -fPIC lib.o user.cpp 
# Then run
CUDAQ_LOG_LEVEL=info ./a.out

This results in a segmentation fault.

Can anyone else reproduce this? I would be very thankful for anyone's help on this one. This kind of pattern will be a primary feature of future downstream libraries.

Another variation would be this

// lib.h 
#include "cudaq.h"

std::function<void(cudaq::qvector<>&)> get_kernel();

// lib.cpp
#include "lib.h"

__qpu__ void kernel(cudaq::qvector<> &q) { x(q[0]); }

std::function<void(cudaq::qvector<> &)> get_kernel() { return kernel; }

// user.cpp
#include "lib.h"

__qpu__ void userKernel(const std::function<void(cudaq::qvector<> &)> &init) {
  cudaq::qvector q(2);
  init(q);
}

int main() { userKernel(get_kernel()); }

sacpis commented 3 weeks ago

I am able to reproduce the segmentation fault with the first example.

root@ea401e2-lcedt:/workspaces/cuda-quantum/examples/cpp# CUDAQ_LOG_LEVEL=info ./a.out 
[2024-08-20 00:13:39.899] [info] [PluginUtils.h:24] Requesting N5cudaq16quantum_platformE plugin via symbol name getQuantumPlatform.
[2024-08-20 00:13:39.899] [info] [PluginUtils.h:36] Successfully loaded the plugin.
[2024-08-20 00:13:39.899] [info] [PluginUtils.h:24] Requesting N5nvqir16CircuitSimulatorE plugin via symbol name getCircuitSimulator.
[2024-08-20 00:13:39.899] [info] [PluginUtils.h:36] Successfully loaded the plugin.
[2024-08-20 00:13:39.942] [info] [NVQIR.cpp:82] Creating the custatevec-fp32 backend.
[2024-08-20 00:13:39.942] [info] [CircuitSimulator.h:901] Allocating 2 new qubits.
[2024-08-20 00:13:39.942] [info] [CuStateVecCircuitSimulator.cpp:170] GPU 0 Allocating new qubit array of size 2.
Segmentation fault (core dumped)

1tnguyen commented 3 weeks ago

This could be a bridge issue in handling the const std::function<void(cudaq::qvector<> &)> &init argument.

Looking at the generated code:

define void @_Z10userKernelRKSt8functionIFvRN5cudaq7qvectorILm2EEEEE({ i8*, i8* } %0) local_unnamed_addr

as compared to the LLVM one:

 define linkonce_odr dso_preemptable void @_Z10userKernelRKSt8functionIFvRN5cudaq7qvectorILm2EEEEE(ptr noundef nonnull align 8 dereferenceable(32) %init) #5 personality ptr @__gxx_personality_v0 !dbg !3373

For some reason the argument is interpreted as a pair of pointers passed by value, rather than a single pointer to the std::function object? This mismatched argument layout will crash the argsCreator later.

Compiling the app in library mode (with lib.o still compiled in MLIR mode) works; hence the bridge is likely the problem.

@schweitzpgi Do we support std::function arguments yet?

amccaskey commented 3 weeks ago

@schweitzpgi I see this in ConvertCCToLLVM.cpp

void cudaq::opt::populateCCTypeConversions(LLVMTypeConverter *converter) {
  converter->addConversion([](cc::CallableType type) {
    return lambdaAsPairOfPointers(type.getContext());
  });
  ...
}

Looks like this is set up for just lambdas?

amccaskey commented 3 weeks ago

This is also interesting

define { i8*, i64 } @function_kernel_to_sample._Z16kernel_to_sampleRKSt8functionIFvRN5cudaq7qvectorILm2EEEEE.thunk(i8* nocapture readnone %0, i1 %1) {
  %3 = tail call %Array* @__quantum__rt__qubit_allocate_array(i64 2)
  unreachable
}
define i64 @function_kernel_to_sample._Z16kernel_to_sampleRKSt8functionIFvRN5cudaq7qvectorILm2EEEEE.argsCreator(i8** nocapture readnone %0, i8** nocapture writeonly %1) #2 {
...

Just a guess, but could this be why we see a seg fault in the argsCreator function? The thunk is called by altLaunchKernel and hits this unreachable instruction, so execution may run on into whatever sits next in memory, which here would be the argsCreator?

amccaskey commented 3 weeks ago

Here's a test repo for all this

https://github.com/amccaskey/test_cudaq_cpp_py_integration

mkdir build && cd build 
cmake .. -G Ninja -DCUDAQ_DIR=/path/to/cudaq/lib/cmake/cudaq -DCMAKE_BUILD_TYPE=Debug 
ninja 
PYTHONPATH=/path/to/cudaq:$PWD gdb --args python3-dbg test.py

schweitzpgi commented 3 weeks ago

Thanks for the heads-up. I'll add this to my list to look at.

schweitzpgi commented 3 weeks ago

This may be interesting.

% nvq++ --enable-mlir -fkernel-exec-kind=2 -fPIC -g -c lib.cpp -o lib.o
% nvq++ --enable-mlir -fkernel-exec-kind=2 -g -fPIC  lib.o user.cpp
% ./a.out
terminate called after throwing an instance of 'std::runtime_error'
  what():  Wrong kernel launch point: Attempt to launch kernel in streamlined for JIT mode on local simulated QPU. This is not supported.
Aborted
%

NVIDIA / cuda-quantum

C++ segfault when passing callable kernel to another kernel from library #2110