Open alexfrolov opened 5 months ago
Cricket currently supports C/R only for applications that use nothing but the CUDA Runtime API. It looks like your checkpoint contains a call to a Driver API function, for which there is currently no C/R support. Are you able to share the code? How did you launch the application, and how did you create the checkpoint?
Hi!
By runtime API, do you mean the ./gpu part of Cricket? Yes, I was able to generate a checkpoint for one of your samples (probably test_apps/matmul.cu).
Do you have any plans to add support for C/R in "cpu" mode?
Best, Alex
Hey,
I mean the CUDA Runtime API (see https://docs.nvidia.com/cuda/cuda-runtime-api/index.html). Functions from the CUDA Driver API (see https://docs.nvidia.com/cuda/cuda-driver-api/index.html) are not supported. You are getting the segfault in a Driver API call because this function is supported for remote execution but not for checkpointing. The restore tries to recover something that was never saved to the checkpoint file.
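To make the Runtime/Driver distinction concrete, here is a minimal sketch that sorts a symbol name by CUDA's naming convention: documented Runtime API functions start with cuda (e.g. cudaMalloc), Driver API functions start with cu followed by an uppercase letter (e.g. cuMemAlloc), and the __cuda* names are the registration hooks that nvcc-generated code calls. The function name cuda_api_kind and its labels are hypothetical and not part of Cricket. The result only says which API family a name belongs to, not how Cricket's server handles the call internally.

```shell
# Hypothetical helper: classify a symbol name by CUDA naming convention.
# $1: symbol name without any @version suffix.
cuda_api_kind() {
    case "$1" in
        __cuda*)          echo "runtime-internal" ;;  # nvcc registration hooks
        cuda*)            echo "runtime" ;;           # documented Runtime API
        cu[[:upper:]]*)   echo "driver" ;;            # Driver API, e.g. cuMemAlloc
        *)                echo "other" ;;
    esac
}
```

For example, cuda_api_kind cuMemAlloc reports a Driver API name, while cuda_api_kind cudaMalloc reports a Runtime API name.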
Hi!
AFAIU, the call to __cudaRegisterFunction is inserted by NVCC when it generates the binary code. Is it possible to avoid it by using some options for nvcc?
Have you linked to the CUDA libraries dynamically, i.e., using -cudart shared as an nvcc option? If I remember correctly, your error might happen if you link statically.
I also encounter this bug when compiling my simple CUDA application with the shared cudart library. It seems that __cudaRegisterFunction is called via the runtime library:
$ nm matrixMult.bin | grep cuda
0000000000001be4 t _Z16cudaLaunchKernelIcE9cudaErrorPKT_4dim3S4_PPvmP11CUstream_st
0000000000005048 b _ZL20__cudaFatCubinHandle
0000000000005070 b _ZL20__cudaFatCubinHandle
0000000000005050 b _ZL22__cudaPrelinkedFatbins
0000000000001b86 t _ZL24__sti____cudaRegisterAllv
000000000000191d t _ZL26__cudaUnregisterBinaryUtilv
0000000000001b20 t _ZL31__nv_cudaEntityRegisterCallbackPPv
0000000000005080 b _ZZL31__nv_cudaEntityRegisterCallbackPPvE5__ref
U __cudaInitModule@libcudart.so.12
U __cudaPopCallConfiguration@libcudart.so.12
U __cudaPushCallConfiguration@libcudart.so.12
U __cudaRegisterFatBinary@libcudart.so.12
U __cudaRegisterFatBinaryEnd@libcudart.so.12
U __cudaRegisterFunction@libcudart.so.12
0000000000001329 t __cudaUnregisterBinaryUtil
U __cudaUnregisterFatBinary@libcudart.so.12
U cudaFree@libcudart.so.12
U cudaLaunchKernel@libcudart.so.12
U cudaMalloc@libcudart.so.12
U cudaMemcpy@libcudart.so.1
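The check above can be scripted. The following is a minimal sketch, assuming nm prints versioned undefined symbols in the "U name@libcudart.so.N" form shown in the listing; the helper names cudart_imports and imports_from_cudart are hypothetical and not part of Cricket or the CUDA toolkit.

```shell
# Hypothetical helper: list the undefined ("U") symbols that nm reports as
# provided by libcudart. Reads nm output on stdin and prints one bare
# symbol name per line, with the @libcudart.so.N suffix stripped.
cudart_imports() {
    awk '$1 == "U" && $2 ~ /@libcudart\.so/ { sub(/@.*/, "", $2); print $2 }'
}

# Hypothetical helper: succeed if the given symbol is imported from libcudart.
# $1: symbol name; nm output on stdin.
imports_from_cudart() {
    cudart_imports | grep -qxF -- "$1"
}
```

A possible use is nm matrixMult.bin | imports_from_cudart __cudaRegisterFunction, which succeeds when the binary resolves __cudaRegisterFunction against the shared runtime library, as in the listing above.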
Hi!
I want to try Cricket for C/R in cpu mode (no in-kernel checkpointing). However, when I run the restore, it fails with a segfault.
After a little debugging, I found that the problem comes from the use of rpc_register_function_1_svc in the restore process (see gdb trace). The comments say that it does not support checkpoint/restore, but I have not found a way to avoid it, because it is called from __cudaRegisterFunction on the client side. Does this mean that C/R does not currently work in Cricket in cpu mode? Thank you!