RWTH-ACS / cricket

cricket is a virtualization solution for GPUs
MIT License

Segmentation violation during checkpoint restore (cpu mode) #49

Open alexfrolov opened 5 months ago

alexfrolov commented 5 months ago

Hi!

I want to try cricket for C/R in cpu mode (no in-kernel checkpointing). However, when I run restore, it fails with a segfault.

(gdb) r
Starting program: /home/alexndrfrolov/cricket/cpu/cricket-rpc-server 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
welcome to cricket!
+03:00:00.000003 INFO:  restoring previous state was enabled by setting CRICKET_RESTORE
+03:00:00.000146 DEBUG: restoring rpc_id from ckp/rpc_id
+03:00:00.000189 DEBUG: using prog=99, vers=1   in cpu-server.c:220
+03:00:00.000200 INFO:  using TCP...
+03:00:00.000766 INFO:  listening on port 49338
+03:00:00.001007 DEBUG: sched_none_init
[New Thread 0x7fffb47ff000 (LWP 2666702)]
+03:00:00.673881 DEBUG: restoring api records from ckp/api_records
+03:00:00.673948 DEBUG: function: 50 

Thread 1 "cricket-rpc-ser" received signal SIGSEGV, Segmentation fault.
0x00007fffb8381b1d in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007fffb8381b1d in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fffb824dd31 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x000055555557c922 in loggf (level=3 '\003', formatstr=0x5555555973d8 "rpc_register_function(fatCubinHandle: %p, hostFun: %p, deviceFun: %s, deviceName: %s, thread_limit: %d)") at log.c:98
#3  0x000055555557a38f in rpc_register_function_1_svc (fatCubinHandle=94419555140752, hostFun=94419554144212, deviceFun=0x56340424fd90 <error: Cannot access memory at address 0x56340424fd90>, 
    deviceName=0x5634044f2e90 <error: Cannot access memory at address 0x5634044f2e90>, thread_limit=-1, result=0x555555975c50, rqstp=0x7fffffffdfc0) at cpu-server-driver.c:111
#4  0x0000555555564913 in _rpc_register_function_1 (argp=0x555555972740, result=0x555555975c50, rqstp=0x7fffffffdfc0) at cpu_rpc_prot_svc_mod.c:46
#5  0x0000555555583534 in cr_call_record (record=0x555555974830) at cr.c:714
#6  0x0000555555583889 in cr_restore_resources (path=0x5555555963fb "ckp", record=0x555555974830, rm_memory=0x5555555a5d60 <rm_memory>, rm_streams=0x5555555a5ac0 <rm_streams>, rm_events=0x5555555a5c40 <rm_events>, 
    rm_arrays=0x5555555a61a0 <rm_arrays>, rm_cusolver=0x5555555a5be0 <rm_cusolver>, rm_cublas=0x5555555a5e80 <rm_cublas>) at cr.c:772
#7  0x0000555555583d55 in cr_restore (path=0x5555555963fb "ckp", rm_memory=0x5555555a5d60 <rm_memory>, rm_streams=0x5555555a5ac0 <rm_streams>, rm_events=0x5555555a5c40 <rm_events>, rm_arrays=0x5555555a61a0 <rm_arrays>, 
    rm_cusolver=0x5555555a5be0 <rm_cusolver>, rm_cublas=0x5555555a5e80 <rm_cublas>) at cr.c:870
#8  0x00005555555710c1 in server_runtime_restore (path=0x5555555963fb "ckp") at cpu-server-runtime.c:141
#9  0x0000555555570e3b in server_runtime_init (restore=1) at cpu-server-runtime.c:87
#10 0x000055555556ed54 in cricket_main (prog_num=99, vers_num=1) at cpu-server.c:284
#11 0x0000555555592752 in main (argc=1, argv=0x7fffffffe3d8) at server-exe.c:11

After a little debugging, I found that the problem comes from rpc_register_function_1_svc being used in the restore process (see the gdb trace). The comments say it does not support checkpoint/restore, but I have not found a way to avoid it, because it is called from __cudaRegisterFunction on the client side.

Does this mean that C/R does not currently work in Cricket's cpu mode? Thank you!

n-eiling commented 5 months ago

Cricket currently only supports C/R when you only use the runtime API. It looks like your checkpoint contains a call to a driver API function for which there is currently no C/R support. Are you able to share the code? How have you launched the application and how have you created the checkpoint?

alexfrolov commented 5 months ago

Hi!

By runtime API, do you mean the ./gpu part of Cricket? Yes, I was able to generate a checkpoint for one of your samples (probably test_apps/matmul.cu).

Do you have any plans to add support for C/R in "cpu" mode?

Best, Alex

n-eiling commented 5 months ago

Hey,

I mean the CUDA Runtime API (see https://docs.nvidia.com/cuda/cuda-runtime-api/index.html). No function from the CUDA Driver API (see https://docs.nvidia.com/cuda/cuda-driver-api/index.html) is supported. You are getting the segfault in a Driver API call because this function is supported for remote execution but not for checkpointing: it tries to restore something that was never saved to the checkpoint file.
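The two APIs are easy to tell apart by symbol prefix: Driver API entry points start with cu (cuMemAlloc, cuLaunchKernel), Runtime API entry points with cuda (cudaMalloc), and the compiler-inserted registration hooks with __cuda. As a rough illustration (the awk filter, helper name, and sample symbols here are my own, not part of Cricket), you can classify the undefined symbols from nm output like this:

```shell
# Classify undefined CUDA symbols by their prefix. For a real binary, pipe
# `nm -D ./your_app` into classify; the here-doc below is sample input.
classify() {
  awk '$1 == "U" {
    sym = $2; sub(/@.*/, "", sym)              # drop "@libcudart.so.12" suffix
    if      (sym ~ /^__cuda/) print "internal hook:", sym
    else if (sym ~ /^cuda/)   print "runtime API: ", sym
    else if (sym ~ /^cu/)     print "driver API:  ", sym
  }'
}

classify <<'EOF'
                 U __cudaRegisterFunction@libcudart.so.12
                 U cudaMalloc@libcudart.so.12
                 U cuMemAlloc_v2@libcuda.so.1
EOF
```

For the sample input this labels __cudaRegisterFunction as an internal hook, cudaMalloc as Runtime API, and cuMemAlloc_v2 as Driver API, which is the class of call the restore path trips over.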

alexfrolov commented 5 months ago

Hi!

AFAIU, the call to __cudaRegisterFunction is inserted by nvcc when it generates the binary code. Is it possible to avoid it with some nvcc option?

n-eiling commented 5 months ago

Have you linked the CUDA libraries dynamically, i.e., using -cudart shared as an nvcc option? If I remember correctly, your error can happen when you link statically.
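A quick way to check which way a binary was linked (a generic ELF dependency check, nothing Cricket-specific; the helper name is my own):

```shell
# nvcc links cudart statically by default; rebuild with the shared runtime:
#   nvcc -cudart shared -o matmul test_apps/matmul.cu
# A dynamically linked binary lists libcudart among its shared-library
# dependencies, which a preloaded replacement library presumably needs in
# order to intercept the calls.
uses_shared_cudart() {
  if ldd "$1" 2>/dev/null | grep -q 'libcudart'; then
    echo "shared cudart"
  else
    echo "static or no cudart"
  fi
}

uses_shared_cudart /bin/ls    # /bin/ls has no CUDA linkage at all
```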

ya0guang commented 4 months ago

I also encounter this bug when compiling my simple CUDA application with the shared cudart library. It seems __cudaRegisterFunction is called via the runtime library:

$ nm matrixMult.bin | grep cuda
0000000000001be4 t _Z16cudaLaunchKernelIcE9cudaErrorPKT_4dim3S4_PPvmP11CUstream_st
0000000000005048 b _ZL20__cudaFatCubinHandle
0000000000005070 b _ZL20__cudaFatCubinHandle
0000000000005050 b _ZL22__cudaPrelinkedFatbins
0000000000001b86 t _ZL24__sti____cudaRegisterAllv
000000000000191d t _ZL26__cudaUnregisterBinaryUtilv
0000000000001b20 t _ZL31__nv_cudaEntityRegisterCallbackPPv
0000000000005080 b _ZZL31__nv_cudaEntityRegisterCallbackPPvE5__ref
                 U __cudaInitModule@libcudart.so.12
                 U __cudaPopCallConfiguration@libcudart.so.12
                 U __cudaPushCallConfiguration@libcudart.so.12
                 U __cudaRegisterFatBinary@libcudart.so.12
                 U __cudaRegisterFatBinaryEnd@libcudart.so.12
                 U __cudaRegisterFunction@libcudart.so.12
0000000000001329 t __cudaUnregisterBinaryUtil
                 U __cudaUnregisterFatBinary@libcudart.so.12
                 U cudaFree@libcudart.so.12
                 U cudaLaunchKernel@libcudart.so.12
                 U cudaMalloc@libcudart.so.12
                 U cudaMemcpy@libcudart.so.1
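The U entries versioned against libcudart.so.* show the binary is indeed linked against the shared runtime (with -cudart static those hooks would appear as defined local t/T symbols instead of undefined imports), so the static-linking explanation from the earlier comment does not seem to apply here. A quick check along those lines (the grep pattern is my own; feed it real `nm` output instead of the sample here-doc):

```shell
# An undefined (U) symbol versioned against libcudart.so.* is resolved from
# the shared runtime at load time, i.e. the binary uses -cudart shared.
if grep -q 'U __cudaRegisterFunction@libcudart\.so' <<'EOF'
                 U __cudaRegisterFunction@libcudart.so.12
0000000000001329 t __cudaUnregisterBinaryUtil
EOF
then
  echo "shared cudart: __cudaRegisterFunction resolved at load time"
else
  echo "no dynamic __cudaRegisterFunction found"
fi
```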