CRAC restarts lammps failed after checkpointing

GoodKairos commented 2 years ago

Dear Sir,

I have a question about a failed LAMMPS restart with CRAC. There are three steps:

(1) Compile LAMMPS accelerated with the GPU package

a. cmake ../cmake -DPKG_GPU=on -DGPU_API=cuda -DGPU_PREC=double -DCUDPP_OPT=no -DCUDA_MPS_SUPPORT=yes -DBUILD_MPI=yes -DBUILD_OMP=yes -DGPU_ENABLE_CUDA_UVM=on -DPKG_MANYBODY=yes -DPKG_MOLECULE=yes -DPKG_MISC=yes -DPKG_KSPACE=yes -DPKG_USER-REAXC=yes -DBUILD_SHARED_LIBS=yes -DCMAKE_INSTALL_PREFIX=/home/lammps/lammps_install

b. make -j10 && make install

(2) Checkpoint LAMMPS with CRAC

a. Run the coordinator:

/home/CRAC-early-development-master/bin/dmtcp_coordinator --port 7790

b. Run dmtcp_launch:

/home/CRAC-early-development-master/bin/dmtcp_launch --cuda --interval 1 --coord-host 127.0.1.1 --coord-port 7790 --kernel-loader /home/CRAC-early-development-master/contrib/split-cuda/kernel-loader.exe --target-ld /usr/local/glibc-2.31/lib/ld-linux-x86-64.so.2 --with-plugin /home/CRAC-early-development-master/contrib/split-cuda/libdmtcp_split-cuda.so -j /home/lammps/lammps_install/bin/lmp -sf gpu -pk gpu 1 -in /home/lammps/mylammps/examples/balance/in.balance

Checkpointing succeeds.

But restarting LAMMPS with CRAC fails. My commands are as follows:

a. Copy kernel-loader.exe to the current directory:

cp /home/CRAC-early-development-master/contrib/split-cuda/kernel-loader.exe .

b. Run dmtcp_restart:

/home/CRAC-early-development-master/bin/dmtcp_restart --cuda --coord-host 127.0.1.1 --coord-port 7790 ckptkernel-loader.exe*.dmtcp

The source code in LAMMPS is as follows:

    CUresult err = cuInit(0);
    sleep(5); // give enough time to checkpoint
    if (err == CUDA_SUCCESS)
        checkCudaErrors(cuDeviceGetCount(&deviceCount));
    printf("> device count = %d\n", deviceCount);

    err = cuDriverGetVersion(&driverVersion);
    printf("> Get driver Version: %d\n", driverVersion);

    // get first CUDA device
    err = cuDeviceGet(&device, 0);
    if (err == CUDA_SUCCESS) {
        printf("> Get device[1]: %d.\n", device);
    }

    char name[100];
    err = cuDeviceGetName(name, 100, device);
    if (err != CUDA_SUCCESS) {
        printf("> Get name, ret = %d.\n", err);
    }
    printf("> Using device 0: %s\n", name);
    err = cuDeviceTotalMem(&totalGlobalMem, device);
    printf("> Get total memory, ret = %d.\n", err);
    printf(" Total amount of global memory: %llu bytes\n", (unsigned long long)totalGlobalMem);

The error is as follows:

    Called at file 'cuDeviceGetName' in line.
    > Get name, ret = 304.
    > Using device 0: ^
    Called at file 'cuDeviceTotalMem_v2' in line.
    In logAPI cuDeviceTotalMem_v2:start
    > Get total memory, ret = 2.
     Total amount of global memory: 0 bytes
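A side note on the output above: the stray character after "Using device 0:" is consistent with the name buffer being printed even though cuDeviceGetName failed, in which case the buffer may never have been filled. This is an assumption, not something the log confirms. A minimal sketch of a guarded variant follows. fake_get_name is a hypothetical stand-in for the real driver call so that the example builds without CUDA, and get_name_checked is likewise an illustrative name, not part of LAMMPS or CRAC.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for cuDeviceGetName (NOT the real driver API).
 * On success it fills buf and returns 0; on simulated failure it returns
 * 304 and leaves buf untouched, as a failing call might. */
int fake_get_name(char *buf, size_t len, int simulate_failure)
{
    if (simulate_failure)
        return 304;
    snprintf(buf, len, "%s", "Example GPU");
    return 0;
}

/* Always leaves a printable, NUL-terminated string in buf.
 * Returns the error code of the underlying call (0 on success). */
int get_name_checked(char *buf, size_t len, int simulate_failure)
{
    int err;

    if (len == 0)
        return -1;
    buf[0] = '\0';                      /* never print an uninitialized buffer */
    err = fake_get_name(buf, len, simulate_failure);
    if (err != 0)
        snprintf(buf, len, "<unavailable, ret = %d>", err);
    return err;
}
```

With the real driver API, the same pattern would amount to zero-initializing name (for example char name[100] = "";) and printing it only when cuDeviceGetName returns CUDA_SUCCESS.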

cuInit(0) is among the calls that are logged and replayed. After cuInit(0) is replayed during the CRAC restart, cuDeviceGetName returns CUDA_ERROR_OPERATING_SYSTEM (304) and cuDeviceTotalMem returns CUDA_ERROR_OUT_OF_MEMORY (2). I do not know the root cause of this failure when restarting LAMMPS. Could you help me? Thank you very much.
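For readers matching the numeric return values in the log to their names, here is a small sketch that maps a few CUresult codes to their enumerator names. It is written without cuda.h so that it builds anywhere. cu_result_name is a hypothetical helper, and the table is only a subset of the values defined in cuda.h. When the driver API is available, cuGetErrorName serves the same purpose.

```c
#include <stddef.h>

/* Hypothetical helper: maps a numeric CUresult value to its enumerator
 * name. Only a small subset of the codes from cuda.h is listed here. */
const char *cu_result_name(int code)
{
    switch (code) {
    case 0:   return "CUDA_SUCCESS";
    case 1:   return "CUDA_ERROR_INVALID_VALUE";
    case 2:   return "CUDA_ERROR_OUT_OF_MEMORY";
    case 3:   return "CUDA_ERROR_NOT_INITIALIZED";
    case 4:   return "CUDA_ERROR_DEINITIALIZED";
    case 100: return "CUDA_ERROR_NO_DEVICE";
    case 101: return "CUDA_ERROR_INVALID_DEVICE";
    case 201: return "CUDA_ERROR_INVALID_CONTEXT";
    case 304: return "CUDA_ERROR_OPERATING_SYSTEM";
    default:  return "UNKNOWN_CUresult";   /* not in this subset */
    }
}
```

Calling cu_result_name(304) and cu_result_name(2) yields the two names reported above.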

gc00 commented 2 years ago

Hi Twinkle, do you have any advice for GoodKairos?

@GoodKairos, Please be aware that the CRAC project was initially done as an academic prototype. We are currently working on polishing it to make it more production-worthy. But it is still a work in progress.

Best,

Gene

GoodKairos commented 2 years ago

Hi Mr. Gene, thank you very much for your reply.

Kairos

DMTCP-CRAC / CRAC-early-development

CRAC restarts lammps failed after checkpointing #3