DMTCP-CRAC / CRAC-early-development


Failed to create checkpoint #9

Open avivMahulya opened 1 year ago

avivMahulya commented 1 year ago

I get an error when I try to create a checkpoint of my application, which uses CUDA 11.4 and TensorRT 8.4. My platform is an NVIDIA Jetson Xavier NX (ARM®v8.2, 64-bit) running Ubuntu 20.04.4 LTS.

I got the following error in the dmtcp_launch terminal:

[41000] ERROR at fileconnlist.cpp:396 in prepareShmList; REASON='JASSERT(Util::strEndsWith(area.name, DELETED_FILE_SUFFIX)) failed' area.name = /dmabuf:

The full log is attached: Checkpoint error dmtcp.txt

gc00 commented 1 year ago

@JainTwinkle, do you have any advice here?

avivMahulya commented 1 year ago

When I try to checkpoint a small Python script (without CUDA), I get the following error:

[40000] NOTE at writeckpt.cpp:263 in mtcp_writememoryareas; REASON='before calling to skip' (void *)area.addr = 0x400000 (void *)area.endAddr = 0x8ba000 area.size = 4956160

Segmentation fault

With DMTCP version 2.6, I was able to checkpoint and restore this Python script successfully.