Open Pigrenok opened 6 months ago
Hi @Pigrenok, apologies for taking so long to get a response to you.
The checkpointing feature was initially only designed to be used for instances because of the ability of apptainer to join a running instance container and execute commands (like invoking dmtcp to create a checkpoint).
There are other ways to trigger a checkpoint event with dmtcp, like specifying an interval to automatically checkpoint or sending a signal to the process. Extending the ability to use checkpoints beyond instances was initially a goal of the first implementation, but I encountered several issues with the timer not correctly triggering in the various environments I tested against. I ultimately decided to limit the functionality to just manually triggered checkpoints on instances.
I am curious whether you can benefit from running a checkpoint at an interval in your specific container environment, so I would recommend attempting to build dmtcp into your application container and updating your runscript to use dmtcp_launch to start your process:
dmtcp_launch --coord-port 0 --interval <interval-in-seconds> --ckptdir <bound-from-host> --no-gzip --ckpt-open-files <app-args...>
Interval checkpointing may be more stable in newer versions of dmtcp and when dmtcp is built within the container environment instead of bound in from the host.
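As a rough sketch of what such a runscript could look like, the function below wraps the application command in the dmtcp_launch call shown above, with the interval option written in its long form, --interval. The function name and the CKPT_INTERVAL, CKPT_DIR and DRY_RUN variables are hypothetical names chosen for illustration, as are the one-hour default interval and the /ckpt default directory; it also assumes dmtcp_launch is on the PATH inside the container.

```shell
#!/bin/sh
# Sketch only: a helper a container runscript could call to start the
# application under dmtcp_launch with interval checkpointing.
# CKPT_INTERVAL, CKPT_DIR and DRY_RUN are hypothetical variable names.

run_with_interval_checkpoints() {
    # Checkpoint interval in seconds (assumed default: one hour).
    interval="${CKPT_INTERVAL:-3600}"
    # Directory bound in from the host where checkpoint images are written
    # (assumed default: /ckpt).
    ckptdir="${CKPT_DIR:-/ckpt}"

    # Prepend the launcher and its options to the application's arguments.
    set -- dmtcp_launch --coord-port 0 --interval "$interval" \
        --ckptdir "$ckptdir" --no-gzip --ckpt-open-files "$@"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the assembled command instead of running it.
        echo "$@"
    else
        # Replace the shell so the launched process receives signals directly.
        exec "$@"
    fi
}
```

A runscript could then end with a call such as run_with_interval_checkpoints <app> "$@", and the container could be started with something like CKPT_INTERVAL=600 apptainer run --bind <host-dir>:/ckpt <image>. Setting DRY_RUN=1 prints the assembled command, which makes it easy to check the flags before committing to a long run.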
Perhaps the solution to this should be to just document how to build dmtcp into a container.
That is more of a feature request/clarification rather than a bug.
So, I need to checkpoint a container that was started with the apptainer run ... command rather than the apptainer instance start/run command. The container runs a very long analysis, and it would be helpful to checkpoint it from time to time so that, if something goes wrong with the container runtime or host system, the analysis can be restarted from the last checkpoint instead of from the beginning.
The problem is that the apptainer instance run/start command works as a service (as expected) and I can see the issue with using apptainer run in this case as it runs in the foreground, but it is not possible to set up a checkpoint saving loop after launching the container unless it is sent to the background. But that will still solve both issues mentioned above...
Unfortunately, I do not see how a container run with apptainer run can be checkpointed, as this command does not have --dmtcp_... options to launch or restart it and thus does not allow a checkpoint location to be associated with it.
Is it even possible to do it that way?
Thank you very much in advance.