apptainer / apptainer

Apptainer: Application containers for Linux
https://apptainer.org

Checkpointing of containers run with `apptainer run` #2186

Open Pigrenok opened 6 months ago

Pigrenok commented 6 months ago

This is more of a feature request or a request for clarification than a bug report.

I need to checkpoint a container that was started with the `apptainer run ...` command rather than the `apptainer instance start/run` command. The container runs a very long analysis, and it would be helpful to checkpoint it from time to time so that, if something goes wrong with the container runtime or the host system, the analysis can be restarted from the last checkpoint instead of from the beginning.

The problem is that the `apptainer instance run/start` command works as a service (as expected), and:

  1. It does not stop when the runscript finishes.
  2. Output of the runscript is not really accessible unless it is written to a file inside the runscript.

I can see the issue with using `apptainer run` in this case: it runs in the foreground, so it is not possible to set up a checkpoint-saving loop after launching the container unless it is sent to the background. Even so, using `apptainer run` would still solve both issues mentioned above...

Unfortunately, I do not see how a container run with `apptainer run` can be checkpointed, as this command does not have the `--dmtcp_...` options to launch or restart it and thus does not allow a checkpoint location to be associated with it.

Is it even possible to do it that way?

Thank you very much in advance.

ikaneshiro commented 1 month ago

Hi @Pigrenok, apologies for taking so long to get a response to you.

The checkpointing feature was initially only designed to be used for instances because of the ability of apptainer to join a running instance container and execute commands (like invoking dmtcp to create a checkpoint).

There are other ways to trigger a checkpoint event with dmtcp, such as specifying an interval for automatic checkpoints or sending a signal to the process. Extending checkpoints beyond instances was a goal of the first implementation, but I encountered several issues with the timer not triggering correctly in the various environments I tested against, and I ultimately decided to limit the functionality to manually triggered checkpoints on instances.

I am curious whether you can benefit from checkpointing at an interval in your specific container environment, so I would recommend building dmtcp into your application container and updating your runscript to use `dmtcp_launch` to start your process:

dmtcp_launch --coord-port 0 --interval <interval-in-seconds> --ckptdir <bound-from-host> --no-gzip --ckpt-open-files <app-args...>

It may be more stable in newer versions and when built within the container environment instead of bound in from the host.
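
As a rough illustration of the runscript change suggested above, a wrapper along these lines might work. It is only a sketch under assumptions: the function name `dmtcp_wrap`, the environment variables `CKPT_DIR`, `CKPT_INTERVAL` and `DRY_RUN`, and the default values are all hypothetical and not part of Apptainer or dmtcp. It also assumes `dmtcp_launch` is installed inside the container and that the checkpoint directory is bound in from the host.

```shell
#!/bin/sh
# Hypothetical runscript helper: starts the application under dmtcp_launch
# with interval-based checkpointing. All names and defaults below are
# illustrative assumptions, not Apptainer or dmtcp conventions.

dmtcp_wrap() {
    # Checkpoint directory inside the container (assumed to be a host bind).
    ckpt_dir="${CKPT_DIR:-/ckpt}"
    # Seconds between automatic checkpoints (assumed default: one hour).
    interval="${CKPT_INTERVAL:-3600}"

    # Prepend the dmtcp_launch invocation to the application's own arguments.
    set -- dmtcp_launch --coord-port 0 --interval "$interval" \
        --ckptdir "$ckpt_dir" --no-gzip --ckpt-open-files "$@"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it, for inspection.
        printf '%s\n' "$*"
    else
        # Replace the shell so the runscript ends when the application ends.
        exec "$@"
    fi
}
```

A runscript using this helper would end with a call such as `dmtcp_wrap <app-args...>`, and the container could then be started with `apptainer run` and a bind mount for the checkpoint directory. Setting `DRY_RUN=1` only prints the command that would be executed. Restarting from a saved checkpoint is not covered by this sketch.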

DrDaveD commented 1 month ago

Perhaps the solution to this should be to just document how to build dmtcp into a container.