checkpointing with CRIU

remyd1 commented 7 years ago

A new feature has been requested:

Including CRIU in singularity for a full freezing and restoring process.

It is already done in docker, LXD, OpenVZ. It can also be used for live migration.

This would be a great function.

Best regards, Rémy

olifre commented 6 years ago

I found this related CRIU mailing list discussion: https://lists.openvz.org/pipermail/criu/2017-August/039133.html This is about checkpointing a full container from "outside" with CRIU, which apparently has some issues as of now.

oschulz commented 6 years ago

Snapshot/restore of Singularity containers via CRIU would be an awesome feature. :-)

vsoch commented 6 years ago

I think there are too many things on this list https://criu.org/What_cannot_be_checkpointed#Cannot_be_dumped_.28yet.29 to make that easily do-able, but perhaps some limited portion could be checkpointed? What does a checkpoint provide over just having the singularity image itself?

olifre commented 6 years ago

Hi @vsoch , I fully agree making it work is not easy - CRIU is a project with a long history and is being used with Docker, LXC etc. It's far from working perfectly out of the box, but it's improving step by step.

What does a checkpoint provide over just having the singularity image itself?

The concepts are completely orthogonal. A container image contains a runtime environment. Provided the container runtime (e.g. singularity) is installed on a site and resources are available, this allows computing within the user-defined environment. So I'd say having the image and singularity provides mobility of the possibility to compute. The next step is to achieve real mobility of compute. For this, it is necessary to be able to pause / kill a running compute job, migrate it to another machine, and continue from the point where the calculation was stopped.

In VM terms, this would relate to "the VM image" (= the container image) and the possibility to take a VM-snapshot (including memory and state) to be able to perform (live or offline) migration.

Only with this capability, full mobility of compute is achieved and it becomes feasible to use short-lived opportunistic resources for long-running compute jobs. Even on HPC farms, this would be good to have, to be able to preempt jobs to quickly free slots for jobs with high CPU count requirements.

So in short, checkpointing is orthogonal to having a container image, and a full checkpoint (nicest would be to be able to do that from outside the container) is required for full mobility of compute.

vsoch commented 6 years ago

I understand the distinction, thank you for the detail! If a running process is akin to time, then pausing it is akin to controlling time. And wow, that would be impressive - I do hope we can get to some reality like that.

oschulz commented 6 years ago

I fully agree with @olifre - this would give us mobility of compute in, literally, another dimension (time). Imagine starting a job on your puny laptop and then transferring it to a beefy machine when you arrive at the office. :-)

For scientific computing I see two main applications:

1. Running computations on dynamic clusters (e.g. including office machines which can join the cluster if idle and leave it again) via HTCondor and similar. So, as @olifre said, using short-lived opportunistic resources for long-running compute jobs.

   HTCondor, for example, can kill a job when the machine it's running on is used for something with higher priority, but can also restart it later (possibly on another machine) - iff a checkpointing mechanism is in place. HTCondor has become very popular in high-energy physics, for example - but a checkpointing mechanism is often lacking, because the software doesn't support it. Singularity with CRIU would be a very elegant solution - completely transparent to a whole (potentially very complex) software stack. Of course CRIU itself may fail to dump one of the many things involved, but it would be a start.
2. Guarding long-running calculations against machine failures, reboots, etc., without writing custom checkpointing code.
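
For contrast, here is a minimal sketch of what "custom checkpointing code" tends to look like when the application has to do it itself. Everything in it is hypothetical (the function names, the JSON state layout, the `stop_after` parameter that simulates preemption); it only illustrates the pattern that a transparent CRIU-based checkpoint would make unnecessary.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_step": 0, "total": 0}


def run(path, n_steps, stop_after=None):
    """Sum 0..n_steps-1, checkpointing after every step.

    stop_after simulates preemption: the function returns None after that
    many steps in this invocation, as if the batch system had killed the job.
    """
    state = load_checkpoint(path)
    done_here = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and done_here >= stop_after:
            return None  # "killed"; progress so far is already on disk
        state["total"] += state["next_step"]
        state["next_step"] += 1
        save_checkpoint(path, state)
        done_here += 1
    return state["total"]
```

Calling `run("ckpt.json", 10, stop_after=4)` and then `run("ckpt.json", 10)` resumes at step 4 instead of starting over. The catch is that every application in the stack needs code like this, which is exactly what a container-level checkpoint would avoid.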

jmjw commented 6 years ago

Hi, how would you handle distributed applications, e.g. those using MPI? There you have the MPI state in addition to the application state. In the worst case, a transmission could be in flight on the cables/switches at the moment of checkpointing. How would you handle that? Cheerio, Jan -- Jan Wender - j.wender@web.de

planetA commented 6 years ago

One option is to stop the application (in the sense of SIGSTOP), and then drain the queues on all the ranks to reach a quiescent network state. You do not deliver the drained messages to the application, but just keep them in an intermediate buffer.

Now you can checkpoint and restore the application, replay the recorded messages and let the application continue running.

This is how DMTCP, and some other mechanisms, perform InfiniBand migration.

You would need some support from within the container. But OpenMPI, for example, already has plugins for checkpoint/restart.
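
To make the drain-and-replay idea concrete, here is a toy simulation. It is not how DMTCP or any MPI library actually implements it; `Rank`, `checkpoint` and `restore` are hypothetical names, and the "network" is just a queue of `(destination, message)` pairs.

```python
from collections import deque


class Rank:
    """Toy stand-in for one MPI rank: application state plus a replay buffer."""

    def __init__(self, rank_id):
        self.rank_id = rank_id
        self.received = []        # messages the application has consumed
        self.buffered = deque()   # drained messages held back from the application
        self.stopped = False

    def deliver(self, message):
        if self.stopped:
            self.buffered.append(message)  # hold in the intermediate state
        else:
            self.received.append(message)


def checkpoint(ranks, in_flight):
    """Stop all ranks, drain in-flight messages, and snapshot everything."""
    for rank in ranks.values():
        rank.stopped = True
    while in_flight:  # drain until the network is quiescent
        dest, message = in_flight.popleft()
        ranks[dest].deliver(message)
    return {
        rank_id: {"received": list(rank.received), "buffered": list(rank.buffered)}
        for rank_id, rank in ranks.items()
    }


def restore(snapshot):
    """Rebuild the ranks and replay the recorded messages to the application."""
    ranks = {}
    for rank_id, saved in snapshot.items():
        rank = Rank(rank_id)
        rank.received = list(saved["received"])
        for message in saved["buffered"]:
            rank.deliver(message)  # replayed in the original order
        ranks[rank_id] = rank
    return ranks
```

The point of the sketch is the ordering: nothing is snapshotted until the in-flight queue is empty, so no message can be lost on the wire, and the application only sees the buffered messages after the restore.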

-- Regards, Maksym Planeta

Maaarcocr commented 6 years ago

Is this still a desired feature? I would love to be able to do this and I may be able to spend some time trying to implement what is required to let CRIU checkpoint and restore a singularity container, but I'm not sure where to start from.

GodloveD commented 6 years ago

Heya @Maaarcocr! Yeah I think there is a lot of interest in this. But I don't think anyone is working on it right now. If you want to give something a go that would be amazing!

Maaarcocr commented 6 years ago

@GodloveD I'm fairly new to singularity, what would be the best way to approach this?

planetA commented 6 years ago

Hello,

I'm also interested in this feature. And would love to contribute, especially if there is some mentoring from your side, because I'm new to singularity.

-- Regards, Maksym Planeta

olifre commented 6 years ago

I'm also very interested in the feature, but currently do not have capacity to work on it. I would say it's certainly best to start with testing CRIU in four configurations:

1. Run CRIU inside of singularity, checkpointing the full process tree inside.
2. Run singularity inside of CRIU, i.e. try to checkpoint the full singularity process tree.

I would test these two configurations first with user namespaces (i.e. non-setuid) on a modern kernel, since CRIU should officially support those, and then with setuid, hence "4 configurations". I expect 1. is easier to get to work, since CRIU should not have to do anything special for singularity at all.

Then, I'd look at what exactly fails. Since CRIU should support every application, it might be best to understand the issues of CRIU first, and only in a second step (after attempting to fix things in CRIU / the kernel checkpointing / restore feature) see if Singularity can be adapted to make checkpointing easier / integrate with CRIU. In any case, the first sensible steps don't require any knowledge about Singularity, I'd say, only how to use and configure it. The significantly larger issue is to understand checkpoint / restore, and what is missing there.
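
A rough sketch of how that test matrix could be written down, in case it helps someone get started. It only builds command lines and does not run anything. The assumptions: `singularity exec --userns` selects the non-setuid path (otherwise the default setuid install is used), and `criu dump` / `criu restore` are called with `--tree`, `--images-dir` and `--shell-job`. The function names, image name and workload are hypothetical, and whether any of the four combinations actually succeeds is exactly what would need testing.

```python
from itertools import product


def criu_dump(pid, images_dir):
    """Build a `criu dump` command for the process tree rooted at pid."""
    return ["criu", "dump", "--tree", str(pid),
            "--images-dir", images_dir, "--shell-job"]


def criu_restore(images_dir):
    """Build the matching `criu restore` command."""
    return ["criu", "restore", "--images-dir", images_dir, "--shell-job"]


def plan_configurations(image, workload):
    """Return the four test configurations as dicts; nothing is executed.

    criu_runs: "inside" means criu is invoked from within the container,
               "outside" means it is invoked on the host against the
               singularity process tree.
    mode:      "userns" adds --userns to the start command; "setuid" relies
               on the default setuid installation.
    """
    configs = []
    for where, userns in product(("inside", "outside"), (True, False)):
        start = ["singularity", "exec"]
        if userns:
            start.append("--userns")
        start.append(image)
        start.extend(workload)
        configs.append({
            "criu_runs": where,
            "mode": "userns" if userns else "setuid",
            "start": start,
        })
    return configs


if __name__ == "__main__":
    for config in plan_configurations("image.sif", ["./job"]):
        print(config["criu_runs"], config["mode"], " ".join(config["start"]))
```

For each configuration you would start the workload, note the relevant PID, run the `criu_dump(...)` command from the place named in `criu_runs`, and record what exactly fails.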

emilydolson commented 6 years ago

Chiming in as another person who is super interested in seeing this feature exist! I took a stab at it a while ago but had to give up because the kernel my university's cluster runs on is currently too old for CRIU to work (as far as I can tell, at least).

oschulz commented 6 years ago

I'll just chime in, it would be an awesome feature!

remyd1 commented 6 years ago

As I submitted the issue, I am also very interested in it. However, as with many other things, I do not have much free time for it...

olifre commented 6 years ago

I found the following sentence:

Support for Checkpoint Restart: Internal support for checkpoint-restarting for mobility of state

on a slide about Singularity 3.0 from @bauerm97 shown at the CernVM Users Workshop ( https://indico.cern.ch/event/608592/contributions/2830120/attachments/1592403/2520972/CernVM_Workshop.pdf ). What's that about?

GodloveD commented 6 years ago

@olifre it's on the roadmap. The basic idea is that one of the data objects within the SIF format could save the state of the container when it is paused. Then you can move your container to a new environment and Singularity will know how to start it again.

planetA commented 6 years ago

I previously expressed an intention to help out with CRIU checkpointing. Unfortunately, I was terribly overwhelmed with other work and could not get started on the CRIU work. Now I have more time and am eager to participate. I tried out some simple things, but CRIU does not simply make a dump, because of the complicated structure of namespaces.

I believe my work will be more efficient if somebody can give me some guidance how to proceed. Would anybody volunteer?

avagin commented 6 years ago

I believe my work will be more efficient if somebody can give me some guidance how to proceed. Would anybody volunteer?

@planetA I can give you a small intro.

mmore500 commented 5 years ago

For those looking to checkpoint-restart in Singularity containers, I got a minimal working example running using DMTCP. It successfully checkpoint-restarts a simple executable running inside a Singularity container on my local machine and on a CircleCI virtual machine. I'm still working on getting it running properly in a cluster computing context. You can find the source here and pull it down for an interactive demonstration. Anyways, hope this might be useful!

xinli-git commented 3 years ago

I'll chime in as well, it would be an awesome feature for many of the on-spot instances as well as preemption based HPC systems.

carterpeel commented 3 years ago

Hello,

This is a templated response that is being sent out to all open issues. We are working hard on 'rebuilding' the Singularity community, and a major task on the agenda is finding out what issues are still outstanding.

Please consider the following:

Is this issue a duplicate, or has it been fixed/implemented since being added?
Is the issue still relevant to the current state of Singularity's functionality?
Would you like to continue discussing this issue or feature request?

Thanks, Carter

olifre commented 3 years ago

@carterpeel This issue is still highly relevant and as you can find in the earlier comments, it's on the roadmap. So this templated comment is not really helpful, it does not even specify how to respond, nor does it ease reading through the issue.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had activity in over 60 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

vsoch commented 3 years ago

Don't close stalebot, the contributors to this issue have responded and it's the maintainers that have not.

pedroalvesbatista commented 3 years ago

@olifre We're looking into the issue carefully and will soon bring it to the community to discuss ways to better address it. Thank you for keeping up the interest in the subject.

DrDaveD commented 2 years ago

Transfer further discussion on this to apptainer/apptainer#16.

apptainer / singularity

checkpointing with CRIU #468