checkpoint-restore / criu

Checkpoint/Restore tool
criu.org

Optimizing the lazy migration process #2171

Open YTGhost opened 1 year ago

YTGhost commented 1 year ago

Description: We are interested in lazy container migration between hosts, so we have carried out some tests using CRIU.

In our test environment, we use CRIU for lazy migration between hosts. When the machines are idle, the maximum bandwidth for migration is 110MB/s. With a moderate level of memory access on the target machine, the memory page transfer bandwidth is about 80MB/s. With heavy memory access, the memory page transfer bandwidth drops dramatically to about 2MB/s.

So, what I'm asking is whether there are any optimisation points in this process that would bring the bandwidth as close as possible to the maximum when the process on the destination host is accessing memory frequently. In our experiments, we also found that the destination host did not seem to use its multiple CPU cores effectively during the migration. Is this an optimisation point?

adrianreber commented 1 year ago

For best results I would try to combine lazy (post-copy) and pre-copy migration.

I also think that there is no single best solution. It will always depend on the memory access pattern of your application which combination of pre-copy and post-copy migration should be used.

YTGhost commented 1 year ago

> For best results I would try to combine lazy (post-copy) and pre-copy migration.

Could you discuss in more detail how these two modes can be combined?

Of the two modes, we chose post-copy for these two reasons:

  1. Under conditions of high peak single instance dirty page rate (1100MB/s) and tight bandwidth resources (1000Mbps), considering that migrating multiple instances simultaneously would further exacerbate the problem of bandwidth tightness, the Pre-Copy solution will fail in the worst-case scenario. The Post-Copy solution eliminates the bandwidth for iterative memory page copying, which helps alleviate this problem.
  2. Considering the service downtime constraint (1s) and the runtime memory size of the instance (>20G), the Post-Copy solution can restore the instance quickly with only a small part of its memory. The Pre-Copy solution has no iterative restore: after the last round of memory page transmission, it needs to read all the data once to restore the instance, which causes a longer service downtime.
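
The first reason can be checked with a little arithmetic. The following is a minimal sketch of a simplified pre-copy model, not CRIU code: each round resends whatever was dirtied while the previous round was in flight. The function name `precopy_rounds`, the 100 MB stop threshold, the 50 MB/s "calm" dirty rate, and the exact 20 GiB size are assumptions chosen for illustration; the 1100MB/s dirty rate and the 1000Mbps link come from the list above.

```python
def precopy_rounds(mem_mb, dirty_mb_s, link_mb_s, max_rounds=30, stop_mb=100):
    """Return the data left to send (MB) after each pre-copy round."""
    remaining = mem_mb
    history = []
    for _ in range(max_rounds):
        round_time = remaining / link_mb_s  # seconds needed to send this round
        # Pages dirtied during the round must be resent; the dirty set
        # can never be larger than the instance's total memory.
        remaining = min(mem_mb, dirty_mb_s * round_time)
        history.append(remaining)
        if remaining <= stop_mb:  # small enough to stop and do the final dump
            break
    return history


LINK_MB_S = 1000 / 8   # 1000 Mbps is 125 MB/s
MEM_MB = 20 * 1024     # assumed 20 GiB instance

# Worst case from the thread: dirty rate far above link bandwidth.
worst = precopy_rounds(MEM_MB, dirty_mb_s=1100, link_mb_s=LINK_MB_S)
# Hypothetical calm workload: dirty rate below link bandwidth.
calm = precopy_rounds(MEM_MB, dirty_mb_s=50, link_mb_s=LINK_MB_S)

print("worst case rounds:", len(worst), "left:", worst[-1], "MB")
print("calm case rounds:", len(calm), "left:", round(calm[-1], 1), "MB")
```

In this model the remaining data shrinks by the factor `dirty_mb_s / link_mb_s` per round, so pre-copy only converges when the dirty rate stays below the link bandwidth. At 1100MB/s against 125 MB/s, every round ends with the whole instance dirty again.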

adrianreber commented 1 year ago

I think the biggest problem with relying only on post-copy is that you might end up with one TCP connection per page fault.

As far as I remember there are no optimizations around post-copy in CRIU. You could automatically fetch many pages from the source side upon a page fault, a kind of prefetching of a couple of pages when one is requested.

Do you have a real test case or a synthetic test case? I am not sure it makes sense to optimize on a synthetic test case.

The whole post-copy support has a lot of optimization potential. We never implemented any optimizations because each one is specific to a particular application.

Transferring pages via pre-copy in the beginning gives you the advantage that pages that have not changed do not need to go through a page fault, which will always be slow from my point of view.

YTGhost commented 1 year ago

Sorry for the late reply; I have been busy.

> Do you have a real test case or a synthetic test case? I am not sure it makes sense to optimize on a synthetic test case.

Our real scenario is extremely complex, so for now we have only tested what happens to the Post-Copy memory page transfer rate when the target host accesses memory frequently. That is, we are still in the preliminary stage of verifying whether this is feasible, but the measured transfer rate is really low (2MB/s), so we want to optimize it.

> I think the biggest problem with relying only on post-copy is that you might end up with one TCP connection per page fault.

If that's the case, this is indeed a big problem. Is there any way to optimize this problem? For example, can we reuse TCP connections? If the memory page transfer rate problem can be optimized, I think we would be happy to contribute the corresponding optimization code to CRIU!
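
To see why the connection question matters, here is a rough cost model, written as a sketch and not measured on CRIU. It assumes 4 KiB pages, faults that are served strictly one after another, one round trip for the TCP handshake plus one for the request and response, and a hypothetical round-trip time of 1 ms. Only the 110MB/s link figure comes from this thread; whether CRIU really opens one connection per fault is not confirmed here.

```python
PAGE_SIZE = 4096  # bytes; assumed 4 KiB pages


def fault_throughput_mb_s(rtt_s, link_mb_s, pages_per_request=1, new_connection=True):
    """Throughput (MB/s) when requests are served one at a time, in sequence."""
    payload = pages_per_request * PAGE_SIZE
    transfer_s = payload / (link_mb_s * 1e6)  # time on the wire
    # One round trip for request/response, plus one for the TCP handshake
    # if every request opens a fresh connection.
    round_trips = 2 if new_connection else 1
    return payload / (round_trips * rtt_s + transfer_s) / 1e6


RTT_S = 0.001  # hypothetical 1 ms round-trip time between the hosts
LINK = 110     # MB/s, the idle bandwidth reported above

per_conn = fault_throughput_mb_s(RTT_S, LINK, new_connection=True)
reused = fault_throughput_mb_s(RTT_S, LINK, new_connection=False)
batched = fault_throughput_mb_s(RTT_S, LINK, pages_per_request=64, new_connection=False)

print(f"new connection per fault: {per_conn:.1f} MB/s")
print(f"reused connection:        {reused:.1f} MB/s")
print(f"reused, 64 pages/request: {batched:.1f} MB/s")
```

Under these assumptions, reusing the connection roughly doubles the throughput, but the bigger gain comes from moving more pages per round trip, because latency rather than bandwidth dominates each single-page request. The result for the first case landing near 2 MB/s follows from the assumed RTT and is not evidence about the real setup.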

adrianreber commented 1 year ago

> If that's the case, this is indeed a big problem. Is there any way to optimize this problem? For example, can we reuse TCP connections? If the memory page transfer rate problem can be optimized, I think we would be happy to contribute the corresponding optimization code to CRIU!

Sure, that part can be optimized a lot. Currently, if I remember it correctly, it is unoptimized.

For each page fault going through userfaultfd you could automatically transfer multiple pages. Something like read ahead. If one page is requested you can automatically transfer the following couple of pages.
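
The read-ahead idea could look roughly like this. It is a minimal sketch of the page selection only, not CRIU's code: `readahead_range`, its parameters, and the default window of 16 pages are hypothetical names and values chosen for illustration. Pages are identified by index within a memory region.

```python
def readahead_range(fault_page, region_start, region_end, transferred, window=16):
    """Pick the pages to send for one fault.

    Returns the faulting page plus the pages that follow it, limited to
    `window` pages, clipped to the region [region_start, region_end) and
    skipping pages that were already transferred.
    """
    if not region_start <= fault_page < region_end:
        raise ValueError("fault outside of the region")
    last = min(fault_page + window, region_end)
    return [p for p in range(fault_page, last) if p not in transferred]


# Fault on page 10 of a 20-page region; page 12 was sent earlier.
print(readahead_range(10, 0, 20, transferred={12}, window=4))
# Fault near the end of the region: the window is clipped.
print(readahead_range(18, 0, 20, transferred=set(), window=4))
```

On the destination side, the `UFFDIO_COPY` ioctl of userfaultfd takes a length, so a contiguous run of prefetched pages can be installed with one call. Whether a fixed window helps depends on the access pattern: it suits sequential access and wastes bandwidth on random access.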

You could also implement something where the CRIU process on the source side always sends memory pages to the destination, even if nothing is requested. If there is a request, the request is handled, but as long as nothing is requested, pages are sent automatically.
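
A sketch of that scheduling policy on the source side, under the assumption that pages are sent in ascending order when idle. The class `PageSender` and its methods are hypothetical and only model the ordering decision; they do not describe how CRIU behaves today.

```python
from collections import deque


class PageSender:
    """Decides which page the source sends next: faults first, then background push."""

    def __init__(self, total_pages):
        self.total_pages = total_pages
        self.sent = set()        # pages already transferred
        self.requests = deque()  # pages requested by destination page faults
        self.cursor = 0          # next candidate for background push

    def request(self, page):
        self.requests.append(page)

    def next_page(self):
        # Requested pages have priority because a task is blocked on them.
        while self.requests:
            page = self.requests.popleft()
            if page not in self.sent:
                self.sent.add(page)
                return page, "fault"
        # Nothing requested: push the next page that was not sent yet.
        while self.cursor < self.total_pages:
            page = self.cursor
            self.cursor += 1
            if page not in self.sent:
                self.sent.add(page)
                return page, "push"
        return None  # everything has been transferred


sender = PageSender(total_pages=4)
order = [sender.next_page()]  # idle, so page 0 is pushed
sender.request(2)             # destination faults on page 2
sender.request(0)             # duplicate request for a page already sent
while (item := sender.next_page()) is not None:
    order.append(item)
print(order)
```

With this policy the link never sits idle between faults, so the total transfer time approaches memory size divided by bandwidth even when faults are rare, while a faulting task still waits for at most one page that is already in flight.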

It has been a couple of years since this code has been touched, so I am not 100% sure how it works, but I think it was implemented without any optimizations and you are welcome to introduce optimizations.
