checkpoint-restore / criu

Checkpoint/Restore tool
criu.org
Other
2.97k stars 596 forks source link

restore java process schedule task don't work #1719

Open kuailelijuan opened 2 years ago

kuailelijuan commented 2 years ago

Program description: jdk version:jdk1.8.0_60 key code:

long initPeriod = 0l;
            long retryPeriod = 30000l;
            Executors.newSingleThreadScheduledExecutor(new NamedThreadFactory("YPF-fixedRate", true)).scheduleAtFixedRate(new Runnable() {
                @Override
                public void run() {
                    try {
                        Thread thread = Thread.currentThread();
                        log.info("YPF fixedRate scheduled service!");
                        log.info(thread.toString());
                        log.info(thread.getId() + " is run.");
                    } catch (Throwable t) {
                        t.printStackTrace();
                    }
                }
            }, initPeriod, retryPeriod, TimeUnit.MILLISECONDS);
            Executors.newSingleThreadScheduledExecutor(new NamedThreadFactory("YPF-WithFixedDelay", true)).scheduleWithFixedDelay(new Runnable() {
                @Override
                public void run() {
                    try {
                        Thread thread = Thread.currentThread();
                        log.info("YPF WithFixedDelay scheduled service!");
                        log.info(thread.toString());
                        log.info(thread.getId() + " is run.");
                    } catch (Throwable t) {
                        t.printStackTrace();
                    }
                }
            }, initPeriod, retryPeriod, TimeUnit.MILLISECONDS);           

problem: In this java process, two time tasks are started, using two methods: scheduleAtFixedRate and scheduleWithFixedDelay. I found two problems.

  1. When using criu to execute restore, occasionally the scheduled task hangs and does not continue to work. For example: I used criu to restore 10 processes, on different machines, the process time tasks on 5 machines do not work, and the rest work normally. I do not know why.
  2. The recovered workable process, the scheduled task of scheduleAtFixedRate, will repeatedly execute the task from the output dump time to the current time. For example: dump executed at 16:37, restore executed at 18:37, scheduleAtFixedRate task executed once 30s, then scheduleAtFixedRate will be executed instantly (18:37 - 16:37) / 30 = 240 times. This is very dangerous.
adrianreber commented 2 years ago

That is an interesting situation, indeed. I am not sure, however, that CRIU can do anything about. This could be seen as a Java feature or a bug. Java assumes if two hours passed since the last check it has to run all missing runs. It probably should not.

Working on CRIU I am biased but this sounds like a thing Java needs to solve itself. You would probably have the same problem on a virtual machine that is paused for two hours and on unpause the clock jumps to the correct time. Java should not expect that the clock cannot jump.

kuailelijuan commented 2 years ago

@adrianreber It is an honor to receive your answer. I recently encountered these problems when experimenting with criu. Hope you can give me some guidance and advice. 1. During the process of live migration, it is likely to encounter the problems of the above java time schedule tasks. In addition, some changes in the physical state of the target process ip, as well as the cached data of the application process, sessionID, etc., related to the process logic may not work properly. In order to use the criu , additional development is required. Usually, these status points are difficult to be completely sorted out, and the workload of transformation is huge. How can the application system enjoy the convenience of criu at a lower cost. 2. What will be the future development trend and planning of criu. What areas will his application scenarios be in?

adrianreber commented 2 years ago

I don't think this is the right place to discuss such broad open questions. If you have a specific problem we are happy to help if possible but your question seems to be about almost everything related to CRIU.

avagin commented 2 years ago

@kuailelijuan what kernel do you use? Does it support time namespaces? What clocks is used to implement newSingleThreadScheduledExecutor? You can find this if you run your app under "strace -fo strace.log -s 1024".

kuailelijuan commented 2 years ago

I don't think this is the right place to discuss such broad open questions. If you have a specific problem we are happy to help if possible but your question seems to be about almost everything related to CRIU.

Thanks a lot for your answer. There is a problem I don't understand, why the java schedule timed task will not take effect during the process of hot migration. This is an accidental phenomenon, and not every node will appear. I am experimenting in some docker container, docker version: 17.03.2 The same node, if his schedule does not take effect, after I kill it and try again many times, it will not take effect also. On the contrary, if one node, the schedule can be executed normally, and it is normal to re-operate many times.

adrianreber commented 2 years ago

We are no experts how Java implements the scheduled task. It seems to look at the date and time and if you migrate to a system which has a clock in the past then it might skip a few scheduled tasks. You need to ensure that clocks are properly synced if your process depends on the wallclock. Again, this is a topic you need to talk to Java experts about it. We can only speculate.

For time sensitive migrations time namespaces can be a solution, but so far no container runtime is really using them. CRIU has the necessary support.

kuailelijuan commented 2 years ago

@kuailelijuan what kernel do you use? Does it support time namespaces? What clocks is used to implement newSingleThreadScheduledExecutor? You can find this if you run your app under "strace -fo strace.log -s 1024".

kernel : Linux version 4.18.0 i execute "strace -fo strace.log -s 1024 " get the following log :

10354 16:41:43 restart_syscall(<... resuming interrupted call ...> <unfinished ...>
10354 16:42:04 <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
10354 16:42:04 futex(0x7f0e6c002eb8, FUTEX_WAKE_PRIVATE, 1) = 0
10354 16:42:04 futex(0x7f0e6c002c54, FUTEX_WAIT_BITSET_PRIVATE, 17241, {9351494, 549672793}, ffffffff <unfinished ...>
10354 16:42:04 <... futex resumed> )    = 0
10354 16:42:04 futex(0x7f0e6c002c28, FUTEX_WAKE_PRIVATE, 1) = 0
10354 16:42:04 futex(0x7f0e6c002c54, FUTEX_WAIT_PRIVATE, 17243, NULL <unfinished ...>
10354 16:42:04 <... futex resumed> )    = 0
10354 16:42:04 futex(0x7f0e6c002c28, FUTEX_WAKE_PRIVATE, 1) = 0
10354 16:42:04 futex(0x7f0e6c002c54, FUTEX_WAIT_PRIVATE, 17245, NULL <unfinished ...>
10354 16:42:04 <... futex resumed> )    = 0
10354 16:42:04 futex(0x7f0e6c002c28, FUTEX_WAKE_PRIVATE, 1) = 0
10354 16:42:04 write(1, "YPF-WithFixedDelay-15-T-1 2022-01-14 16:42:04,166 com.my.timer.Controller$2.run(TimerController.java:479) [185339417]   INFO - ch.web.action.TimerController - YPF WithFixedDelay scheduled service! \n", 218) = 218
10354 16:42:04 write(62, "YPF-WithFixedDelay-15-T-1 2022-01-14 16:42:04,166 com.my.timer.Controller$2.run(TimerController.java:479) [185339417]   INFO - ch.web.action.TimerController - YPF WithFixedDelay scheduled service! \n", 218) = 218
10354 16:42:04 write(1, "YPF-WithFixedDelay-15-T-1 2022-01-14 16:42:04,167 com.my.timer.Controller$2.run(TimerController.java:480) [185339418]   INFO - ch.web.action.TimerController - Thread[YPF-WithFixedDelay-15-T-1,5,main] \n", 221) = 221
10354 16:42:04 write(62, "YPF-WithFixedDelay-15-T-1 2022-01-14 16:42:04,167 com.my.timer.Controller$2.run(TimerController.java:480) [185339418]   INFO - ch.web.action.TimerController - Thread[YPF-WithFixedDelay-15-T-1,5,main] \n", 221) = 221
10354 16:42:04 write(1, "YPF-WithFixedDelay-15-T-1 2022-01-14 16:42:04,167 com.my.timer.Controller$2.run(TimerController.java:481) [185339418]   INFO - ch.web.action.TimerController - 87 is run. \n", 191) = 191
10354 16:42:04 write(62, "YPF-WithFixedDelay-15-T-1 2022-01-14 16:42:04,167 com.my.timer.Controller$2.run(TimerController.java:481) [185339418]   INFO - ch.web.action.TimerController - 87 is run. \n", 191) = 191
10354 16:42:04 futex(0x7f0e6c002ee4, FUTEX_WAIT_BITSET_PRIVATE, 1, {9351524, 550045461}, ffffffff <unfinished ...>
10354 16:42:34 <... futex resumed> )    = -1 ETIMEDOUT (Connection timed out)
10354 16:42:34 futex(0x7f0e6c002eb8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
10354 16:42:34 <... futex resumed> )    = 0
10354 16:42:34 futex(0x7f0e6c002c54, FUTEX_WAIT_BITSET_PRIVATE, 17247, {9351524, 551365667}, ffffffff <unfinished ...>
10354 16:42:34 <... futex resumed> )    = 0
10354 16:42:34 futex(0x7f0e6c002c28, FUTEX_WAKE_PRIVATE, 1) = 0
10354 16:42:34 futex(0x7f0e6c002c54, FUTEX_WAIT_PRIVATE, 17249, NULL <unfinished ...>
10354 16:42:34 <... futex resumed> )    = 0
10354 16:42:34 futex(0x7f0e6c002c28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
10354 16:42:34 <... futex resumed> )    = 0
10354 16:42:34 futex(0x7f0e6c002c54, FUTEX_WAIT_PRIVATE, 17251, NULL <unfinished ...>
10354 16:42:34 <... futex resumed> )    = 0
10354 16:42:34 futex(0x7f0e6c002c28, FUTEX_WAKE_PRIVATE, 1) = 0
10354 16:42:34 write(1, "YPF-WithFixedDelay-15-T-1 2022-01-14 16:42:34,168 com.my.timer.Controller$2.run(TimerController.java:479) [185369419]   INFO - ch.web.action.TimerController - YPF WithFixedDelay scheduled service! \n", 218) = 218
10354 16:42:34 write(62, "YPF-WithFixedDelay-15-T-1 2022-01-14 16:42:34,168 com.my.timer.Controller$2.run(TimerController.java:479) [185369419]   INFO - ch.web.action.TimerController - YPF WithFixedDelay scheduled service! \n", 218) = 218
10354 16:42:34 write(1, "YPF-WithFixedDelay-15-T-1 2022-01-14 16:42:34,170 com.my.timer.Controller$2.run(TimerController.java:480) [185369421]   INFO - ch.web.action.TimerController - Thread[YPF-WithFixedDelay-15-T-1,5,main] \n", 221) = 221
10354 16:42:34 write(62, "YPF-WithFixedDelay-15-T-1 2022-01-14 16:42:34,170 com.my.timer.Controller$2.run(TimerController.java:480) [185369421]   INFO - ch.web.action.TimerController - Thread[YPF-WithFixedDelay-15-T-1,5,main] \n", 221) = 221
10354 16:42:34 write(1, "YPF-WithFixedDelay-15-T-1 2022-01-14 16:42:34,170 com.my.timer.Controller$2.run(TimerController.java:481) [185369421]   INFO - ch.web.action.TimerController - 87 is run. \n", 191) = 191
10354 16:42:34 write(62, "YPF-WithFixedDelay-15-T-1 2022-01-14 16:42:34,170 com.my.timer.Controller$2.run(TimerController.java:481) [185369421]   INFO - ch.web.action.TimerController - 87 is run. \n", 191) = 191
10354 16:42:34 futex(0x7f0e6c002ee4, FUTEX_WAIT_BITSET_PRIVATE, 1, {9351554, 552746621}, ffffffff <unfinished ...>
10354 16:43:04 <... futex resumed> )    = -1 ETIMEDOUT (Connection timed out)
10354 16:43:04 futex(0x7f0e6c002eb8, FUTEX_WAKE_PRIVATE, 1) = 0
10354 16:43:04 write(1, "YPF-WithFixedDelay-15-T-1 2022-01-14 16:43:04,170 com.my.timer.Controller$2.run(TimerController.java:479) [185399421]   INFO - ch.web.action.TimerController - YPF WithFixedDelay scheduled service! \n", 218) = 218
10354 16:43:04 write(62, "YPF-WithFixedDelay-15-T-1 2022-01-14 16:43:04,170 com.my.timer.Controller$2.run(TimerController.java:479) [185399421]   INFO - ch.web.action.TimerController - YPF WithFixedDelay scheduled service! \n", 218) = 218
10354 16:43:04 write(1, "YPF-WithFixedDelay-15-T-1 2022-01-14 16:43:04,171 com.my.timer.Controller$2.run(TimerController.java:480) [185399422]   INFO - ch.web.action.TimerController - Thread[YPF-WithFixedDelay-15-T-1,5,main] \n", 221) = 221
10354 16:43:04 write(62, "YPF-WithFixedDelay-15-T-1 2022-01-14 16:43:04,171 com.my.timer.Controller$2.run(TimerController.java:480) [185399422]   INFO - ch.web.action.TimerController - Thread[YPF-WithFixedDelay-15-T-1,5,main] \n", 221) = 221
10354 16:43:04 write(1, "YPF-WithFixedDelay-15-T-1 2022-01-14 16:43:04,171 com.my.timer.Controller$2.run(TimerController.java:481) [185399422]   INFO - ch.web.action.TimerController - 87 is run. \n", 191) = 191
10354 16:43:04 write(62, "YPF-WithFixedDelay-15-T-1 2022-01-14 16:43:04,171 com.my.timer.Controller$2.run(TimerController.java:481) [185399422]   INFO - ch.web.action.TimerController - 87 is run. \n", 191) = 191
10354 16:43:04 futex(0x7f0e6c002ee4, FUTEX_WAIT_BITSET_PRIVATE, 1, {9351584, 553452477}, ffffffff <unfinished ...>
10354 16:43:34 <... futex resumed> )    = -1 ETIMEDOUT (Connection timed out)
10354 16:43:34 futex(0x7f0e6c002eb8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
10354 16:43:34 <... futex resumed> )    = 0
10354 16:43:34 write(1, "YPF-WithFixedDelay-15-T-1 2022-01-14 16:43:34,171 com.my.timer.Controller$2.run(TimerController.java:479) [185429422]   INFO - ch.web.action.TimerController - YPF WithFixedDelay scheduled service! \n", 218) = 218
10354 16:43:34 write(62, "YPF-WithFixedDelay-15-T-1 2022-01-14 16:43:34,171 com.my.timer.Controller$2.run(TimerController.java:479) [185429422]   INFO - ch.web.action.TimerController - YPF WithFixedDelay scheduled service! \n", 218) = 218
10354 16:43:34 write(1, "YPF-WithFixedDelay-15-T-1 2022-01-14 16:43:34,171 com.my.timer.Controller$2.run(TimerController.java:480) [185429422]   INFO - ch.web.action.TimerController - Thread[YPF-WithFixedDelay-15-T-1,5,main] \n", 221) = 221
10354 16:43:34 write(62, "YPF-WithFixedDelay-15-T-1 2022-01-14 16:43:34,171 com.my.timer.Controller$2.run(TimerController.java:480) [185429422]   INFO - ch.web.action.TimerController - Thread[YPF-WithFixedDelay-15-T-1,5,main] \n", 221) = 221
10354 16:43:34 write(1, "YPF-WithFixedDelay-15-T-1 2022-01-14 16:43:34,172 com.my.timer.Controller$2.run(TimerController.java:481) [185429423]   INFO - ch.web.action.TimerController - 87 is run. \n", 191) = 191
10354 16:43:34 write(62, "YPF-WithFixedDelay-15-T-1 2022-01-14 16:43:34,172 com.my.timer.Controller$2.run(TimerController.java:481) [185429423]   INFO - ch.web.action.TimerController - 87 is run. \n", 191) = 191
10354 16:43:34 futex(0x7f0e6c002ee4, FUTEX_WAIT_BITSET_PRIVATE, 1, {9351614, 554179247}, ffffffff <unfinished ...>
304   16:43:36 futex(0x7f0f80114154, FUTEX_WAIT_BITSET_PRIVATE, 1, {9351587, 201035457}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
kuailelijuan commented 2 years ago

We are no experts how Java implements the scheduled task. It seems to look at the date and time and if you migrate to a system which has a clock in the past then it might skip a few scheduled tasks. You need to ensure that clocks are properly synced if your process depends on the wallclock. Again, this is a topic you need to talk to Java experts about it. We can only speculate.

For time sensitive migrations time namespaces can be a solution, but so far no container runtime is really using them. CRIU has the necessary support.

thanks.

rst0git commented 2 years ago

so far no container runtime is really using them.

crun does support time namespace (https://github.com/containers/crun/pull/257) and container checkpoint/restore, but Adrian is right. So far we have not seen many discussions about it.

github-actions[bot] commented 2 years ago

A friendly reminder that this issue had no activity for 30 days.

avagin commented 2 years ago

For time sensitive migrations time namespaces can be a solution, but so far no container runtime is really using them.

It isn't required. CRIU can restore processes in a new time namespace by default if the kernel allows doing that.

github-actions[bot] commented 2 years ago

A friendly reminder that this issue had no activity for 30 days.