garethgeorge / backrest

Backrest is a web UI and orchestrator for restic backup.
GNU General Public License v3.0
793 stars 31 forks source link

Hybrid-Sleep/Hibernation might break scheduling #172

Open ArthurusDent opened 3 months ago

ArthurusDent commented 3 months ago

Describe the bug It seems that after entering hybrid-sleep on my two Kubuntu 23.10 machines and turning them back on, backup scheduling stops to work at some point.

After a reboot, scheduling works as expected again.

To Reproduce I'm not entirely sure right now.

My machines are configured to enter hybrid-sleep after a certain amount of time. This means they enter hybrid-sleep for various amounts of time. Either when I take a longer break, like lunch, or when I don't return to work and cut the energy supply over night.

I don't know whether breaking of scheduling is dependent on how long the machine was off.

I have seen repeatedly that my two hourly backup plans (local and remote) often run once in the morning when I start the machines again and a next run for the backup plan is being scheduled but it never gets executed (see screenshot).

I'm using KDE's power management but if you need to test, those commands work on my system: $ sudo systemctl suspend $ sudo systemctl hibernate $ sudo systemctl hybrid-sleep

Expected behavior Backrest should not be thrown off by any energy savings measures.

Screenshots image

Desktop (please complete the following information):

Additional context backrest-scheduling.log

garethgeorge commented 3 months ago

Hey, thanks for the report. I was a bit worried about sleep behavior when writing the scheduler. Backrest is configured to poll the system clock every 15 minutes to account for time jumps (which is how a sleep / wake looks to the timers). But it sounds like you're saying that after a sleep it becomes hung and simply never runs any more operations?

Just to confirm -- the bug isn't simply a timer overrun but scheduling stops permanently? The screenshot doesn't make the time of the screenshot clear so it's a bit hard to tell how much it's overrun by. It'd also be great to know the rough times of sleep and wakeup so I can cross reference with the log. Lastly, I'm only seeing scheduling events in the log but not the task execution? Are those filtered out?

ArthurusDent commented 3 months ago

Hey, I looked at my second system yesterday and to be honest, to me the scheduling/execution behavior looks kind of erratic. I can't really predict when it will stop. It doesn't always seem to stop after resuming from hybrid-sleep. It also seems to start again sometimes but not always.

The screenshot is from 22-03-24, so backrest was stuck since 20-03-24.

Sorry, the logs are incomplete. I experimented with different commands for extraction and sent an incomplete one. Here is a just slightly edited version that includes backrest + hybrid-sleep + system start events, so you can tell when backrest started fresh and worked again.

Command: cat /var/log/syslog | grep 'hybrid-sleep\|backrest\|Linux version' > backrest_hybrid-sleep_system-start.log

Output: backrest_hybrid-sleep_system-start.log

garethgeorge commented 3 months ago

Ah, I see so you resumed from sleep a few times and backrest's scheduling loop looks completely hung. It never resumed scheduling tasks. It looks like you restarted it at which point scheduling resumes.

ArthurusDent commented 3 months ago

Exactly. If you look at the screenshot and take the last snapshot id c39ef56f (at 2024-03-20T09:36:21.453093+01:00) you see that right before creation the system resumed from hybrid-sleep. Backrest still does run some plans but scheduling doesn't happen any more and consequently no further plans are being executed.

Then at 2024-03-20T15:29:44.999662+01:00 you see that several hybrid-sleeps and resumes happened, which means backrest didn't run over two days and at 2024-03-22T10:45:30.522422+01:00 the system has rebooted and backrest starts scheduling and executing plans again.

garethgeorge commented 3 months ago

Okay, I've put out a new commit (https://github.com/garethgeorge/backrest/commit/5b7e2b080d31a2f77a5f9b6737dfbb84cfb63cce) which increases the polling frequency and ƒixes a missing log statement which will print whether the tasks are being rescheduled or not (right now we only see when a task runs but not if it was enqueued). If you have the ability to build backrest from source, you might try running this for a few days to see if it catches any scheduling mishap.

I increased the polling interval to every 5 minutes b/c honestly 15 is quite long. I see a few blocks of time where it looks like the computer woke up for very short durations and then was suspended again in these logs:

2024-03-20T09:51:33.934193+01:00 Hostname backrest[1663]: 2024-03-20T09:51:33.934+0100#011#033[34mINFO#033[0m#011scheduling task#011{"task": "forget for plan \"System-Raspi2\"", "runAt": "2024-03-20T09:51:33+01:00"}
2024-03-20T09:51:33.934243+01:00 Hostname backrest[1663]: 2024-03-20T09:51:33.934+0100#011#033[34mINFO#033[0m#011scheduling task#011{"task": "index snapshots for plan \"restic-backrest-raspi2\"", "runAt": "2024-03-20T09:51:33+01:00"}
2024-03-20T09:51:33.940197+01:00 Hostname backrest[1663]: 2024-03-20T09:51:33.940+0100#011#033[34mINFO#033[0m#011task finished#011{"task": "backup for plan \"System-Raspi2\"", "duration": "1m33.842470459s"}
2024-03-20T09:51:33.946185+01:00 Hostname backrest[1663]: 2024-03-20T09:51:33.946+0100#011#033[34mINFO#033[0m#011running task#011{"task": "forget for plan \"System-Raspi2\""}
2024-03-20T09:51:34.775294+01:00 Hostname backrest[1663]: 2024-03-20T09:51:34.775+0100#011#033[35mDEBUG#033[0m#011Forgot snapshots#011{"plan": "System-Raspi2", "count": 0, "policy": "keep_hourly:24  keep_daily:7  keep_weekly:6  keep_monthly:12  keep_yearly:1"}
2024-03-20T09:51:34.781519+01:00 Hostname backrest[1663]: 2024-03-20T09:51:34.781+0100#011#033[34mINFO#033[0m#011task finished#011{"task": "forget for plan \"System-Raspi2\"", "duration": "835.323148ms"}
2024-03-20T09:51:34.781574+01:00 Hostname backrest[1663]: 2024-03-20T09:51:34.781+0100#011#033[34mINFO#033[0m#011running task#011{"task": "index snapshots for plan \"restic-backrest-raspi2\""}
2024-03-20T09:51:35.592531+01:00 Hostname backrest[1663]: 2024-03-20T09:51:35.592+0100#011#033[35mDEBUG#033[0m#011Indexed known (and not forgotten) snapshot IDs for plan restic-backrest-raspi2 in 911.649µs
2024-03-20T09:51:35.599899+01:00 Hostname backrest[1663]: 2024-03-20T09:51:35.599+0100#011#033[35mDEBUG#033[0m#011Indexed snapshots#011{"repo": "restic-backrest-raspi2", "duration": "7.683602ms", "alreadyIndexed": 163, "newlyAdded": 1, "markedForgotten": 0}
2024-03-20T09:51:35.599982+01:00 Hostname backrest[1663]: 2024-03-20T09:51:35.599+0100#011#033[34mINFO#033[0m#011task finished#011{"task": "index snapshots for plan \"restic-backrest-raspi2\"", "duration": "818.363871ms"}
2024-03-20T15:29:44.999662+01:00 Hostname systemd[1]: Starting systemd-hybrid-sleep.service - Hybrid Suspend+Hibernate...
2024-03-20T15:29:45.189020+01:00 Hostname systemd-sleep[52379]: Entering sleep state 'hybrid-sleep'...
2024-03-21T09:28:47.215540+01:00 Hostname systemd[1]: systemd-hybrid-sleep.service: Deactivated successfully.
2024-03-21T09:28:47.215632+01:00 Hostname systemd[1]: Finished systemd-hybrid-sleep.service - Hybrid Suspend+Hibernate.
2024-03-21T09:28:47.215669+01:00 Hostname systemd[1]: systemd-hybrid-sleep.service: Consumed 6.321s CPU time.
2024-03-21T09:28:47.215871+01:00 Hostname systemd[1]: Reached target hybrid-sleep.target - Hybrid Suspend+Hibernate.
2024-03-21T09:28:47.251374+01:00 Hostname systemd[1]: Stopped target hybrid-sleep.target - Hybrid Suspend+Hibernate.
2024-03-21T14:58:59.675155+01:00 Hostname systemd[1]: Starting systemd-hybrid-sleep.service - Hybrid Suspend+Hibernate...
2024-03-21T14:58:59.901816+01:00 Hostname systemd-sleep[76676]: Entering sleep state 'hybrid-sleep'...
2024-03-21T19:26:06.004831+01:00 Hostname systemd[1]: systemd-hybrid-sleep.service: Deactivated successfully.
2024-03-21T19:26:06.004957+01:00 Hostname systemd[1]: Finished systemd-hybrid-sleep.service - Hybrid Suspend+Hibernate.
2024-03-21T19:26:06.005566+01:00 Hostname systemd[1]: systemd-hybrid-sleep.service: Consumed 8.359s CPU time.
2024-03-21T19:26:06.005640+01:00 Hostname systemd[1]: Reached target hybrid-sleep.target - Hybrid Suspend+Hibernate.
2024-03-21T19:26:06.043882+01:00 Hostname systemd[1]: Stopped target hybrid-sleep.target - Hybrid Suspend+Hibernate.
2024-03-21T19:32:22.478564+01:00 Hostname systemd[1]: Starting systemd-hybrid-sleep.service - Hybrid Suspend+Hibernate...
2024-03-21T19:32:22.761336+01:00 Hostname systemd-sleep[77305]: Entering sleep state 'hybrid-sleep'...
2024-03-22T09:08:25.066256+01:00 Hostname systemd[1]: systemd-hybrid-sleep.service: Deactivated successfully.
2024-03-22T09:08:25.066442+01:00 Hostname systemd[1]: Finished systemd-hybrid-sleep.service - Hybrid Suspend+Hibernate.
2024-03-22T09:08:25.066492+01:00 Hostname systemd[1]: systemd-hybrid-sleep.service: Consumed 4.275s CPU time.
2024-03-22T09:08:25.066799+01:00 Hostname systemd[1]: Reached target hybrid-sleep.target - Hybrid Suspend+Hibernate.
2024-03-22T09:08:25.090609+01:00 Hostname systemd[1]: Stopped target hybrid-sleep.target - Hybrid Suspend+Hibernate.
2024-03-22T10:45:04.439577+01:00 Hostname systemd[1]: Stopping backrest.service - Backrest Service...
2024-03-22T10:45:04.442812+01:00 Hostname backrest[1663]: 2024-03-22T10:45:04.442+0100#011#033[34mINFO#033[0m#011shutting down orchestrator loop, context cancelled.
2024-03-22T10:45:04.442862+01:00 Hostname backrest[1663]: 2024-03-22T10:45:04.442+0100#011#033[34mINFO#033[0m#011HTTP gateway shutdown

(Let me know if I'm reading that incorrectly)

The fact that backrest prints "shutting down orchestrator loop, context cancelled." means the orchestrator loop was still running after suspend. It's possible that, at least during that block of time, the computer was never awake long enough for backrest to detect the system wakeup and run a task.

ArthurusDent commented 3 months ago

Ok, cool! I'll try to find out how to build go projects from source.

There is only one gap that would be too small to fit into the 15 min window:

2024-03-21T19:26:06.043882+01:00 Hostname systemd[1]: Stopped target hybrid-sleep.target - Hybrid Suspend+Hibernate.
2024-03-21T19:32:22.478564+01:00 Hostname systemd[1]: Starting systemd-hybrid-sleep.service - Hybrid Suspend+Hibernate...

All the other gaps are at least one hour long.

This I found weird. The system had resumed from hybrid-standby, backrest had executed the plans but then backrest suddenly stops working for almost six hours until the system enters hybrid-sleep again:

2024-03-20T09:51:35.599982+01:00 Hostname backrest[1663]: 2024-03-20T09:51:35.599+0100#011#033[34mINFO#033[0m#011task finished#011{"task": "index snapshots for plan \"restic-backrest-raspi2\"", "duration": "818.363871ms"}
2024-03-20T15:29:44.999662+01:00 Hostname systemd[1]: Starting systemd-hybrid-sleep.service - Hybrid Suspend+Hibernate...
ArthurusDent commented 3 months ago

Ok, looks like I managed to build the binary and installed it correctly. Will start to monitor over the next few days.

garethgeorge commented 3 months ago

Hey, thanks for giving this a go! I was going to say that I'd also aim to get a release out this week with the logs, but seeing as you got it building if there are any insights from new logs I'd be really happy to try to patch this for the next release.

So far I've spent some time staring at the scheduling logic but I'm not quite sure what could be going on w/your suspend / resume case. It'll be very telling if we see that tasks are getting enqueued in the scheduler but just never fire after the suspend, that'll confirm something's wrong with how timers are firing.

ArthurusDent commented 3 months ago

It's currently 11:00 PM/23:00 on 2024-03-25 and backrest has stopped working.

I have two systems where this has happened with the dev-snapshot in place but this post is only about System1. System2 I might post tomorrow.

backrest-System1.log

System1 most likely has been using the dev-snapshot from this point in time on:

2024-03-24T15:57:29.332245+01:00 System1 systemd[1]: Started backrest.service - Backrest Service.

System1 resumed at 3:43 PM:

2024-03-25T15:43:10.883609+01:00 System1 kernel: [69032.988514] PM: hibernation: hibernation exit

Backrest then immediately ran the two Home-plans that it might have scheduled the evening before but wasn't able to run because the system entered hybrid-sleep before they could be executed. After both of those plans ran, neither were the hourly Home plans executed again, nor were the two daily System plans being executed (screenshots/logs).

Schedule: Home-Local: 00:06 (hourly) Home-RaspberryPi2: 00:16 (hourly) System-Local: 10:26 PM/22:26 (daily) System-RaspberryPi2: 10:46 PM/22:46 (daily)

I haven't looked at the logs very thoroughly, though. Maybe you can still find something interesting.

grafik

grafik

grafik

grafik

garethgeorge commented 3 months ago

Okay, my leading theory at this point is some kind of deadlock in the scheduler. Thanks for providing all of those logs, they're definitely illuminating. The tasks are all getting scheduled which means the scheduler itself is becoming hung at somepoint. I wonder if related to the hybernation or if it's just coincidental -- it may actually be triggered by the time jump.

Try starting backrest with GOTRACEBACK=all set as an env variable, next time you see it hung terminate with SIGQUIT and let's try to get a dump of the inflight coroutines. https://stackoverflow.com/questions/19094099/how-to-dump-goroutine-stacktraces shows a bit of how to do this.

In the mean time, I'm going to try to add some new tests that might cover the situation you're running into (and also see if I can replicate by just suspending a backrest process for some duration of time).

Thanks again for working with on me to investigate here!

ArthurusDent commented 3 months ago

Happy to help!

I'll take a look at enabling traceback.

Here is something unexpected that happened yesterday on System2. Initially it was also stuck but then the auth cookie seemed to have expired and I had to re-login. This made backrest unstuck and it is still executing the backup plans the next day after another hybrid-sleep.

2024-03-26T08:59:29.785751+01:00 Hostname systemd[1]: Stopped target hybrid-sleep.target - Hybrid Suspend+Hibernate.
2024-03-26T10:42:27.127755+01:00 Hostname backrest[8892]: 2024-03-26T10:42:27.127+0100#011#033[33mWARN#033[0m#011auth middleware blocked bad JWT: parse token: token has invalid claims: token is expired
2024-03-26T10:42:29.573821+01:00 Hostname backrest[8892]: 2024-03-26T10:42:29.573+0100#011#033[33mWARN#033[0m#011auth middleware blocked bad JWT: parse token: token has invalid claims: token is expired
2024-03-26T10:42:29.921904+01:00 Hostname backrest[8892]: 2024-03-26T10:42:29.921+0100#011#033[33mWARN#033[0m#011auth middleware blocked bad JWT: parse token: token has invalid claims: token is expired
2024-03-26T10:42:29.958888+01:00 Hostname backrest[8892]: 2024-03-26T10:42:29.958+0100#011#033[33mWARN#033[0m#011auth middleware blocked bad JWT: parse token: token has invalid claims: token is expired
2024-03-26T10:42:30.075747+01:00 Hostname backrest[8892]: 2024-03-26T10:42:30.075+0100#011#033[35mDEBUG#033[0m#011login request#011{"username": "default"}
2024-03-26T10:42:30.075966+01:00 Hostname backrest[8892]: 2024-03-26T10:42:30.075+0100#011#033[33mWARN#033[0m#011failed login attempt#011{"error": "user not found"}
2024-03-26T10:42:34.919477+01:00 Hostname backrest[8892]: 2024-03-26T10:42:34.918+0100#011#033[33mWARN#033[0m#011auth middleware blocked bad JWT: parse token: token has invalid claims: token is expired
2024-03-26T10:42:39.914640+01:00 Hostname backrest[8892]: 2024-03-26T10:42:39.914+0100#011#033[33mWARN#033[0m#011auth middleware blocked bad JWT: parse token: token has invalid claims: token is expired
2024-03-26T10:42:45.063569+01:00 Hostname backrest[8892]: 2024-03-26T10:42:45.063+0100#011#033[33mWARN#033[0m#011auth middleware blocked bad JWT: parse token: token has invalid claims: token is expired
2024-03-26T10:42:49.771088+01:00 Hostname backrest[8892]: 2024-03-26T10:42:49.770+0100#011#033[35mDEBUG#033[0m#011login request#011{"username": "xxx"}
2024-03-26T10:43:36.837565+01:00 Hostname backrest[8892]: 2024-03-26T10:43:36.836+0100#011#033[34mINFO#033[0m#011running task#011{"task": "collect garbage"}
2024-03-26T10:43:36.850289+01:00 Hostname backrest[8892]: 2024-03-26T10:43:36.850+0100#011#033[34mINFO#033[0m#011collecting garbage#011{"forgotten_snapshots": 81, "operations_removed": 0}
2024-03-26T10:43:36.850342+01:00 Hostname backrest[8892]: 2024-03-26T10:43:36.850+0100#011#033[34mINFO#033[0m#011task finished#011{"task": "collect garbage", "duration": "13.289862ms"}
2024-03-26T10:43:36.850368+01:00 Hostname backrest[8892]: 2024-03-26T10:43:36.850+0100#011#033[34mINFO#033[0m#011scheduling task#011{"task": "collect garbage", "runAt": "2024-03-27T10:43:36+01:00"}
2024-03-26T10:43:36.850392+01:00 Hostname backrest[8892]: 2024-03-26T10:43:36.850+0100#011#033[34mINFO#033[0m#011running task#011{"task": "backup for plan \"Home-Raspi2\""}
2024-03-26T10:43:36.856323+01:00 Hostname backrest[8892]: 2024-03-26T10:43:36.856+0100#011#033[34mINFO#033[0m#011Starting backup#011{"plan": "Home-Raspi2", "opId": 1794443628301517874}
2024-03-26T10:43:36.856371+01:00 Hostname backrest[8892]: 2024-03-26T10:43:36.856+0100#011#033[35mDEBUG#033[0m#011repo orchestrator starting backup#011{"repo": "restic-backrest-raspi2"}

backrest_hibernation_system2.log

ArthurusDent commented 3 months ago

Looks like I was able to enable tracebacks but your suggestion with the environment variable needed a bit extra work. The environment variable needed to be added with systemctl edit backrest.service.

Content:

[Service]
Environment="GOTRACEBACK=all"

If I then kill backrest with kill -SIGQUIT %jobID it doesn't quit with "core dump" but it seems to do write something like it into the log file.

Please take a look if this is what you needed. SIGQUIT happened when backrest had gotten stuck. backrest_stuckt_with-traceback.log

garethgeorge commented 3 months ago

Thanks for posting this log! That's fantastic. Looks like it's blocked on Dequeue in the orchestrator, what's interesting is that the queue has only been stuck for ~3 minutes in that log (which is the polling interval) which implies to me that it's still looping but it's not getting any tasks from the queue. This doesn't make much sense seeing as when you post screenshots there's clearly a task scheduled that should also be present in the operation queue. Going to see if I can figure out how that might be happening.

benallfree commented 3 months ago

Tracking this on OSX as well. #189

garethgeorge commented 2 months ago

Significantly simplified the task queue implementation in https://github.com/garethgeorge/backrest/commit/8b9280ed57b84b7da814e285542c57b7c14209ae , the refactoring generalizes the implementation and breaks it up for more modular test coverage as the previous implementation was very hard to understand.

Hopeful this (and other improvements around scheduling) will have some impact here.

ArthurusDent commented 2 months ago

Ok, cool! Let me know when you need me to re-build and test.

garethgeorge commented 2 months ago

snip: found some significant bugs in the revised orchestrator, reverting it for now. Going to keep iterating on this.

garethgeorge commented 2 months ago

Released the new task queue implementation in 0.17.2 which fixes some causes of deadlock that existed in the original implementation. I'd be interested to have you give it a go @ArthurusDent and let me know if you're still seeing backrest stop scheduling.

benallfree commented 2 months ago

@garethgeorge FYI I'm testing this on OSX as well :)

ArthurusDent commented 2 months ago

Released the new task queue implementation in 0.17.2 which fixes some causes of deadlock that existed in the original implementation. I'd be interested to have you give it a go @ArthurusDent and let me know if you're still seeing backrest stop scheduling.

Ok, going to test on my two systems. Results will need a few days.

benallfree commented 2 months ago

@garethgeorge After some OSX testing, I can report that it still fails to resume the schedule after sleep. Back up now also has no effect. Restarting the daemon process fixes everything, and a normal SIGINT works (not Force Quit).

After restarting, the previously queued waiting items were cleared and new ones were created. Then everything started working normally.

ArthurusDent commented 2 months ago

Same thing here, backrest still got stuck after a few days. Problem isn't fixed.

garethgeorge commented 2 months ago

Darn, that's disappointing. If either of you get more logs (or traces to post) that'll continue to be helpful. One "cludge" (in my opinion) work around would be for me to add a monitor thread that configures backrest to restart itself every 24 hours or so (if no operations have run recently) -- but I feel it should be possible to make the scheduling robust enough that that isn't required.

I'd like to dig into a few more logs samples. The last few releases have narrowed down the possibilities re: what could be going on here and have added new logs that should give me a better understanding of what the scheduler is doing.

benallfree commented 2 months ago

@garethgeorge I'm happy to provide OSX logs if you can guide me to it 👍

Come to think of it, some sort of log copy/paste from the UI would be a great feature. (Ideally stripping out sensitive info). That way you could add whatever logging you need and users could send it to you.

ArthurusDent commented 2 months ago

Here is the log from my work PC: backrest_hybrid-sleep-v0.17.2.log

kytta commented 1 month ago

Hey there; I've just started using Backrest v1.1.0 on macOS 13.6 Ventura (via Homebrew), and I think I also have this issue. A backend of mine scheduled to run every hour did not run while the laptop was asleep, and it didn't run after being woken up. The operation would say, e.g., "8:00 pending", even if it's long after 8. "Editing" the settings (pressing Save without changes) or restarting Backrest help.

@garethgeorge you said you need logs and stuff; will the default log be enough? Do you want me to compile backrest with some debug options?

nbca2 commented 2 weeks ago

Hey there; I've just started using Backrest v1.1.0 on macOS 13.6 Ventura (via Homebrew), and I think I also have this issue. A backend of mine scheduled to run every hour did not run while the laptop was asleep, and it didn't run after being woken up. The operation would say, e.g., "8:00 pending", even if it's long after 8. "Editing" the settings (pressing Save without changes) or restarting Backrest help.

Hey there, i also have the same problem described above!

buddydvd commented 5 days ago

This issue is likely related #315.

My workaround is to close all browser tabs to Backrest before letting the computer go to sleep.

garethgeorge commented 5 days ago

Thanks for the callout and agree re: #315 likely related, it's worth note that this is fixed as of version 1.2.0 .