Snapshotting should wait for user-callback to prepare/mount target filesystem before throwing errors

bit-team / backintime

Back In Time - An easy-to-use backup tool for GNU/Linux using rsync in the back

https://backintime.readthedocs.io

GNU General Public License v2.0

2.09k stars 204 forks source link

Snapshotting should wait for user-callback to prepare/mount target filesystem before throwing errors #1142

Open kortschak opened 3 years ago

kortschak commented 3 years ago

I have a Back in Time profile that stores to a NAS that is powered down for most of the time. The NAS is set to wake with a WOL packet that is sent by a user-callback which waits after the WOL packet has been sent then returns a success status if the NAS comes up within the timeout. This all works very nicely since #654 was fixed.

However, the backup to the NAS is scheduled for the middle of the night and often in the morning I see a notification like so

notify

despite the backup having succeeded. This is due to this https://github.com/bit-team/backintime/blob/9310acc30362907c7d14f3c8c31a0e4a366d44f5/common/snapshots.py#L669-L674 which makes sense in the context of attended backups, but not in the context of unattended scheduled backups where there is no user to follow the instructions. In the later context it leaves doubt in the mind of the user as to whether the backup succeeded. The issue appears to arise due to a race between backintime and the remote being mounted by the system since the user-callback waits until the NAS is up before it returns. backintime iterates a test for the presence of the remote mount each second, but first notifying the user that the remote mount is not present.

In looking into this issue, I found that backintime also logs as an ERROR to syslog for checks where the remote mount is not present (actually it checks whether it is not a directory): https://github.com/bit-team/backintime/blob/9310acc30362907c7d14f3c8c31a0e4a366d44f5/common/config.py#L1356-L1360 This results in misleading ERROR log lines in syslog.

Mar 22 01:00:02 client backintime (user/2): INFO: Lock
Mar 22 01:01:52 client backintime (user/2): INFO: user-callback returned 'user-callback: 2021/03/22 01:00:20 sending wake packet#012user-callback: 2021/03/22 01:01:52 server ready'
Mar 22 01:01:53 client backintime (user/2): INFO: mount ssh: user@server.local:/data/backups on /home/user/.local/share/backintime/mnt/XXXXXXXX/mountpoint
Mar 22 01:01:53 client backintime (user/2): INFO: Take a new snapshot. Profile: 2 remote
Mar 22 01:01:53 client backintime (user/2): ERROR: /home/user/.local/share/backintime/mnt/2_28727/backintime/client/user/remote does not exist
Mar 22 01:01:54 client backintime (user/2): ERROR: /home/user/.local/share/backintime/mnt/2_28727/backintime/client/user/remote does not exist
Mar 22 01:01:57 client backintime (user/2): INFO: Call rsync to take the snapshot
Mar 22 02:23:40 client backintime (user/2): WARNING: Command "rsync --recursive ...
Mar 22 02:23:40 client backintime (user/2): WARNING: CONTINUE: --exclude...
Mar 22 02:23:40 client backintime (user/2): INFO: Save config file
Mar 22 02:23:40 client backintime (user/2): INFO: Save permissions
Mar 22 02:26:27 client backintime (user/2): INFO: Create info file
Mar 22 02:26:31 client backintime (user/2): INFO: [smart remove] remove snapshots: [20210315-010002-225]
Mar 22 02:46:50 client backintime (user/2): INFO: unmount ssh: user@server.local:/data/backups from /home/user/.local/share/backintime/mnt/XXXXXXXX/mountpoint
Mar 22 02:46:50 client backintime (user/2): INFO: Unlock

ISTM that the check here should be a silent check for whether the file exists (if it doesn't after the 30 seconds of checking has elapsed it does finally error out) and only if it does exist then check that it's a directory, erroring out if not. The fix for this is simple and I will send a PR.

The fix for the underlying issue is less clear to me; ISTM that perhaps the code should be aware of whether the backup has been invoked by a user or is the result of an unattended schedule, only asking for user input if there is a user likely to be present (i.e. not in the case of scheduled backup).

colintedford commented 3 years ago

perhaps the code should be aware of whether the backup has been invoked by a user or is the result of an unattended schedule, only asking for user input if there is a user likely to be present (i.e. not in the case of scheduled backup).

The user may be present even in the case of scheduled backup (for example if they backup hourly).

kortschak commented 3 years ago

Yes, this is why I felt this needed more discussion. Perhaps an *.schedule.unattended boolean configuration would make sense?

buhtz commented 2 years ago

@kortschak @colintedford There is a new forming maintaining team and we do review all issues. Is this problem still relevant for you, can you reproduce it with a newer or the latest release? Did you find a solution?

Tag: Bug

This depends on a refactoring of the whole logging mechanic. If I understood the implications correct this is not a blocking or heavy bug. That bug can't be fixed isolated. Because of that I vote for close. No matter the logging mechanic will be recreated using Pythons own logging package. This will take time and won't happen in the next release because it implicates some breaking changes.

kortschak commented 2 years ago

This is still relevant. I run the version at https://github.com/bit-team/backintime/pull/1143 and have not moved from there because of the impact that that bug has on my use. I don't see any changes between that branch and master that would affect the behaviour here (the relevant code in config.py has been touched, but not in a way that would fix this).

All the requirements to fix the issue are described in the issue, but depend on design decisions, which is why I did not send a fix. I don't think that the bug should be closed until the issue no longer exists, and I would say that there is a fundamental issue of logging non-error states as errors due to invalid logic, not a logging issue.

buhtz commented 2 years ago

Please let me paraphrase this to see if I understand it.

Your backup destination is a NAS that is not online 24/7
Your backup/snapshot job has a user-callback script that use wake-on-lan to wakeup the NAS before the backup starts.

Does your script wait until the NAS has booted (e.g. via checking via ping?). Looking at the time code of your log output it doesn't wait.

No matter if that error appears or not your backups are done?

I run the version at https://github.com/bit-team/backintime/pull/1143 and have not moved from there because of the impact that that bug has on my use.

What is the "impact" in your use-case? It sounds like you can't do backups without that PR? Am I right?

colintedford commented 2 years ago

Not relevant to me now because I'm not currently using backintime. I'm glad to hear about the team, tho.

kortschak commented 2 years ago

Your backup destination is a NAS that is not online 24/7

Correct.

Your backup/snapshot job has a user-callback script that use wake-on-lan to wakeup the NAS before the backup starts.

This is the case. There is a callback to this program https://github.com/kortschak/bit-user-callback/blob/master/bit-user-callback.go using the following configuration

{
  "iwconfig-path": "/sbin/iwconfig",
  "verbose": false,
  "profile": "remote",
  "essid": "SSID",
  "server": "http://nas.host.address/",
  "wake-mac": "xx:xx:xx:xx:xx:xx",
  "wake-delay": "20s",
  "wake-timeout": "10m0s",
  "wake-remote": "255.255.255.255:9"
}

You can see here that the wake code waits 10 minutes for the NAS to come up before failing out (this is this loop). The NAS takes less than 2 minutes to boot.

No matter if that error appears or not your backups are done?

I think that is correct; this issue is over a year old, so the details are fuzzy.

What is the "impact" in your use-case? It sounds like you can't do backups without that PR? Am I right?

Error logs result in system notifications. BIT without this change can cause significant spamming of notifications that make the Gnome notification bar useless (in pathological cases causing Gnome to become unresponsive — this doesn't happen in this case, but can when other errors occur, this is a different (but related) and bigger issue that may be fixed by your plans to refactor the logging system if you keep track of the number of ERROR level log messages that you send).

buhtz commented 2 years ago

Maybe my next question is because I'm not deep enough into the BIT code.

When your user-callback does wait 10 minutes the NAS is up by guarantee.

Then why does the error message happens? Doesn't BIT wait until the end of the user-callback script?

kortschak commented 2 years ago

There are two issues here, the first is shown in the screenshot and is a result of the code in the first block, this ~always happens and is not an error, but spams the notifications. I don't know why it fails to find what it is looking for, but my first guess would be that while the NAS is up and responding to HTTP request, perhaps the FS is slow to respond early. The second is the error logging and I haven't seen that for over a year because I run a version that fixes it; I don't recall the details of when and why it happens.

buhtz commented 2 years ago

Now we come closer.

There are two issues here, the first is shown in the screenshot and is a result of the code in the first block, this ~always happens and is not an error, but spams the notifications.

I understand and agree.

I don't know why it fails to find what it is looking for

That is the important question for me. In my understanding (without reviewing the BIT code) this error shouldn't happen because your user-callback bring up the NAS and does wait long enough.

, but my first guess would be that while the NAS is up and responding to HTTP request, perhaps the FS is slow to respond early. The second is the error logging and I haven't seen that for over a year because I run a version that fixes it; I don't recall the details of when and why it happens.

emtiu commented 2 years ago

Dan, thanks for your patience and clarity in dealing with this issue, and taking the time to explain your troubles and your proposed solutions. First off, let's agree that (as you explained above), you're dealing with two separate annoyances:

1. Spurious ERRORs in syslog

There are errors thrown by the canBackup function: https://github.com/bit-team/backintime/blob/88d19d42e9128f60b8952a64ed1fff23bd28108f/common/config.py#L1356-L1357 which cause an annoyance in your logs, because they indicate a problem that resolves itself after a few seconds. That is discussed at length in https://github.com/bit-team/backintime/pull/1143#issuecomment-1262053796. It's not what this Issue is about.

Moreover, it becomes irrelevant, once we fix …:

2. Spurious desktop notification

You're receiving a desktop notification that says Can't find snapshots folder. If it is on a removable drive […] every 30 seconds, when in fact your user-callback script is still in the middle of waiting for the NAS to come up (which takes about 90 seconds, judging from your logs). This is a race condition, and should be fixed. That's what this Issue is about.

If you agree, I think the title of this Issue should be: Snapshotting should wait for user-callback to prepare/mount target filesystem before throwing errors

… and that would be our new starting point for eliminating the annoyance for you and, hopefully, others :)

emtiu commented 2 years ago

//cc @aryoda, who is looking into improving user-callback functionality.

There's a long-ish discussion here, but you really only need this summary from my above comment:

Spurious desktop notification: You're receiving a desktop notification that says Can't find snapshots folder. If it is on a removable drive […] every 30 seconds, when in fact your user-callback script is still in the middle of waiting for the NAS to come up (which takes about 90 seconds, judging from your logs). This is a race condition, and should be fixed.

aryoda commented 2 years ago

@emtiu I saw some (just merged) PRs for this but I think they "only" do fix the notfication spamming, right?

1329
1328
1143 (directly commited without a PR)

So if there is still work to do I need to understand what exactly.

TLDR;

What exactly should be changed now from your user point-of-view (after the above PRs are already merged in)?

I see only one thing (but I jumped onto this issue very late):

Do not send the "imperative" user notification Can't find snapshots folder. If it is on a removable drive please plug it. from cron jobs (since it is send before the timeout loop which normally heals the missing "mount" and since a user is not expected to read and understand this).

https://github.com/bit-team/backintime/blob/b4dd2d1737c5031d264d667c18f6987a1916762b/common/snapshots.py#L672-L691

Anything missing?

Details

The race condition can only happen if two++ processes are working in parallel, eg.

WOL in the user-callback (incl. mounting the drive if this is done in there)
BiT snapshots.backup() polling for a correctly mounted target drive.

@kortschak

The NAS is set to wake with a WOL packet that is sent by a user-callback which waits after the WOL packet has been sent then returns a success status if the NAS comes up within the timeout.

I assume this is sync call (waits for the NAS beeing up) - I have not checked your GO code (cool script BTW) in details but think it "only" reacts on the "mount" signal 7:

# Taken from user-callback script in GO: https://github.com/kortschak/bit-user-callback/blob/master/bit-user-callback.go
# user-callback ignores profile id and only acts for reason 7 [= mount]

The issue appears to arise due to a race between backintime and the remote being mounted by the system since the user-callback waits until the NAS is up before it returns.

I think if the user-callback for WOL is a sync call and there could be a race between sshd starting on the NAS and BiT trying to mount over SSH, at least I can see in the logs that you are using SSH in BiT:

Mar 22 01:00:02 client backintime (user/2): INFO: Lock
Mar 22 01:01:52 client backintime (user/2): INFO: user-callback returned 'user-callback: 2021/03/22 01:00:20 sending wake packet#012user-callback: 2021/03/22 01:01:52 server ready'
Mar 22 01:01:53 client backintime (user/2): INFO: mount ssh: user@server.local:/data/backups on /home/user/.local/share/backintime/mnt/XXXXXXXX/mountpoint
Mar 22 01:01:53 client backintime (user/2): INFO: Take a new snapshot. Profile: 2 remote
Mar 22 01:01:53 client backintime (user/2): ERROR: /home/user/.local/share/backintime/mnt/2_28727/backintime/client/user/remote does not exist
Mar 22 01:01:54 client backintime (user/2): ERROR: /home/user/.local/share/backintime/mnt/2_28727/backintime/client/user/remote does not exist
Mar 22 01:01:57 client backintime (user/2): INFO: Call rsync to take the snapshot

I think this could be fixed in the user-mount by also waiting for the sshd (ssh server) to be up and then returning from the user-callback.

If BiT calls the user-callback to ask for a mount (reason 7 signal ) it should rely on an available mount to do further processing when the user-callback returns (unless the user-callback returns with an error).

The additional 30 secs timeout for BiT-side mounting of ssh (or whatever is configured in the profile) comes on top and informing the user with an error only after timing out (instead of spamming notifications for the timeout count down) is implemented now via above PRs I think.

ISTM that perhaps the code should be aware of whether the backup has been invoked by a user or is the result of an unattended schedule, only asking for user input if there is a user likely to be present (i.e. not in the case of scheduled backup).

I assume with "user input" you mean the "user notification" in your Gnome notification bar when logged in which shows:

Can't find snapshots folder. If it is on a removable drive please plug it.

If really something different pops up requiring user interaction please provide more details or a screen shot here.

I fully agree that a (head-less) scheduled backup should only write to logs and send user notifications only at the end (and only if a user is logged in).

kortschak commented 2 years ago

I assume this is sync call (waits for the NAS beeing up)

Yes. You can see the core logic for that here https://github.com/kortschak/bit-user-callback/blob/5fc406cfd52bb6279884127e2c10f4726c63c7cf/bit-user-callback.go#L335-L361

The basis for assessing whether the NAS is up is that it responds with a 200 on a get to the web portal for the NAS. A reasonable human would expect that a success here would mean the fs was up, ... but software.

The remainder is there to ensure that the device is at on the correct wireless network and config handling.

think it "only" reacts on the "mount" signal 7:

This is correct.

https://github.com/kortschak/bit-user-callback/blob/5fc406cfd52bb6279884127e2c10f4726c63c7cf/bit-user-callback.go#L323-L325

I assume with "user input" you mean the "user notification" in your Gnome notification bar when logged in

Yes.

I fully agree that a (head-less) scheduled backup should only write to logs and send user notifications only at the end (and only if a user is logged in).

Partially. Logs should be written, but user notification absolutely should not be used in a headless situation. The logs that should be written should be at the appropriate level (I think this is fixed now).

aryoda commented 2 years ago

Logs should be written, but user notification absolutely should not be used in a headless situation.

Yes, in 100 % headless situations a user notification is simply not possible and BiT offers the notification plugin for this headless use case (eg. to send emails).

If a cron job is started under a certain user account and the user is logged-in in a desktop environment when the job finishes I would not consider this as a headless situation (eg. Ubuntu informs me via notifications - when I am logged-in - that some scheduled updates were not possible - eg. snap security updates).

IMHO it would be very helpful for me as BiT user if scheduled backup jobs (in user's cron) gave me a notification if a backup failed (or generally that it was executed and what the result was) so that I am reminded that it worked (or not). Otherwise I had to actively check the logs (if I think of it) or could think backups work but they don't.

Anyhow these are the options I see:

A) Implementing a "cron-zero-notification" change:

A backup started via cron could be recognized in this this code (eg. by setting a transient flag as you proposed earlier): https://github.com/bit-team/backintime/blob/b4dd2d1737c5031d264d667c18f6987a1916762b/common/backintime.py#L767-L771
The only and central code location where user notifications are sent is in the notifyplugin.py. Here all messages could be suppressed if the flag from 1. is set (and maybe redirected to a the log not already implemented).

Note: I was surprised that the GUI-plugin notifyplugin.py is always loaded (even if only backintime CLI was started). This seems inconsistent...

https://github.com/bit-team/backintime/blob/b4dd2d1737c5031d264d667c18f6987a1916762b/common/pluginmanager.py#L206-L210

B) Alternative: Implementing a "if cron then notify only about final backup status" change:

Same as 1. above
All existing pluginmanager.message() calls must be changed to consider the flag from 1. (and future implementations must stick to this but can be forgotten easily due to the de-central nature of this logic).

There are currently only one code location containing this call: snapshots.setTakeSnapshotsMessage():

https://github.com/bit-team/backintime/blob/b4dd2d1737c5031d264d667c18f6987a1916762b/common/snapshots.py#L148

Unfortunately changing this call by considering the flag value from 1. is not the right way since setTakeSnapshotsMessage() is called from about 20 different other locations to send notifications that have either nothing to do with snapshots (eg. "Saving config file", "removing old snapshots") or are informing about a non-final state of taking a snapshot (eg. "Taking snapshot").

I would modify setTakeSnapshotsMessage() by appending a new argument IsFinalResultStatus = False and modify only the calls that represent the final backup result (OK or error results) with IsFinalResultStatus = True.

@kortschak @emtiu @buhtzz Any opinions?

kortschak commented 2 years ago

Yeah, I should clarify, notifications are reasonable, but expecting user interaction is not.

The final outcome notification looks like the best approach to me. I agree that it is easily possible for the user to not notice that a backup has failed otherwise.

emtiu commented 2 years ago

@aryoda wrote:

If BiT calls the user-callback to ask for a mount (reason 7 signal ) it should rely on an available mount to do further processing when the user-callback returns (unless the user-callback returns with an error).

Yes, this is what I consider the core of this issue. If this was done correctly, @kortschak would not be seeing the unwelcome notifications.

But other considerations regarding notifications also come into play:

@aryoda wrote:

IMHO it would be very helpful for me as BiT user if scheduled backup jobs (in user's cron) gave me a notification if a backup failed (or generally that it was executed and what the result was) so that I am reminded that it worked (or not). Otherwise I had to actively check the logs (if I think of it) or could think backups work but they don't.

Yes, this problem (silent failure) has been mentioned in #450, and I've also experienced it myself. Different Issue ;)

@aryoda wrote:

A) Implementing a "cron-zero-notification" change:

I think this is what #850 also wishes for.

On the other hand, users might wish for that notification: It would remind them to connect a drive (for which they have 30 seconds), then run a snapshot, and then the drive might be removed again.

B) Alternative: Implementing a "if cron then notify only about final backup status" change:

Yes, users might very well want to be notified of the final result of a snapshot by desktop notification, but currently, this does not happen.

The "Saving config file" etc. notifications are "BiT-internal" (shown in the GUI and the tray icon), but not as desktop notifications.

In summary, a lot could be done to improve notifications. It would probably be best to have per-profile configuration options like:

"Notify when drive is not connected"
"Notify when snapshot fails"
"Notify when snapshot succeeds".

But this would require a major reworking, including the integration of the If it is on a removable drive please plug it. desktop notification with the "BiT-internal" notifications in the GUI and the tray icon.

For the moment, I think several other issues and bugs are more important. But this Issue gives a good summary.

aryoda commented 2 years ago

For the moment, I think several other issues and bugs are more important. But this Issue gives a good summary.

I agree putting this issue into the backlog (esp. since the plugin and notification systems need a complete overhaul...)

aryoda commented 2 years ago

@emtiu

Yes, users might very well want to be notified of the final result of a snapshot by desktop notification, but currently, this does not happen.

THX, I didn't realize that only selected messages are send as desktop notifications and just checked why...

The notifyplugin does a notify-send only for errors (argument level = 1 - nice example for a "magic number" BTW ;-)

https://github.com/bit-team/backintime/blob/b4dd2d1737c5031d264d667c18f6987a1916762b/qt/plugins/notifyplugin.py#L48-L50

bit-team / backintime

Snapshotting should wait for user-callback to prepare/mount target filesystem before throwing errors #1142

1. Spurious ERRORs in syslog

2. Spurious desktop notification

1329

1328

1143 (directly commited without a PR)

TLDR;

Details