borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/
Other
11.11k stars 741 forks source link

FileExistsError/FileNotFoundError when locking #5828

Closed pgerber closed 3 years ago

pgerber commented 3 years ago

Have you checked borgbackup docs, FAQ, and open Github issues?

Yes

Is this a BUG / ISSUE report or a QUESTION?

bug

System information. For client/server mode post info for both machines.

Your borg version (borg -V).

$ borg -V
borg 1.1.16

Operating system (distribution) and version.

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:        20.04
Codename:       focal

Hardware / network configuration, and filesystems used.

BTRFS on target host

Describe the problem you're observing.

I've been seeing locking-related errors for some time. See backtraces below. So far, I've only seen this issue on one out of two servers. The only difference I'm aware of is that the machine is using BTRFS as filesystem.

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

No, but the issue reappears sporadically.

Include any warning/errors/backtraces from the system logs

Traceback (most recent call last):
  File "borg/locking.py", line 130, in acquire
FileExistsError: [Errno 17] File exists: '/backup/borg_repos/backup01/lock.exclusive'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "borg/remote.py", line 247, in serve
  File "borg/remote.py", line 375, in open
  File "borg/repository.py", line 190, in __enter__
  File "borg/repository.py", line 421, in open
  File "borg/locking.py", line 350, in acquire
  File "borg/locking.py", line 363, in _wait_for_readers_finishing
  File "borg/locking.py", line 134, in acquire
  File "borg/locking.py", line 159, in kill_stale_lock
FileNotFoundError: [Errno 2] No such file or directory: '/backup/borg_repos/backup01/lock.exclusive'

Borg server: Platform: Linux backup02 5.4.0-73-generic #82-Ubuntu SMP Wed Apr 14 17:39:42 UTC 2021 x86_64
Borg server: Linux: debian bullseye/sid 
Borg server: Borg: 1.1.16  Python: CPython 3.7.9 msgpack: 0.5.6.+borg1
Borg server: PID: 1808518  CWD: /backup/borg_repos/backup01
Borg server: sys.argv: ['borg', 'serve', '--restrict-to-path', '/backup/borg_repos/backup01']
Borg server: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --error'
Platform: Linux backup01 5.4.0-73-generic #82-Ubuntu SMP Wed Apr 14 17:39:42 UTC 2021 x86_64
Linux: debian bullseye/sid 
Borg: 1.1.16  Python: CPython 3.7.9 msgpack: 0.5.6.+borg1
PID: 67423  CWD: /root
sys.argv: ['borg', 'create', '--error', '--compression', 'zstd', '--lock-wait', '43200', '--noatime', '--patterns-from', '/etc/borg/patterns-main', '--', 'borg-backup01@backup02.tocco.ch:~::main-{now:%Y-%m-%dT%H:%M:%S}']
SSH_ORIGINAL_COMMAND: None
Traceback (most recent call last):
  File "borg/locking.py", line 130, in acquire
FileExistsError: [Errno 17] File exists: '/backup/borg_repos/backup01/lock.exclusive'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "borg/remote.py", line 247, in serve
  File "borg/remote.py", line 375, in open
  File "borg/repository.py", line 190, in __enter__
  File "borg/repository.py", line 421, in open
  File "borg/locking.py", line 350, in acquire
  File "borg/locking.py", line 363, in _wait_for_readers_finishing
  File "borg/locking.py", line 134, in acquire
  File "borg/locking.py", line 159, in kill_stale_lock
FileNotFoundError: [Errno 2] No such file or directory: '/backup/borg_repos/backup01/lock.exclusive'

Borg server: Platform: Linux backup02 5.4.0-73-generic #82-Ubuntu SMP Wed Apr 14 17:39:42 UTC 2021 x86_64
Borg server: Linux: debian bullseye/sid 
Borg server: Borg: 1.1.16  Python: CPython 3.7.9 msgpack: 0.5.6.+borg1
Borg server: PID: 1808518  CWD: /backup/borg_repos/backup01
Borg server: sys.argv: ['borg', 'serve', '--restrict-to-path', '/backup/borg_repos/backup01']
Borg server: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --error'
Platform: Linux backup01 5.4.0-73-generic #82-Ubuntu SMP Wed Apr 14 17:39:42 UTC 2021 x86_64
Linux: debian bullseye/sid 
Borg: 1.1.16  Python: CPython 3.7.9 msgpack: 0.5.6.+borg1
PID: 67423  CWD: /root
sys.argv: ['borg', 'create', '--error', '--compression', 'zstd', '--lock-wait', '43200', '--noatime', '--patterns-from', '/etc/borg/patterns-main', '--', 'borg-backup01@backup02.tocco.ch:~::main-{now:%Y-%m-%dT%H:%M:%S}']
SSH_ORIGINAL_COMMAND: None
ThomasWaldmann commented 3 years ago

Do you have scripts running that use borg break-lock?

Do you have multiple borg client processes running against that one repo? At the same time?

ThomasWaldmann commented 3 years ago

Analyzing this:

So, as ELD is gone in 159 then something must have removed it (which is why I was asking for break-lock and other borg processes which maybe raced for deleting a stale lock).

ThomasWaldmann commented 3 years ago

Hmm, no matter how this happened, I think you have found a bug here.

borg should not crash when trying to look into a not-any-more-existing exclusive lock directory.

ThomasWaldmann commented 3 years ago

Note: master branch locking code did get recently get improvements/fixes by @alfredo08154711 and it already handles this case.

alfredo08154711 commented 3 years ago

Hi,

here is alfredo08154711 writing. It's not clear to me whether there is some action required from my side.

BTW, it's also not clear to me into which versions my fixes have been merged. I only know that they have been merged into the master branch about one year ago.

And I'm still waiting on a discussion of my suggested contribution, option --shared for borg with-lock (a PR, IIRC), "Lock the repository with a shared (read) instead of an exclusive (write) lock." I've been using this for my repository rsync copying (redundancy for business continuity) for one year without any observed issues.

Cheers Thomas

Am 9. Juni 2021 14:15:07 MESZ, schrieb TW @.***>:

Note: master branch locking code did get recently get improvements/fixes by @alfredo08154711 and it already handles this case.

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/borgbackup/borg/issues/5828#issuecomment-857643406

-- Diese Nachricht wurde von meinem Android-Mobiltelefon mit K-9 Mail gesendet.

ThomasWaldmann commented 3 years ago

hi @alfredo08154711 - no action strictly required, maybe you want to have a look in the PR.

yes, your fixes are only in master branch (and thus in borg 1.2).

i first wanted to have that tested there before doing big changes in 1.1 locking. not sure how much testing the 1.2 betas got until now - a lot people might hold back and use a stable release only rather than additionally testing a beta.

about the with-lock --shared PR: will try to review it, but I'ld also like other users/developers to review that and give feedback.

do you mean #4120 ?

alfredo08154711 commented 3 years ago

Vielen Dank, Herr Waldmann.

Eigentlich hätte ich Sie mal gern persönlich kennengelernt - als ich mit den Fixes begann, hatte ich noch einen Job bei AIRBUS in Ulm und pendelte daher wöchentlich durch Ihre Gegend (ich wohne 8km von Aschaffenburg entfernt Richtung Frankfurt). Aber wegen der Bundescorontäne habe ich von der Idee, Sie mal zu treffen, Abstand genommen.)

Aber was nicht ist, kann ja noch werden...

Viele Grüße Thomas Portmann

Am 9. Juni 2021 14:50:34 MESZ, schrieb TW @.***>:

hi @alfredo08154711 - yes, your fixes are only in master branch (and thus in borg 1.2).

i first wanted to have that tested there before doing big changes in 1.1 locking. not sure how much testing the 1.2 betas got until now - a lot people might hold back and use a stable release only rather than additionally testing a beta.

about the with-lock --shared PR: will try to review it, but I'ld also like other users/developers to review that and give feedback.

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/borgbackup/borg/issues/5828#issuecomment-857666406

-- Diese Nachricht wurde von meinem Android-Mobiltelefon mit K-9 Mail gesendet.

pgerber commented 3 years ago

Do you have scripts running that use borg break-lock?

No, break-lock is not used.

Do you have multiple borg client processes running against that one repo? At the same time?

Yes.

According to the crontab file two jobs are actually started simultaniously:

# create backup "main" at borg-backup01@backup02.tocco.ch:~
30 21 * * * root  borg create --error --compression zstd --lock-wait 43200 --noatime --patterns-from /etc/borg/patterns-main -- 'borg-backup01@backup02.tocco.ch:~'::main-{now:\%Y-\%m-\%dT\%H:\%M:\%S}

# prune backup "main" at borg-backup01@backup02.tocco.ch:~
30 20 * * * root  borg prune 'borg-backup01@backup02.tocco.ch:~' --lock-wait 43200 --keep-within 7d --keep-daily 7 --keep-weekly 4 --prefix main-

# create backup "postgres" at borg-backup01@backup02.tocco.ch:~
02 22 * * * root  borg create --error --compression zstd --lock-wait 43200 --noatime --patterns-from /etc/borg/patterns-postgres -- 'borg-backup01@backup02.tocco.ch:~'::postgres-{now:\%Y-\%m-\%dT\%H:\%M:\%S}

# prune backup "postgres" at borg-backup01@backup02.tocco.ch:~
30 20 * * 1-5 root  borg prune 'borg-backup01@backup02.tocco.ch:~' --lock-wait 43200 --keep-daily 7 --keep-weekly 4 --prefix postgres-

# check repo at borg-backup01@backup02.tocco.ch:~
00 23 * * 0 root  borg check --lock-wait 43200 'borg-backup01@backup02.tocco.ch:~' >/dev/null 2>&1

# intended for use by monitoring and health checks
0 12 * * * root  borg info --glob '*' --lock-wait 43200 --last 300 --json 'borg-backup01@backup02.tocco.ch:~' >/var/lib/tocco-borg/info-1.tmp && mv /var/lib/tocco-borg/info-1.tmp /var/lib/tocco-borg/info-1

I guess that could explain why it only happens with one server. I can't see any other simultaneous executions of borg.

Note: master branch locking code did get recently get improvements/fixes by @alfredo08154711 and it already handles this case.

I noticed the errors a while back and noticed that refactoring of the locking mechanism being ongoing. So, I refrained from reporting the issue under the assumption that I'd likely be fixed by the change. I assumed the change went into 1.1.16 but I guess was wrong about that.

I'm thinking I should just update the one server to borg 1.2 and see if that helps. Considering that only the secondary backup server is affected, the risk should be manageable.

ThomasWaldmann commented 3 years ago

OK, 1.1 fixed recently, master fixed since longer, all done here.

@pgerber 1.2 still in beta, be careful. testing is very welcome of course.