Closed halfgaar closed 2 months ago
What result do you expect by using CTRL-C while the process is running?
Ctrl-C is a request to finish nicely. In the Borg 1.2 release notes:
create: First ctrl-c (SIGINT) triggers checkpoint creation and then aborts.
This commit message adds support for it. Also to carefully wrap up the current deletion, if that's what it's doing.
This normally happens well, but there are occasions, mostly seen with new archives, that crash like the above.
For reference, this is the normal interrupt output of a small test (replaced 'clear line' char with 'newline', so it's readable):
0 B O 0 B C 0 B D 0 N src
Initializing cache transaction: Reading config
Initializing cache transaction: Reading chunks
Initializing cache transaction: Reading files
715.15 kB O 715.15 kB C 0 B D 0 N src/big
17.74 MB O 17.74 MB C 0 B D 0 N src/big
32.51 MB O 32.51 MB C 0 B D 0 N src/big
46.96 MB O 46.96 MB C 0 B D 0 N src/big
59.21 MB O 59.21 MB C 0 B D 0 N src/big
76.95 MB O 76.95 MB C 0 B D 0 N src/big
97.65 MB O 97.65 MB C 0 B D 0 N src/big
123.05 MB O 123.05 MB C 0 B D 0 N src/big
137.59 MB O 137.59 MB C 0 B D 0 N src/big
147.16 MB O 147.16 MB C 0 B D 0 N src/big
168.24 MB O 168.24 MB C 0 B D 0 N src/big
194.63 MB O 194.63 MB C 0 B D 0 N src/big
Saving files cache
Saving chunks cache
Saving cache config
Initializing cache transaction: Reading config
Initializing cache transaction: Reading chunks
Initializing cache transaction: Reading files
195.97 MB O 195.97 MB C 0 B D 0 N src/big
(ctrl-c)
Saving files cache
Saving chunks cache
Saving cache config
Initializing cache transaction: Reading config
Initializing cache transaction: Reading chunks
Initializing cache transaction: Reading files
209.72 MB O 209.72 MB C 0 B D 1 N src/big
Got Ctrl-C / SIGINT.
This process can take a few dozen seconds sometimes.
That TypeError
in Python stdlib signal:39
(_enum_to_int
) seems to be a cosmetic thing, considering there is an exception handler dealing with that TypeError
:
https://github.com/python/cpython/blob/3.9/Lib/signal.py#L39
Wondering why we see handler
twice in the tracebacks - that looks like SIGINT was fired twice and the 2nd firing was while the first one was still being processed (was still inside __exit__
).
I can absolutely guarantee that's not the case. Here's my history:
I normally ctrl-c like a maniac. That's because often I abort shell scripts or loops with subprocesses. So, the first time I used Borg I did that too. This triggered a problem like this, and then I started investigating how Borg deals with that. Then I found it was documented behavior that one ctrl-c aborts gracefully, and two is harder. I've been using that method regularly.
But, it seems that under certain conditions, and it looks like it has to do with it being a new archive or no checkpoint being present yet, this crash happens?
I tried to reproduce with borg 1.2.7 (built from git tag) on macOS. I used a user@localhost:/tmp/repo12
repo, so it used the remote.py
code and ssh, but it always behaved as it should (about 10 attempts), no tracebacks.
@halfgaar Sure, I believe you. Maybe there is a bug somewhere. The generator it complains about is in the remote.py
code part, so maybe that only happens with "remote" (ssh) repos.
I am trying to reproduce now on a ubuntu system, let's see if that behaves differently.
I doubt that it has to do with the presence of checkpoints. borg always starts a fresh archive when using borg create
. It does not really "continue from where it was", it only feels like that due to the deduplication and the chunks already present in the repo.
OK, tried on Ubuntu Mantic (23.10) using Python 3.11.6 and borg 1.2.7 (from git tag) with a ssh: repo (with and without prior stuff in it): it also behaved as it should, not tracebacks.
I sshed into the ubuntu machine, so my ctrl-c keypress was transmitted via ssh to the borg client (not on a local text console or a gui terminal).
I see you always used python 3.9.18 on 2 different ubuntu versions - can you retry with some more recent python? Or just use Ubuntu's system python3?
Also, what's the borg version on the repo side? In my case it was both the same 1.2.7 version (although the traceback does not look like the remote borg version should have an influence on that).
Another try:
On Ubuntu Mantic (23.10) using Python 3.11.6 and borg 1.2.7 - this time installed using pip install borgbackup
into a fresh venv. Also with a ssh: repo (with and without prior stuff in it): it also behaved as it should, not tracebacks.
@halfgaar Do you use a borg fat binary downloaded from github releases page? If so, please give me the link to the binary you use and the sha256sum of the binary on your system.
https://github.com/borgbackup/borg/releases/download/1.2.7/borg-linux64 I tried to reproduce using that file and interestingly I sometimes got "Keyboard interrupt" (looks like the default python handler) and sometimes "Got Ctrl-C" (borg's checkpoint-saving handler), never a traceback though.
OK, so I guess this happens when one uses a pyinstaller binary (which creates a bootloader process and a borg process, both being in same process group):
So the borg process ends up getting SIGINT twice.
About adding --bootloader-ignore-signals
to the pyinstaller commandline:
I the fat binary, sha256 hash d53081df3926ecb631610a99218766505b13ccf9cd5cbeae10da5ada826f5925
. Version 1.2.7.
I completely forgot about the fact that my target server is Debian 12, with Python 3.11.2. My source machines are stock Ubuntu, with Python Python 3.8.16 and Python 3.10.12.
Does the 3.9.18 come from the fat binary?
Here's another different result:
$ borg create --stats --progress 'halfgaar@localhost:/tmp/borgtest/myrepo::bla-{now}' src/
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is SHA256:xTw9vxM1KjAAJ88sOHuoDHKMmJuSddNB9enepjI5mLI.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Remote: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Warning: The repository at location ssh://halfgaar@localhost/tmp/borgtest/myrepo was previously located at /tmp/borgtest/myrepo
Do you want to continue? [yN] y
^CRemote: Borg 1.2.7: Got unexpected RPC data format from client.
Connection closed by remote host
I've been trying to reproduce again, but now I can't. I don't know what I'm doing differently.
But I see you may have a culprit.
Yes, the py39 comes bundled within the fat binary (that's how I noticed you likely use that). In my most recent experiment, I used the same binary as you use on your machine.
Not sure yet how a good solution for this issue would look like. It could be avoided by not using that binary, because the cause is the binary's "bootloader" process.
Guess a solution would involve not calling self.__exit__
from the handler, because by doing that, we would revert the SIGINT handler to python's default one and thus lose control over it.
If then a duplicate SIGINT comes in a microsecond later, our handler won't be called any more and it'll terminate the process like if we pressed Ctrl-C more than once.
See #8162 - I tested it with a fresh pyi-made binary and it "debounced" successfully!
Sounds great. I'll be busy/gone for a while the coming days so won't be able to test more, but it looks like it's all done practically.
I've had this with Borg 1.2.6 and now with 1.2.7 again. Interrupting it (on the initial, non-finished backup) results in a crash and even a crash in a crash. I did ctrl-c once, not multiple times.
Ubuntu 20.04 with Borg 1.2.6:
This is Ubuntu 22.04 with Borg 1.2.7:
And another variant on Ubuntu 22.04 with Borg 1.2.7:
Systems are plain ext4, backing up over SSH to encrypted and unencrypted repo.