Open symphorien opened 3 years ago
If unison works without systemd, then I suggest you follow up with the systemd project. I can't even tell how you are actually running this (I don't use systemd), how stdin/stdout/stderr end up, etc. Generally when one runs a program as a daemon, when that program normally deals with stdin/stdout, those need to be redirected.
Unless there's evidence of a bug in unison, I'll close this after a bit.
Also feel free to ask for advice on the users list. There are more people there than subscribe to github issues.
Systemd does redirect stdout and stderr: to a unix socket. Here the traceback comes from Stdlib.flush
which more or less maps to write
returning EWOULDBLOCK or EAGAIN and this is perfectly ok for the OS to do that.
OK, then please adjust the issue's title and content to cause the reader to understand what is going wrong in unison, and that this isn't about systemd. Or feel free to close and start a new one.
I'm not sure EWOULDBLOCK is ok if the fd hasn't been set to nonblocking mode (by unison), but I'd have to go read POSIX to see.
This blog post describes a very close issue: https://www.uninformativ.de/blog/postings/2021-05-28/0/POSTING-en.html
systemd launches unison with what is the equivalent of unison 2>&1
, that is the same file descriptor for stderr and stdout.
Unison forks ssh with the same stderr as its own. ssh sets its stderr non-blocking, thus making unison's stdout non-blocking. This in turn causes flush to fail on EAGAIN.
I checked that this is very likely to be the cause:
exec 2>/dev/null
exec $THEREALSSH/bin/ssh "$@"
and then the issue disappeared.
Very interesting post! If this change in flags of the underlying "open file description" that is shared across processes via an inherited file descriptor is correct, then it seems that any program that writes has to be able to cope with nonblocking status for any of the fds that were open when it was executed, or which are not set CLOEXEC when any child was forked. This seems a bit much, and I wonder if that is really what POSIX means.
The test program is very tricky; it seems to need stderr mapped to stdout and a pipe. Doing that, I see the same behavior as the blog post on NetBSD and macOS. So at least there is consistency.
It remains to address the philosophical question of a child being able to set a parent's fd to nonblocking.
I wanted to try a fix for this but I can't reproduce it. Can you provide your systemd service file or a reproducer in any other form?
The linked post does seem to be right about the cause of this issue. I'd just add that mapping stderr to stdout is not even required in this case because Unison outputs a lot to stderr directly (historical reasons... probably should change it, but how many users' expectation would that break?).
Thank you for your interest.
I can only reproduce with openssh 8.6p1, not 8.8p1. Possibly due to https://github.com/openssh/openssh-portable/commit/7be4ac813662f68e89f23c50de058a49aa32f7e4
With openssh 8.6p1, command to reproduce, where /tmp/large_repo should be quite big
repo=$(mktemp -d) ; systemd-run --user --wait -- bash -c "DISPLAY= SSH_AUTH_SOCK=$SSH_AUTH_SOCK unison -auto=true -batch=true -repeat=watch -debug=verbose $repo ssh://localhost///tmp/large_repo"
you must have an ssh-agent running, and added your own ssh key locally with ssh-copy-id localhost
.
Log output is found in journalctl only.
The test program fails for me on NetBSD 9 and current, so it seems we don't have the fix.
I'm going to retitle this into a feature request to work around an ssh bug.
With these instructions, I have managed to reproduce the issue and I wonder if it is worth fixing it or not. It does seem to be an isolated case due to systemd redirecting stdout and stderr to a socket. Also, it seems to be fixed in openssh 8.7 (by the commit referenced by @symphorien).
@gdt which test program are you referring to, how did you test? Are you claiming that the issue appears on non-Linux platforms as well, just by setting stdout/stderr nonblocking?
I used the program in the ssh commit referenced above. There is a bug in openssh where it changes to nonblocking and does not restore.
The systemd issue may be a different one, and if the socket set up by systemd is nonblocking that seems like a systemd bug, but I'm really not clear what's going on.
My understanding is that this issue is valid only for the combination of running as systemd service + openssh < 8.7. The actual cause of our issue is a bug in openssh (changing to nonblocking and not restoring) which has been fixed since 8.7. Even then, normally this openssh bug is not a problem; it only becomes a problem when running Unison as systemd service.
I have not been able to reproduce the issue otherwise. systemd without openssh is OK; openssh without systemd is OK. So, I think this is a very isolated issue (Linux-only, must run as systemd service and have outdated openssh).
That makes sense. I was only trying to say that the test program in the openssh commit shows that the openssh on NetBSD has the issue. Reading back, I now understand that the problem is about unison stdout being nonblocking, and that it takes the combination of systemd and the openssh bug to trigger it.
I guess we can just mark wontfix and if somebody wants to create a workaround we can review it. Until and unless, I guess people can avoid running unison under systemd if they have old openssh.
I have encountered this problem on a fedora system running openssh 9.3. Luckily executing unison outside of systemd rapidly syncs the files solving the problem.
very strange that this happened with openssh 9.3.
Here is the log of the error I was receiving, if it is useful in determining whether maybe it is a different error:
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Uncaught exception Sys_blocked_io
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Raised by primitive operation at Stdlib in file "stdlib.ml" (inlined), line 341, characters 0-54
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Called from Uitext.Body.alwaysDisplay in file "/home/runner/work/unison/unison/src/uitext.ml", line 159, characters 2-14
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Called from Uitext.Body.display in file "/home/runner/work/unison/unison/src/uitext.ml" (inlined), line 166, characters 34-55
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Called from Uitext.Body.interact.loop.(fun) in file "/home/runner/work/unison/unison/src/uitext.ml", line 521, characters 14-26
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Called from Uitext.Body.interact in file "/home/runner/work/unison/unison/src/uitext.ml", line 759, characters 5-24
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Called from Uitext.Body.synchronizeOnce in file "/home/runner/work/unison/unison/src/uitext.ml", line 1209, characters 6-50
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Called from Uitext.Body.synchronizePathsFromFilesystemWatcher.loop in file "/home/runner/work/unison/unison/src/uitext.ml", line 1325, characters 6-106
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Called from Uitext.Body.synchronizePathsFromFilesystemWatcher in file "/home/runner/work/unison/unison/src/uitext.ml", line 1350, characters 2-23
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Fatal error: exception Sys_blocked_io
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Raised by primitive operation at Stdlib in file "stdlib.ml" (inlined), line 341, characters 0-54
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Called from Uitext.Body.alwaysDisplay in file "/home/runner/work/unison/unison/src/uitext.ml", line 159, characters 2-14
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Called from Uitext.Body.display in file "/home/runner/work/unison/unison/src/uitext.ml" (inlined), line 166, characters 34-55
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Called from Uitext.Body.interact.loop.(fun) in file "/home/runner/work/unison/unison/src/uitext.ml", line 521, characters 14-26
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Called from Uitext.Body.interact in file "/home/runner/work/unison/unison/src/uitext.ml", line 759, characters 5-24
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Called from Uitext.Body.synchronizeOnce in file "/home/runner/work/unison/unison/src/uitext.ml", line 1209, characters 6-50
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Called from Uitext.Body.synchronizePathsFromFilesystemWatcher.loop in file "/home/runner/work/unison/unison/src/uitext.ml", line 1325, characters 6-106
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Called from Uitext.Body.synchronizePathsFromFilesystemWatcher in file "/home/runner/work/unison/unison/src/uitext.ml", line 1350, characters 2-23
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Called from Stdlib in file "stdlib.ml" (inlined), line 341, characters 0-54
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Called from Util.msg.(fun) in file "/home/runner/work/unison/unison/src/ubase/util.ml", line 76, characters 28-40
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Called from Trace.writeLog in file "/home/runner/work/unison/unison/src/ubase/trace.ml", line 172, characters 19-34
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Called from Uitext.Body.handleException in file "/home/runner/work/unison/unison/src/uitext.ml", line 1516, characters 8-30
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Called from Uitext.Body.start in file "/home/runner/work/unison/unison/src/uitext.ml", line 1606, characters 6-23
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Called from Main.Body in file "/home/runner/work/unison/unison/src/main.ml" (inlined), line 279, characters 2-281
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net unison[122775]: Called from Linktext.TopLevel in file "/home/runner/work/unison/unison/src/linktext.ml", line 19, characters 18-40
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net systemd[1496]: unison.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 05 14:46:19 ool-4570802f.dyn.optonline.net systemd[1496]: unison.service: Failed with result 'exit-code'.
I also started being affected again a few months ago an reactivated my workaround https://github.com/bcpierce00/unison/issues/605#issuecomment-974712892 with success.
@symphorien @davidoskky if you can, could you test a build from https://github.com/tleedjarv/unison/actions/runs/7790378508 (scroll to bottom to see the builds). I have a potential workaround there but I don't have a chance to test myself.
Keep a backup of your archives if you test on production data. These builds are based on the latest unreleased code and may upgrade the archives.
I can reliably reproduce the error:
mkdir test-unison
cd test-unison
touch test-{1..2000}
The build you provided does not fix the problem.
unison -version
unison version 2.53.3 (ocaml 4.13.1)
Note: some files do get synced with the remote. In this case I get the files test 1 to 332 in the remote.
The build you provided does not fix the problem.
And it makes sense. I simply tried to dup(2) the fd but the reality is that non-blocking status is not set per fd, so no amount of dup(2) will help here.
What would properly fix(?) this issue is to rewrite the code expecting blocking outputs to tolerate non-blocking outputs. That's a bigger task.
Would a potential workaround be the TUI resetting stdout and stderr to blocking status? What kind of unexpected (negative) consequences would it have?
Also, do you know why the issue has reappeared? Has there been a change in either systemd or OpenSSH?
I think the big thing is to figure out what's going on. This was about a bug in openssh versions, and I feel that we more or less said WONTFIX because people should be updating ssh anyway. This may well be a different cause.
I'd like to know the cause, too, but exception Sys_blocked_io
is pretty telling about what's the result. This exception is raised only on EAGAIN and EWOULDBLOCK (in this case on stdout and/or stderr).
We know that it ends up non-blocking. But we don't know why and if this is the same bug recurring in ssh, if @symphorien actually is using old ssh by mistake, if there's some new bug in ssh, some new bug in systemd, or something else.
I can reproduce with a bit more files than @davidoskky
mkdir test-unison
cd test-unison
touch test-{1..7000}
unison-2.53.2 systemd 254.6 OpenSSH_9.6p1
When running unison to sync an empty dir with an old, reasonably large remote repo in a systemd service, I would get unison to hang systematically when printing the diff between the remote and local repo. Sometime I would get a traceback:
Always, unision would exit with code 2
When running unison outside of systemd, sync succeeded immediately. I suspect the problem is outputing too much on stdout and some pipe becomes full ? I don't know and now I can't reproduce anymore as the sync completed...
Redacted command:
Client and server are using the exact same executable of unison (/nix/store/1yrjwlx80n3zd337zlxwaknva82z34yv-unison-2.51.3/bin/unison)
unison 2.51.3, ocaml 4.12, NixOS 21.05 linux x86_64