cfoust / cy

🤖 time travel in the terminal
MIT License

Killing cy leaves a state where cy cannot be started again #31

Open xxxserxxx opened 1 week ago

xxxserxxx commented 1 week ago
$ cy
{"level":"panic","error":"daemon: Resource temporarily unavailable","time":"2024-11-04T06:01:16-06:00","message":"failed to daemonize"}
panic: failed to daemonize

goroutine 1 [running]:
github.com/rs/zerolog/log.Panic.(*Logger).Panic.func1({0xd5e4e6?, 0x0?})
        /home/ser/.local/share/go/pkg/mod/github.com/rs/zerolog@v1.29.1/log.go:376 +0x27
github.com/rs/zerolog.(*Event).msg(0xc000213020, {0xd5e4e6, 0x13})
        /home/ser/.local/share/go/pkg/mod/github.com/rs/zerolog@v1.29.1/event.go:156 +0x2aa
github.com/rs/zerolog.(*Event).Msg(...)
        /home/ser/.local/share/go/pkg/mod/github.com/rs/zerolog@v1.29.1/event.go:108
main.startServer({0xc00070b338, 0x14})
        /home/ser/.local/share/go/pkg/mod/github.com/cfoust/cy@v1.1.0/cmd/cy/server.go:159 +0x14d
main.connect({0xc00070b338, 0x14}, 0x1)
        /home/ser/.local/share/go/pkg/mod/github.com/cfoust/cy@v1.1.0/cmd/cy/client.go:228 +0x1ce
main.connectCommand()
        /home/ser/.local/share/go/pkg/mod/github.com/cfoust/cy@v1.1.0/cmd/cy/connect.go:88 +0x71b
main.main()
        /home/ser/.local/share/go/pkg/mod/github.com/cfoust/cy@v1.1.0/cmd/cy/main.go:110 +0x4b6

How to replicate:

  1. Start a terminal
  2. (In my case, I tried to exit, which hung, so I force-closed the terminal, which left a cy process behind. Since I was trying to change the default key binding to one that wouldn't conflict with readline, I wanted to exit the cy server, so ...) Kill the cy process with SIGKILL (killall -9 cy)
  3. Try to start cy again

On my system, at least, cy will not start again, and it gives no information about what needs to be cleaned up to make it work.

cy could crash, or be killed; the session could crash; any number of things could happen that prevent cy from shutting down cleanly. I think what's needed is for cy, rather than just dumping a panic, to give the user some information about what they need to do (or might need to do) to resolve this.

Environment:

OS: Linux
Terminal: wezterm
cy: installed with go install github.com/cfoust/cy/cmd/cy@latest

Edit: in cleaning the copy/paste to remove shell decorations (PROMPT_RIGHT), I initially deleted the y in cy and "fixed" it with an f. This has been corrected.

xxxserxxx commented 1 week ago

The work-around is to delete /tmp/cy-NNNN, which appears to be the default socket.

I haven't dug into the code deeply; it does look like the code is, in some cases, trying to remove stale sockets. In this case, that branch is either not being hit, or it's failing.

A good solution would be, when this error is hit, to test whether cy is actually running and, if not, to clean up and retry. Failing that, printing a message suggesting that the user remove the socket, if they're sure cy isn't running, would be helpful.
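
Roughly what I have in mind -- illustrative only; removeIfStale and the dial-based liveness check are my own sketch, not cy's actual code:

    package sketch

    import (
        "errors"
        "fmt"
        "net"
        "os"
        "syscall"
    )

    // removeIfStale dials the Unix socket; if the file exists but nothing
    // is listening, it's a leftover from a dead server and safe to delete.
    func removeIfStale(path string) error {
        conn, err := net.Dial("unix", path)
        if err == nil {
            conn.Close() // a live server answered; don't touch anything
            return fmt.Errorf("cy is already running on %s", path)
        }
        if errors.Is(err, syscall.ECONNREFUSED) {
            return os.Remove(path) // stale: socket file with no server behind it
        }
        if errors.Is(err, os.ErrNotExist) {
            return nil // nothing to clean up
        }
        return err
    }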

cfoust commented 1 week ago

Thank you for submitting such a detailed report! I'm sorry this is giving you trouble; I'll look into it as soon as I'm able to. I've experienced this myself, but only inconsistently, so I've been waiting to see (a) whether other users run into it and (b) whether there's a reliable way to reproduce it. At the very least, the error message could be improved.

xxxserxxx commented 1 week ago

NP. I think an easy way to reproduce it is to start cy, open a new terminal, and killall the cy process. It'll leave behind a stale socket and prevent cy from being restarted. I imagine a panic anywhere else in the program -- however hard to reproduce or unlikely that may be -- would leave cy in the same state.

cfoust commented 6 days ago

I spent a couple of hours today playing around with this, and it's a bit more subtle than it seems. I have not been able to reproduce it reliably on macOS; using killall to kill cy (even with SIGKILL) still does not hinder restarting the server. It's possible this issue is more salient on Linux, so I'll have to give it a try there when I get a chance. The lock files that cause this to happen are unrelated to the cy server itself; they're just used by the daemonization library cy depends on. Theoretically a crash while these lock files are being created could cause them not to be released properly, but even by inducing panics in the relevant branches I couldn't get that to happen.

To sum it up, it's going to take some more experimentation to figure out the circumstances in which this can happen (assuming my first attempt on Linux isn't fruitful). As a result, I'm not going to block the 1.2.0 release on this.

I'd be interested in hearing more about your initial report, though. You mentioned that cy hung when you tried to quit the server, is that correct?

xxxserxxx commented 5 days ago

The socket cy creates on Linux is just a FIFO pipe; if the process is killed, there's no magic that cleans it up, and killing the process with SIGKILL will always leave it behind -- especially since cy sets up no signal trap (which would have to be done per-OS anyway). Now, I only dug lightly into the code, but if a lock socket is left behind, then the cy server process won't start, and neither will cy. IMHO the issue isn't how often cy will encounter a stale socket, but how it reacts when it does: right now, it panics.
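
To illustrate the signal-trap point (my own sketch, not cy's code): os/signal can catch SIGINT/SIGTERM portably, but SIGKILL can never be caught, which is exactly why a stale file is always possible:

    package sketch

    import (
        "os"
        "os/signal"
        "syscall"
    )

    // cleanupOnSignal removes the socket when cy gets SIGINT or SIGTERM.
    // SIGKILL cannot be caught, so this can only reduce, not eliminate,
    // the stale-socket problem.
    func cleanupOnSignal(socketPath string) {
        ch := make(chan os.Signal, 1)
        signal.Notify(ch, syscall.SIGINT, syscall.SIGTERM)
        go func() {
            <-ch
            os.Remove(socketPath)
            os.Exit(1)
        }()
    }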

I don't think the solution is to bend over backwards trying to make sure cy exits cleanly -- that would be nice, but there's always the possibility of a crash at some point. What I believe is needed is code that fails nicely when the user tries to start cy and the socket is stale, whether by try/catch or some cleverer mechanism. try/catch would be easiest.
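
Go's closest analogue to try/catch is recover in a deferred function. A purely illustrative sketch -- startServerSafely is my name, and the stub here stands in for cy's real startup path (main.startServer in the trace above):

    package sketch

    import "fmt"

    // Stub standing in for cy's real startup path, which currently
    // panics when the stale lock is hit.
    func startServer(socketPath string) {
        panic("daemon: Resource temporarily unavailable")
    }

    // startServerSafely converts that panic into an actionable error.
    func startServerSafely(socketPath string) (err error) {
        defer func() {
            if r := recover(); r != nil {
                err = fmt.Errorf("failed to start server: %v; if you are sure cy is not running, try removing %s", r, socketPath)
            }
        }()
        startServer(socketPath)
        return nil
    }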

To reproduce this, can you manually create a socket file on macOS when cy is not running and then try to start cy? You should see the panic. I can do this on Linux by calling mkdir /tmp/cy-$(id -u) ; mkfifo /tmp/cy-$(id -u)/default. Although, of course, now that I'm trying to reproduce it, I can't :-/

cfoust commented 5 days ago

Took a closer look at this tonight and I found a solution. It turned out that this is unrelated to the cy server's socket.

The procedure cy uses to start up a server looks like this (a rough Go sketch follows the list):

  1. Obtain a lock (basically using flock) on an adjacent lock file used to ensure that multiple cy processes do not attempt to start a server with the same identity (e.g. default) simultaneously.
  2. Start a new, separate daemonized cy process that will assume the role of the server.
  3. In the child process that will become the server, create the Unix domain socket used for IPC.
  4. In the parent process, remove the lock file.
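
In rough Go, with hypothetical names (the real logic is split between cmd/cy and go-daemon):

    package sketch

    import (
        "fmt"
        "os"
        "syscall"
    )

    func startServer(socketPath string) error {
        // 1. Exclusive, non-blocking lock on an adjacent lock file so two
        //    cy processes can't race to start the same server identity.
        lock, err := os.OpenFile(socketPath+".lock", os.O_CREATE|os.O_RDWR, 0o644)
        if err != nil {
            return err
        }
        if err := syscall.Flock(int(lock.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
            return fmt.Errorf("another cy is already starting this server: %w", err)
        }

        // 2. Daemonize: fork a child that assumes the role of the server
        //    (go-daemon does this part; elided here).

        // 3. In the child, bind the Unix domain socket used for IPC:
        //    net.Listen("unix", socketPath)

        // 4. In the parent, release and remove the lock file.
        syscall.Flock(int(lock.Fd()), syscall.LOCK_UN)
        lock.Close()
        return os.Remove(socketPath + ".lock")
    }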

The error you provided (Resource temporarily unavailable) comes from the daemonization library (go-daemon) used in step 2, specifically this line. That block creates a flocked .pid file adjacent to the cy socket containing the process ID of the server. When the flock would block, the parent process -- i.e. the terminal where you originally ran cy -- gets the Resource temporarily unavailable (EAGAIN) error.
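
You can reproduce that error in isolation with two non-blocking flocks on the same file; this standalone demo (not cy's code) prints the same message:

    package main

    import (
        "fmt"
        "os"
        "syscall"
    )

    func main() {
        // Two separate opens create two file descriptions, so the second
        // non-blocking flock conflicts with the first, exactly as a
        // second process would.
        a, _ := os.OpenFile("/tmp/demo.pid", os.O_CREATE|os.O_RDWR, 0o644)
        _ = syscall.Flock(int(a.Fd()), syscall.LOCK_EX|syscall.LOCK_NB)

        b, _ := os.OpenFile("/tmp/demo.pid", os.O_CREATE|os.O_RDWR, 0o644)
        err := syscall.Flock(int(b.Fd()), syscall.LOCK_EX|syscall.LOCK_NB)
        fmt.Println(err) // resource temporarily unavailable (EAGAIN)
    }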

When I first used go-daemon I unthinkingly enabled this PID file feature, thinking it was a nice-to-have that was unlikely to cause issues. In practice it relies on the server process exiting cleanly (and thus running the defer that cleans up the lock), which is obviously a problem: if the server dies uncleanly, the lock on the PID file never gets freed.

So the commit I linked just disables this "feature". Perhaps someday it will be useful to restore a (non-locked) .pid file adjacent to the cy socket, but nothing in cy reads from that file; it was just there for user convenience.
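
If that convenience ever comes back, a lock-free version is trivial and can't wedge startup; a sketch with a hypothetical writePidFile:

    package sketch

    import (
        "os"
        "strconv"
    )

    // writePidFile is write-only convenience: nothing reads the file and
    // no lock is taken, so a stale copy can never block a restart.
    func writePidFile(socketPath string) error {
        pid := []byte(strconv.Itoa(os.Getpid()))
        return os.WriteFile(socketPath+".pid", pid, 0o644)
    }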

In addition, that Panic() really shouldn't have been there; I'm of the mind that panics are mostly a code smell. It was just an oversight that I hadn't corrected after cleaning up cy's subcommands a while back.

This change will be released in cy version 1.3.0. I'd appreciate it if you tried it out early to ensure it's fixed! You can do so by checking out the release/v1.3.0 branch and running go install ./cmd/cy/....

xxxserxxx commented 1 day ago

I must have been looking at the wrong line, or maybe I was looking at the wrong code version -- but I'm glad you found it, and thank you so much for diligently looking for the issue. These hard-to-replicate sorts of things are a PITA to track down.

I agree on panics, obvs; however, I use them all the time, especially when first hacking out a structure and figuring out how I want everything to fit together. And sometimes one gets missed in the clean-up. But what else are you going to do? Trying to write everything perfectly from the start, you can easily get lost in the weeds, miss the big picture, and spend many hours on code you're going to throw away in a later refactor. Anyway, commendable job tracking this down, and thanks again!