air-verse / air

☁️ Live reload for Go apps
GNU General Public License v3.0
16.35k stars 771 forks source link

Implement graceful shutdown on Mac #534

Closed altanbgn closed 1 week ago

altanbgn commented 4 months ago

I usually unconsciously do Ctrl-C and I found this what I consider a bug or maybe it could just because of my code. I'm not really sure. So when the server is up and running and everything goes perfect right? and when I exit with Ctrl-C it does not finish the exiting process causing port already in use on next start.

2024/02/23 10:50:34 ⇨ MongoDB Connected!

   ____    __
  / __/___/ /  ___
 / _// __/ _ \/ _ \
/___/\__/_//_/\___/ v4.11.4
High performance, minimalist Go web framework
https://echo.labstack.com
____________________________________O/_______
                                    O\
⇨ http server started on [::]:1323
^Ccleaning...
see you again~
❯ air

  __    _   ___
 / /\  | | | |_)
/_/--\ |_| |_| \_ v1.50.0, built with Go go1.22.0

watching .
watching cmd
watching cmd/aion
watching internal
watching internal/cmd
watching internal/db
watching internal/handlers
watching internal/middlewares
watching internal/models
watching internal/routes
watching internal/utils
!exclude tmp
building...
running...
2024/02/23 10:50:38 ⇨ MongoDB Connected!

   ____    __
  / __/___/ /  ___
 / _// __/ _ \/ _ \
/___/\__/_//_/\___/ v4.11.4
High performance, minimalist Go web framework
https://echo.labstack.com
____________________________________O/_______
                                    O\
{"time":"2024-02-23T10:50:38.557171+08:00","level":"FATAL","prefix":"echo","file":"api.go","line":"38","message":"listen tcp :1323: bind: address already in use"}
Process Exit with Code: 1

My code on graceful shutdown

    // Wait for interrupt signal to gracefully shutdown the server with a timeout of 10 seconds.
    // Use a buffered channel to avoid missing signals as recommended for signal.Notify
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, os.Interrupt)
    <-quit
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // Shutdown server
    if err := e.Shutdown(ctx); err != nil {
        e.Logger.Fatal(err)
    }

    // Disconnect from MongoDB
    if err := db.Client.Disconnect(context.Background()); err != nil {
        log.Fatal("⇨ Error disconnecting to MongoDB: ", err)
        log.Fatal(err)
    }
altanbgn commented 4 months ago

I am fairly new to the language with 3 months experience and I diagnosed that the context.WithTimeout is causing it. I tried it with context.TODO() and it works fine now

hamza72x commented 4 months ago

Had similar issue, just use pkill and update your .air.toml like this.

root = "."
testdata_dir = "testdata"
tmp_dir = "tmp"

[build]
  bin = "./tmp/my-api-go"
  cmd = "pkill my-api-go; go build -o ./tmp/my-api-go ."
  delay = 1000
  exclude_dir = ["assets", "tmp", "vendor", "testdata", "node_modules"]
  exclude_file = []
  exclude_regex = ["_test.go"]
  exclude_unchanged = false
  follow_symlink = false
  full_bin = ""
  include_dir = []
  include_ext = ["go", "tpl", "tmpl", "html"]
  kill_delay = "0s"
  log = "build-errors.log"
  send_interrupt = false
  stop_on_error = true

[color]
  app = ""
  build = "yellow"
  main = "magenta"
  runner = "green"
  watcher = "cyan"

[log]
  time = false

[misc]
  clean_on_exit = false

[screen]
  clear_on_rebuild = false
blaskovicz commented 4 months ago

@hamza72x I have similar settings but I'm seeing different behavior (the ^C is being ignored on MacOS).

When I run my server locally I see it exit from an interrupt as expected:

I0304 12:51:10.617018   47315 server.go:142] updating pod info
I0304 12:51:10.617172   47315 server.go:108] starting server at port 7442
^C
I0304 12:51:14.062000   47315 server.go:67] quitting server
E0304 12:51:14.062116   47315 server.go:59] http: Server closed

Doing the same thing with air i see:

 / /\  | | | |_) 
/_/--\ |_| |_| \_ v1.51.0, built with Go go1.22.0

watching .
watching cmd
watching cmd/web
!exclude dist
watching internal
watching internal/api
!exclude node_modules
watching pkg
!exclude public
!exclude src
!exclude tmp
building...
running...
I0304 12:52:50.795311   48146 server.go:142] updating pod info
I0304 12:52:50.795335   48146 server.go:108] starting server at port 7442
^C
cleaning...
see you again~

but when I check lsof i still see the process running and bound to a port (and it still logs to my local terminal):

COMMAND   PID     USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
web     48146 me    3u  IPv6 0xba380d8c0d1589c5      0t0  TCP *:7442 (LISTEN)

here is my config

testdata_dir = "testdata"
tmp_dir = "tmp"

[build]
  args_bin = []
  bin = "./web"
  cmd = "go build -o ./web ./cmd/web/main.go"
  delay = 1000
  exclude_dir = ["dist", "assets", "tmp", "vendor", "testdata", "node_modules", "src", "public"]
  exclude_file = []
  exclude_regex = ["_test.go"]
  exclude_unchanged = false
  follow_symlink = false
  full_bin = ""
  include_dir = []
  include_ext = ["go", "tpl", "tmpl", "html"]
  include_file = []
  kill_delay = "0s"
  log = "build-errors.log"
  poll = false
  poll_interval = 0
  post_cmd = []
  pre_cmd = []
  rerun = false
  rerun_delay = 500
  send_interrupt = true
  stop_on_error = true

[color]
  app = ""
  build = "yellow"
  main = "magenta"
  runner = "green"
  watcher = "cyan"

[log]
  main_only = false
  time = false

[misc]
  clean_on_exit = false

[screen]
  clear_on_rebuild = false
  keep_scroll = true
leandergangso commented 4 months ago

The key here is to set send_interrupt to true in your air.toml. only then will Ctrl+c behave as expected.

nimajalali commented 4 months ago

I'm able to reproduce this error even when send_interrupt is set to true. There seems to be a race condition when shutting down air using Ctrl+c.

When air is running the bin and a ctrl+c comes in cleanup() is executed. At this line in cleanup we call to have the bin stopped. When this call happens we aren't waiting for it to finish. https://github.com/cosmtrek/air/blob/master/runner/engine.go#L582-L586

The air program typically exits before it has the chance to stop the bin and get to https://github.com/cosmtrek/air/blob/master/runner/engine.go#L521-L522

sa- commented 3 months ago

I can confirm that the server is not killed even when send_interrupt is set to true

taylow commented 3 months ago

This is also happening to me using ZSH. I have send_interrupt set to true and have tried a variety of configs but am still left with the process running after see you again~.

As a bit of a work around until this is resolved, I have set the following config to automatically kill the process using a specific port:

post_cmd = ["lsof -i tcp:8080 | awk 'NR==2{print $2}' | xargs kill"]
figuerom16 commented 3 months ago

I have a 5 second delay for connections to close on graceful shutdown. If you're listening for an interrupt then be sure to set a kill_delay as well otherwise air will close faster than your cleanup process.

# Send Interrupt signal before killing process (windows does not support this feature)
send_interrupt = true
# Delay after sending Interrupt signal
kill_delay = 5000000000 # nanosecond
figuerom16 commented 3 months ago

@altanbgn

signal.Notify(quit, os.Interrupt)

Be sure to capture syscall.SIGTERM also. systemd/docker uses that signal for stop and restart. Save yourself some headaches on deployment.

signal.Notify(quit, os.Interrupt, syscall.SIGTERM)

zeevallin commented 3 months ago

@taylow Improved on your solution a little bit.

I ensure I reap the process with a given listening port before every reload and when you quit air.

post_cmd = [
  "lsof -i :8080 -sTCP:LISTEN | sed 1d | awk '{print $2}' | xargs kill -9"
]
pre_cmd = [
  "lsof -i :8080 -sTCP:LISTEN | sed 1d | awk '{print $2}' | xargs kill -9"
]

This part of the lsof command is particularly important as it prevents your browser or other tools with open or establishing connections to your server also getting reaped.

lsof -sTCP:LISTEN

Hope it helps and I hope this will be fixed soon so we don't have to use a workaround. ❤️

iancharters commented 2 months ago

I was having the same issue as you. I have a docker container running a go service where I listen for SIGINT and SIGTERM after starting the server so I can gracefully shutdown. I was able to resolve with the following:

[build]
...
send_interrupt = true
kill_delay = 1

If you don't set something for kill delay it seems like Air eats the signal.

peterldowns commented 2 months ago

Just writing in to say that even with send_interrupt=true and kill_delay=1, I still see intermitten failure to send the ctrl-c interrupt to my program. I'm using the pre_cmd and post_cmd hacks shared above (thank you!) but it's annoying.

figuerom16 commented 2 months ago

@peterldowns @iancharters

Referring to the example toml https://github.com/cosmtrek/air/blob/master/air_example.toml kill_delay operates in nanoseconds! With a value of 1 you're giving your program maybe CPU 4 clock cycles to close up your connections and program.

At least set your delay to 100 milliseconds | kill_delay = 100000000

Just timing my DB and app close takes nearly 500 microseconds so 100 milliseconds is pretty safe for Testing/Development.

EDIT: Here is my gracefull shutdown portion for reference.

...
// Server Run
go func() {
    if err := server.ListenAndServe(); err != nil {
        log.Println(err)
    }
}()
log.Println("Server Running: http://127.0.0.1" + port)

// Graceful shutdown
c := make(chan os.Signal, 1)
signal.Notify(c, os.Interrupt, syscall.SIGTERM)
<-c
start := time.Now()
log.Println("Gracefully shutting down...")
server.Shutdown(ctx)
data.Close()
log.Printf("Gracefully closed %s.", time.Since(start))

Output: main.go has changed building... 2024/04/24 13:39:17 Gracefully shutting down... 2024/04/24 13:39:17 http: Server closed 2024/04/24 13:39:17 Database closed 2024/04/24 13:39:17 Gracefully closed 364.905µs running... 2024/04/24 13:39:18 Server Running: http://127.0.0.1:3000

peterldowns commented 2 months ago

@figuerom16 I've tried delay values of 5000000000 as well as "5000ms" but it simply does not delay.

figuerom16 commented 2 months ago

@peterldowns Are you on the latest air also? My air -v is v1.51.0, built with Go go1.22.2. Is there sample code I can try? Are you doing anything special to build? I want to see if I can replicate the issue or if there is a service that isn't closing correctly.

peterldowns commented 2 months ago

@figuerom16 I'll set up a minimal reproduction case and share it here within the next day or so. Sorry for not doing so earlier, I realize it's hard to debug without an easy way to reproduce the problem. I'm using air installed via nix, but it's 1.51.0 built with go1.22.1. I'm on macOS Monterey version 12.5.1 on a M1 pro:

image

❯ air -v

  __    _   ___
 / /\  | | | |_)
/_/--\ |_| |_| \_ (devel), built with Go go1.22.1

❯ which air
/nix/store/1w5403arzypxxb1s2mlaylxmnmlvwjq6-air-1.51.0/bin/air
sethbrasile commented 2 months ago

Could folks having this problem give this a try?

https://github.com/cosmtrek/air/pull/575

mikael-lindstrom commented 2 months ago

@sethbrasile I have the same problem and #575 solved the issue for me 👍

sa- commented 2 months ago

This repo is a reproducible example, if anyone needs: https://github.com/sa-/goth-counter

peterldowns commented 2 months ago

@sethbrasile I can confirm that this fixed the issue for me as well. I tested by cloning your fork, building the air binary, then updating my .air.toml to comment out the pkill shenanigans, and set kill_delay = "5000ms". Now, when I hit ctrl-c, air propagates the signal to my server, and the server has time to exist cleanly. Hooray!

One suggestion — can you make it so that if the backend server has terminated, air will stop waiting? For instance:

peterldowns commented 2 months ago

Also, thank you for the fix!

ndajr commented 1 month ago

I had some issues with this as well. I tried your branch @sethbrasile, it helps but doesn't fix it entirely for me, there were times I had to manually kill the process. This code here could be more resilient: https://github.com/cosmtrek/air/blob/master/runner/util_linux.go#L12. It's ignoring some errors, and in my opinion the delay between SIGINT and SIGKILL should be the default, now the user has to know about the kill_delay config. And as @peterldowns mentioned, this ideally shouldn't be a time.Sleep with a fixed interval being idle but rather a timeout. To implement this we could signal SIGINT and spin up a go routine checking if the PID still exists, once it doesn't we signal on the procKilledCh

gbrlmza commented 1 month ago

I had the same issue and was able to permanently fix it by adding:

post_cmd = ["pkill <BINARY_NAME>"]
ndajr commented 1 month ago

pkill uses SIGTERM and also Docker kills containers with SIGTERM. Maybe we can switch from SIGINT to SIGTERM, which is a signal most likely applications are handling already, and a default timeout before SIGKILL

silverwind commented 1 month ago

Wild guess, but it seems that upgrading MacOS from 14.4.1 to 14.5 seems to have resolved this for me. Can anyone else confirm?

Previously I reproduced this problem of failing to kill process with SIGINT/SIGTERM with both air and modd, so I don't think air is to blame specifically.

justinliang1020 commented 1 month ago

I temporarily workaround this by using an older version of air (v1.46.0) and the graceful shutdown on mac works for me

silverwind commented 1 month ago

Actually no, I'm again affected on latest MacOS. Will try to downgrade again, but from cursory tests before, downgrade did not help.

piti118 commented 3 weeks ago

Very sad. I gave up and move to CompileDaemon ... this tool fails at the most basic use case for shutting down and starting again without conflicting port.

nakkamarra commented 1 week ago

I have a feeling this is related to Apple Silicon, on my Intel Macbook I don't experience this issue but colleagues on the same version of air with the same air.toml who have M1/M2/etc will experience an orphaned process on exit.

xiantang commented 1 week ago

https://github.com/air-verse/air/pull/619 try this PR, should be fix try: go install github.com/air-verse/air@360714a021b1b77e50a5656fefc4f8bb9312d328

xiantang commented 1 week ago

can someone help me test my PR please?

silverwind commented 1 week ago

I have a feeling this is related to Apple Silicon, on my Intel Macbook I don't experience this issue but colleagues on the same version of air with the same air.toml who have M1/M2/etc will experience an orphaned process on exit.

Same experience here. I get the bug all the time on my M3 Mac, but have not heard about it on Intel Macs so far.

xiantang commented 1 week ago

I have a feeling this is related to Apple Silicon, on my Intel Macbook I don't experience this issue but colleagues on the same version of air with the same air.toml who have M1/M2/etc will experience an orphaned process on exit.

Same experience here. I get the bug all the time on my M3 Mac, but have not heard about it on Intel Macs so far.

try my PR, go install github.com/air-verse/air@360714a021b1b77e50a5656fefc4f8bb9312d328 in this PR, will ensure your binary has been killed before exit air

nakkamarra commented 1 week ago

Hey @xiantang, I'll try it out now

nakkamarra commented 1 week ago

@xiantang Yeah, that build seems to consistently work on my M2 now

xiantang commented 1 week ago

try latest version

silverwind commented 1 week ago

Looks good so far, I started and stopped my app 10 times and it cleanly terminated it every time. Holding my breath to say it is really fixed, time will tell 😉.

peterldowns commented 1 week ago

Thanks @xiantang, the fixing PR looks correct to me and testing with go install github.com/air-verse/air@360714a021b1b77e50a5656fefc4f8bb9312d328 shows that everything works for me. Nice work!

ainsleyclark commented 21 hours ago

Still seems to be an issue for me unfortunately with the new version:

image
  __    _   ___  
 / /\  | | | |_) 
/_/--\ |_| |_| \_ v1.52.3, built with Go go1.22.5
root = "."
testdata_dir = "testdata"
tmp_dir = "tmp"

## See: https://github.com/cosmtrek/air/issues/534
## See: https://github.com/cosmtrek/air/pull/570

[build]
  args_bin = []
  bin = "./tmp/main"
  cmd = "go build -o ./tmp/main ."
  delay = 0
  exclude_dir = ["assets", "tmp", "vendor", "testdata", "node_modules", "src"]
  exclude_file = []
  exclude_regex = ["_test.go", "_templ.go"]
  exclude_unchanged = false
  follow_symlink = false
  full_bin = ""
  include_dir = []
  include_ext = ["go", "tpl", "tmpl", "templ", "html"]
  include_file = []
  kill_delay = "500ms"
  log = "build-errors.log"
  poll = false
  poll_interval = 0
  post_cmd = [
  ]
  pre_cmd = [

  ]
  rerun = false
  rerun_delay = 500
  send_interrupt = true
  stop_on_error = true

[color]
  app = ""
  build = "yellow"
  main = "magenta"
  runner = "green"
  watcher = "cyan"

[log]
  main_only = false
  time = false

[misc]
  clean_on_exit = false

[proxy]
  app_port = 0
  enabled = false
  proxy_port = 0

[screen]
  clear_on_rebuild = false
  keep_scroll = true