cloudflare / tableflip

Graceful process restarts in Go
BSD 3-Clause "New" or "Revised" License
2.91k stars 148 forks source link

sometimes got error like "listen tcp xxx: bind: address already in use" when Upgrade #29

Closed zhiqiangxu closed 5 years ago

zhiqiangxu commented 5 years ago

I think the key of using tableflip is to replace net.Listen with upg.Fds.Listen, so that when Upgrade the listening socket will be inherited by child.

But I've got errors like below when Upgrade after the application has run for quite long in production environment.

{"level":"error","msg":"ListenAndServe err can't create new listener: listen tcp 0.0.0.0:8902: bind: address already in use","time":"2019-05-31T23:05:14+08:00"}

It seems that the parent has opened a listener on port 8902 but the child doesn't inherit that listener. Any possible reason?

lmb commented 5 years ago

I think the key of using tableflip is to replace net.Listen with upg.Fds.Listen, so that when Upgrade the listening socket will be inherited by child.

That sounds about right.

Can you share the snippet that does the listen and then serve? Looks like you're using net/http, are you calling ListenAndServe directly?

zhiqiangxu commented 5 years ago

I'm listening with upg.Fds.Listen and then pass the net.Listener to serve function, I do the same for both http server and my private tcp server:

        ...
        upg, err := tableflip.New(tableflip.Options{PIDFile: pidFile})
    if err != nil {
        panic(err)
    }
    defer upg.Stop()
        ....
    httpServer := &http.Server{Addr: httpAddr}
    ln, err := upg.Fds.Listen("tcp", httpAddr)
    if err != nil {
        instance.Logger().Errorln("Listen httpAddr err", err)
        return
    }

    var g run.Group
    g.Add(func() error {
                 // upg.Fds.Listen is passed to qserver as func, and called inside ListenAndServe
        err := qserver.ListenAndServe()
        if err != nil {
            instance.Logger().Errorln("ListenAndServe err", err)
        }
        return err
    }, func(err error) {
        shutdownQRPC(qserver)
        instance.Logger().Infoln("shutdownQRPC err", err)
    })
    g.Add(func() error {
                ...apis omited...
        err = httpServer.Serve(ln)
        if err != nil {
            instance.Logger().Errorln("httpServer.Serve err", err)
        }
        return err
    }, func(err error) {
        shutdownHTTP(httpServer)
        instance.Logger().Infoln("httpServer.Shutdown", err)
    })

    go func() {
        sig := make(chan os.Signal, 1)
        signal.Notify(sig, os.Interrupt, os.Kill, syscall.SIGTERM, syscall.SIGUSR1, syscall.SIGQUIT)
        for range sig {
            // upgrade on signal
            instance.Logger().Errorln("GracefulUpgrade", config.Load().GracefulUpgrade)

            instance.Logger().Errorln("upgrade start")
            err := upg.Upgrade()
            instance.Logger().Errorln("upgrade end", err)
            if err != nil {
                instance.Logger().Errorln("upgrade failed:", err)
            }
        }

    }()

    groupDoneCh := make(chan struct{})
    go func() {
        err := g.Run()
        if err != nil {
            instance.Logger().Errorln("Group.Run err", err)
        }
        close(groupDoneCh)
    }()

    if err := upg.Ready(); err != nil {
        instance.Logger().Errorln("upg.Ready err", err)
        return
    }

    instance.Logger().Infoln("child ready to serve")
    // ready to exit
    select {
    case <-upg.Exit():
    case <-groupDoneCh:
    }
    instance.Logger().Errorln("parent prepare for exit")

    shutdownQRPC(qserver)
    shutdownHTTP(httpServer)

And here's the code for qserver.ListenAndServe:

// ListenAndServe starts listening on all bindings
func (srv *Server) ListenAndServe() error {

    var g run.Group

    for i := range srv.bindings {
        idx := i
        binding := srv.bindings[i]
        g.Add(func() error {
            return srv.listenAndServe(idx, binding)
        }, func(err error) {
            serr := srv.Shutdown()
            LogError("err", err, "serr", serr)
        })
    }

    return g.Run()
}

func (srv *Server) listenAndServe(idx int, binding ServerBinding) (err error) {
    var (
        ln net.Listener
    )

    if binding.ListenFunc != nil {
        ln, err = binding.ListenFunc("tcp", binding.Addr)
    } else {
        ln, err = net.Listen("tcp", binding.Addr)
    }
    if err != nil {
        return
    }

    err = srv.Serve(ln.(*net.TCPListener), idx)

    return
}

Here's the complete code:

https://github.com/zhiqiangxu/qchat/blob/master/app/server/common.go

lmb commented 5 years ago

Ah, I think the problem is that you have a race condition between calling Ready in main and Listen in Server.ListenAndServe. Ready closes all inherited-but-unused Fds. The following ordering is causing your problem:

  1. go ListenAndServe()
  2. Ready()
  3. binding.ListenFunc()

Here, binding.ListenFunc won't be able to use the inherited Fd, because it was already closed by Ready. It then tries to do a regular net.Listen, which fails because the old process is still around and holding onto the listening socket.

To fix this I recommend creating all listeners in main() and only then calling Ready.

zhiqiangxu commented 5 years ago

Nice catch! I'll try it out asap:)