cloudflare / tableflip

Graceful process restarts in Go
BSD 3-Clause "New" or "Revised" License
2.91k stars 148 forks source link

Upgrade shutting down early for http2 connections but not http1.1 #12

Closed chrispassas closed 6 years ago

chrispassas commented 6 years ago

I'm running a web process on FreeBSD. Using curl to connect over http2 when I call SIGHUP on the process it shuts down and drops any connections without waiting for them to complete.

If I do the same but use the curl flag '--http1.1' the running connections complete before shutdown is called.

Any idea why http2 would not wait for the connections to complete? While http1.1 connections would?

Thank you,

chrispassas commented 6 years ago

For what its worth I tested using a golang client and could not reproduce the issue. So many its something to do with how curl negotiates the connection?

lmb commented 6 years ago

What do you mean when you say "drops connections"? Command line output would be useful.

A quick google points towards http/2 having support for a message which indicates that a connection should not be used anymore (GOAWAY frame). HTTP/1 doesn't have such a thing (sadly) so might take longer to close.

chrispassas commented 6 years ago

I've got a simple example that reproduces something similar.

  1. Build and start go process
  2. curl -k https://localhost:8080/test
  3. kill -s sighup PROCESSID

This is attempting to start a request that will take 20 seconds. While that request is running we run a sighup to trigger the upgrade. If this is done over http2/ssl using curl it interrupts the curl to localhost:8080/test.

If you do this without TLS or without curl it works.

I've only tried it using curl and Go as the clients. Go worked but curl did not.

curl error: "curl: (56) Unexpected EOF"

My curl version if it matters

curl -V curl 7.54.0 (x86_64-apple-darwin18.0) libcurl/7.54.0 LibreSSL/2.6.4 zlib/1.2.11 nghttp2/1.24.1 Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp smb smbs smtp smtps telnet tftp Features: AsynchDNS IPv6 Largefile GSS-API Kerberos SPNEGO NTLM NTLM_WB SSL libz HTTP2 UnixSockets HTTPS-proxy

`

var pidFile = "pid"
var listenAddr = "localhost:8080"

flag.Parse()
log.SetPrefix(fmt.Sprintf("%d ", os.Getpid()))

upg, err := tableflip.New(tableflip.Options{
    PIDFile: pidFile,
})
if err != nil {
    panic(err)
}
defer upg.Stop()

// Do an upgrade on SIGHUP
go func() {
    sig := make(chan os.Signal, 1)
    signal.Notify(sig, syscall.SIGHUP)
    for range sig {
        err := upg.Upgrade()
        if err != nil {
            log.Println("Upgrade failed:", err)
        }
    }
}()

ln, err := upg.Fds.Listen("tcp", listenAddr)
if err != nil {
    log.Fatalln("Can't listen:", err)
}

server := http.Server{
    // Set timeouts, etc.
}

http.HandleFunc("/test", func(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintf(w, "Hello\n")
    time.Sleep(20 * time.Second)
})

log.Printf("pid:%d", os.Getpid())

go func() {
    err := server.ServeTLS(ln, "/Users/cpassas/go/src/silkproxy_releases/http.crt", "/Users/cpassas/go/src/silkproxy_releases/http.key")
    if err != http.ErrServerClosed {
        log.Println("HTTP server:", err)
    }
}()

log.Printf("ready")
if err := upg.Ready(); err != nil {
    panic(err)
}
<-upg.Exit()

// Make sure to set a deadline on exiting the process
// after upg.Exit() is closed. No new upgrades can be
// performed if the parent doesn't exit.
time.AfterFunc(30*time.Second, func() {
    log.Println("Graceful shutdown timed out")
    os.Exit(1)
})

// Wait for connections to drain.
server.Shutdown(context.Background())

`

lmb commented 6 years ago

I see the same behaviour as well, on Go 1.11.

[  0.006] Connected
The negotiated protocol: h2
[  0.022] recv SETTINGS frame <length=24, flags=0x00, stream_id=0>
          (niv=4)
          [SETTINGS_MAX_FRAME_SIZE(0x05):1048576]
          [SETTINGS_MAX_CONCURRENT_STREAMS(0x03):250]
          [SETTINGS_MAX_HEADER_LIST_SIZE(0x06):1048896]
          [SETTINGS_INITIAL_WINDOW_SIZE(0x04):1048576]
[  0.022] send SETTINGS frame <length=12, flags=0x00, stream_id=0>
          (niv=2)
          [SETTINGS_MAX_CONCURRENT_STREAMS(0x03):100]
          [SETTINGS_INITIAL_WINDOW_SIZE(0x04):65535]
[  0.022] send SETTINGS frame <length=0, flags=0x01, stream_id=0>
          ; ACK
          (niv=0)
[  0.022] send PRIORITY frame <length=5, flags=0x00, stream_id=3>
          (dep_stream_id=0, weight=201, exclusive=0)
[  0.022] send PRIORITY frame <length=5, flags=0x00, stream_id=5>
          (dep_stream_id=0, weight=101, exclusive=0)
[  0.022] send PRIORITY frame <length=5, flags=0x00, stream_id=7>
          (dep_stream_id=0, weight=1, exclusive=0)
[  0.022] send PRIORITY frame <length=5, flags=0x00, stream_id=9>
          (dep_stream_id=7, weight=1, exclusive=0)
[  0.022] send PRIORITY frame <length=5, flags=0x00, stream_id=11>
          (dep_stream_id=3, weight=1, exclusive=0)
[  0.022] send HEADERS frame <length=43, flags=0x25, stream_id=13>
          ; END_STREAM | END_HEADERS | PRIORITY
          (padlen=0, dep_stream_id=11, weight=16, exclusive=0)
          ; Open new stream
          :method: GET
          :path: /test
          :scheme: https
          :authority: localhost:8080
          accept: */*
          accept-encoding: gzip, deflate
          user-agent: nghttp2/1.30.0
[  0.022] recv WINDOW_UPDATE frame <length=4, flags=0x00, stream_id=0>
          (window_size_increment=983041)
[  0.022] recv SETTINGS frame <length=0, flags=0x01, stream_id=0>
          ; ACK
          (niv=0)

I think the server should send a GOAWAY frame in this case. Adding a sleep after Shutdown() has returned does not fix this, so it's not just a case of not waiting long enough.

chrispassas commented 6 years ago

@lmb do you believe this might be a bug in tableflip? Or something that can be addressed in the example code the readme provides?

lmb commented 6 years ago

I don't think this is a tableflip bug, mainly because there is nothing specific to http2 in the library. I've run the test case a few more times, most of the time I get a GOAWAY, but sometimes the connection just drops. This seems like a race condition in net/http2 of some sort. I need to find a reliable way to reproduce this to figure out what is going on.

chrispassas commented 6 years ago

It may not be a tableflip bug but the example uses the http service. I think most would infer that exact code will work.

Maybe the example can be updated so that it will work with HTTP 2 but I’m not sure if that’s possible.

It does limit the usefulness if this can’t be used with go HTTP processes especially since you can’t force it to use HTTP 1 all the time.

lmb commented 6 years ago

I've repeated the test using curl and snooped the traffic using wireshark. Curl says:

$ SSLKEYLOGFILE=tlskeys curl -kv --http2-prior-knowledge https://localhost:8080/test
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8080 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=AU; ST=Some-State; O=Internet Widgits Pty Ltd; CN=localhost
*  start date: Oct 26 14:59:20 2018 GMT
*  expire date: Oct 26 14:59:20 2019 GMT
*  issuer: C=AU; ST=Some-State; O=Internet Widgits Pty Ltd; CN=localhost
*  SSL certificate verify result: self signed certificate (18), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55b2823fc8e0)
> GET /test HTTP/2
> Host: localhost:8080
> User-Agent: curl/7.58.0
> Accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
* TLSv1.2 (IN), TLS alert, Client hello (1):
* Unexpected EOF
* Connection #0 to host localhost left intact
curl: (56) Unexpected EOF

From wireshark it's clear that GOAWAY is sent:

HyperText Transfer Protocol 2
    Stream: GOAWAY, Stream ID: 0, Length 8
        Length: 8
        Type: GOAWAY (7)
        Flags: 0x00
        0... .... .... .... .... .... .... .... = Reserved: 0x0
        .000 0000 0000 0000 0000 0000 0000 0000 = Stream Identifier: 0
        0... .... .... .... .... .... .... .... = Reserved: 0x0
        .000 0000 0000 0000 0000 0000 0000 0001 = Promised-Stream-ID: 1
        Error: NO_ERROR (0)

Seems like this is a bug in curl, which was fixed in curl/curl#2510 / 7.60. Go net/http in combination with tableflip is doing the right thing here.

Re-reading your first comment it seems like you expect Go to wait 20s until exiting, the same as with HTTP/1?

chrispassas commented 6 years ago

My example code the 20 seconds is just to simulate an HTTP request doing some work and finishing even after the sighup signal.

If this is just a curl bug that would be great. I was just testing tableflip using curl and ran into this issue. Similar to you I tried my test using the Go http client and could not reproduce the problem.

My local curl version is "curl 7.54.0" so it would fall into the version affected by the bug you posted.

I think we can close this unless someone comes up with another way to reproduce using a newer curl or go client. Agree?

lmb commented 6 years ago

Well, I think your test points out an interesting difference between H1 and H2. H1 will block until the handler has exited, while H2 "aborts" the handler. This is potentially a bug in the go standard library, do you want to raise an issue with them?

In the meantime I'll close this issue.