mberhault closed this 7 years ago
@mjibson: you're probably the one most familiar with libpq. It could be the culprit, but could just as easily be server-side, or deep in the guts of TLS.
Which version of Go?
go1.7.3 linux/amd64 for the client. The server is at:
Build Tag: beta-20170216
Build Time: 2017/02/16 13:45:12
Distribution: CCL
Platform: linux amd64
Go Version: go1.7.5
C Compiler: gcc 4.9.3
Build SHA-1: 019eeb7b64386a9d751d2055d2af52f24a93ab3d
Build Type: release
There were lots of changes in database/sql in 1.8. Is it easy to test with that to see if we get a repro? I looked briefly at lib/pq and this stack trace but nothing obvious came up.
will do.
Tried again with a client built using go 1.8, same behavior. Additionally, I went and upgraded the cockroach nodes to last week's beta while the client was stuck in the state listed above. It remained stuck while all nodes were down (makes sense, we're not specifying a timeout). After bringing the nodes back up, the client is still stuck.
I should mention that while testing things on my local machine, I encountered no issues. The differences were: single node osx binary, insecure mode, and no oauth-proxy (I don't see how oauth-proxy would make a difference, we're already in the lib/pq code). But insecure might change things.
no issues when trying it locally in secure mode. Also tried using a single node's address in the connection string as opposed to the A record that resolves to all of them. No change.
another thing to note (that one definitely on the server side): the sql connections remain open long after the client has retried with another connection, or even restarted. Not sure who to ping about that. @asubiotto maybe?
Will look more into this ASAP.
so far, I've got it down to this test:
cockroach@shared-0001:~$ ./cockroach sql --url 'postgres://shorty@cockroach-catrina-0001.crdb.io:26257/shorty?sslmode=verify-ca&sslrootcert=certs/ca.crt&sslcert=certs/shorty.client.crt&sslkey=certs/shorty.client.key'
shorty@cockroach-catrina-0001.crdb.io:26257/shorty> SHOW TABLES;
+----------+
| Table |
+----------+
| counters |
| urls |
+----------+
(2 rows)
shorty@cockroach-catrina-0001.crdb.io:26257/shorty>
### Wait 5 minutes.
shorty@cockroach-catrina-0001.crdb.io:26257/shorty> SHOW TABLES;
### Hangs!
Seems to be after ~250s. The closest bracket I have is "not wedged" at 245s, "wedged" at 255s. After that, I'll try with the cluster in insecure mode.
oops. this happens in insecure mode too, same amount of time.
A quick timeline of events, specifically established tcp connections:
Client (public IP 13.77.108.155, local port 47418), shortly after connecting:
Proto Recv-Q Send-Q Local Address      Foreign Address       State
tcp        0      0 192.168.1.4:47418  40.70.216.136:26257   ESTABLISHED
Server:
Proto Recv-Q Send-Q Local Address      Foreign Address       State
tcp6       0      0 192.168.1.4:26257  13.77.108.155:47418   ESTABLISHED
While wedged, client (note the 131 bytes stuck in Send-Q):
Proto Recv-Q Send-Q Local Address      Foreign Address       State
tcp        0    131 192.168.1.4:47418  40.70.216.136:26257   ESTABLISHED
Server:
Proto Recv-Q Send-Q Local Address      Foreign Address       State
tcp6       0      0 192.168.1.4:26257  13.77.108.155:47418   ESTABLISHED
This seems to indicate bad communication between client and server; specifically, the server never ACKs the client's data (hence the client's non-empty Send-Q).
My next step is to try from within the same Azure private network (the client and nodes are on two separate networks, but obviously the firewall allows communication).
client goroutine while the second query is wedged:
goroutine 26 [IO wait]:
net.runtime_pollWait(0x7faee0a97e78, 0x72, 0x3)
/home/marc/go/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xc4201367d8, 0x72, 0x9cf0a0, 0x9cb628)
/home/marc/go/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitRead(0xc4201367d8, 0xc420077000, 0x1000)
/home/marc/go/src/net/fd_poll_runtime.go:80 +0x34
net.(*netFD).Read(0xc420136770, 0xc420077000, 0x1000, 0x1000, 0x0, 0x9cf0a0, 0x9cb628)
/home/marc/go/src/net/fd_unix.go:250 +0x1b7
net.(*conn).Read(0xc42000e0b0, 0xc420077000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/home/marc/go/src/net/net.go:181 +0x70
bufio.(*Reader).Read(0xc42005ab40, 0xc420154020, 0x5, 0x200, 0xc420017300, 0x1, 0x3)
/home/marc/go/src/bufio/bufio.go:213 +0x312
io.ReadAtLeast(0x9ccd20, 0xc42005ab40, 0xc420154020, 0x5, 0x200, 0x5, 0x4110a2, 0xc42011e5a0, 0x20)
/home/marc/go/src/io/io.go:307 +0xa9
io.ReadFull(0x9ccd20, 0xc42005ab40, 0xc420154020, 0x5, 0x200, 0x0, 0xc420026c00, 0x0)
/home/marc/go/src/io/io.go:325 +0x58
github.com/lib/pq.(*conn).recvMessage(0xc420154000, 0xc42011e5a0, 0x7939a0, 0x1, 0xa20e40)
/home/marc/cockroach/src/github.com/lib/pq/conn.go:957 +0x13e
github.com/lib/pq.(*conn).recv1Buf(0xc420154000, 0xc42011e5a0, 0xc420050230)
/home/marc/cockroach/src/github.com/lib/pq/conn.go:1007 +0x39
github.com/lib/pq.(*conn).recv1(0xc420154000, 0x60096f, 0xc42000e0b0)
/home/marc/cockroach/src/github.com/lib/pq/conn.go:1028 +0x85
github.com/lib/pq.(*conn).readParseResponse(0xc420154000)
/home/marc/cockroach/src/github.com/lib/pq/conn.go:1574 +0x2f
github.com/lib/pq.(*conn).prepareTo(0xc420154000, 0x837f00, 0x6e, 0x0, 0x0, 0xc)
/home/marc/cockroach/src/github.com/lib/pq/conn.go:784 +0x5d7
github.com/lib/pq.(*conn).query(0xc420154000, 0x837f00, 0x6e, 0xc420120510, 0x1, 0x1, 0x0, 0x0, 0x0)
/home/marc/cockroach/src/github.com/lib/pq/conn.go:855 +0x34b
github.com/lib/pq.(*conn).QueryContext(0xc420154000, 0x9d2a20, 0xc42001c188, 0x837f00, 0x6e, 0xc4201625d0, 0x1, 0x1, 0x7faee0aa49c0, 0x454d60, ...)
/home/marc/cockroach/src/github.com/lib/pq/conn_go18.go:21 +0x1f1
database/sql.ctxDriverQuery(0x9d2a20, 0xc42001c188, 0x7faee0aa49c0, 0xc420154000, 0x837f00, 0x6e, 0xc4201625d0, 0x1, 0x1, 0x42afce, ...)
/home/marc/go/src/database/sql/ctxutil.go:48 +0x28d
database/sql.(*DB).queryConn.func1()
/home/marc/go/src/database/sql/sql.go:1264 +0x99
database/sql.withLock(0x9cfb20, 0xc420116380, 0xc420045850)
/home/marc/go/src/database/sql/sql.go:2545 +0x65
database/sql.(*DB).queryConn(0xc420138280, 0x9d2a20, 0xc42001c188, 0xc420116380, 0xc420120470, 0x837f00, 0x6e, 0xc420045ac8, 0x1, 0x1, ...)
/home/marc/go/src/database/sql/sql.go:1265 +0x671
database/sql.(*DB).query(0xc420138280, 0x9d2a20, 0xc42001c188, 0x837f00, 0x6e, 0xc420045ac8, 0x1, 0x1, 0x1, 0xc420026c00, ...)
/home/marc/go/src/database/sql/sql.go:1250 +0x12f
database/sql.(*DB).QueryContext(0xc420138280, 0x9d2a20, 0xc42001c188, 0x837f00, 0x6e, 0xc420045ac8, 0x1, 0x1, 0xc420120450, 0xc420045ab8, ...)
/home/marc/go/src/database/sql/sql.go:1227 +0xb8
database/sql.(*DB).Query(0xc420138280, 0x837f00, 0x6e, 0xc420045ac8, 0x1, 0x1, 0x1, 0x3700000000000000, 0x4)
/home/marc/go/src/database/sql/sql.go:1241 +0x82
main.getShortysByOwner(0xc42011e2e0, 0x16, 0x0, 0x0, 0x0, 0x0, 0x0)
/home/marc/cockroach/src/github.com/cockroachdb/examples-go/shorturl/db.go:148 +0x158
main.handleNew(0x9d22a0, 0xc4201600e0, 0xc42000a800)
/home/marc/cockroach/src/github.com/cockroachdb/examples-go/shorturl/handlers.go:92 +0xac
main.handleSettings(0x9d22a0, 0xc4201600e0, 0xc42000a800)
/home/marc/cockroach/src/github.com/cockroachdb/examples-go/shorturl/handlers.go:73 +0x1f2
main.handleRoot(0x9d22a0, 0xc4201600e0, 0xc42000a800)
/home/marc/cockroach/src/github.com/cockroachdb/examples-go/shorturl/handlers.go:41 +0xb7
net/http.HandlerFunc.ServeHTTP(0x8398b0, 0x9d22a0, 0xc4201600e0, 0xc42000a800)
/home/marc/go/src/net/http/server.go:1942 +0x44
net/http.(*ServeMux).ServeHTTP(0xa046c0, 0x9d22a0, 0xc4201600e0, 0xc42000a800)
/home/marc/go/src/net/http/server.go:2238 +0x130
net/http.serverHandler.ServeHTTP(0xc42009ed10, 0x9d22a0, 0xc4201600e0, 0xc42000a800)
/home/marc/go/src/net/http/server.go:2568 +0x92
net/http.(*conn).serve(0xc420138460, 0x9d29e0, 0xc42012ca80)
/home/marc/go/src/net/http/server.go:1825 +0x612
created by net/http.(*Server).Serve
/home/marc/go/src/net/http/server.go:2668 +0x2ce
This only occurs when talking to a node using its public address. E.g. cockroach-catrina-0001.crdb.io resolves to the public address, while cockroach-catrina-0001, when on the same private network (Azure terminology), resolves to the private address. Any attempt to talk to cockroach-catrina-0001.crdb.io fails, be it from outside the private network, within the private network, or even on the same machine. However, using cockroach-catrina-0001 (restricted to the private network) works, both on the same machine and from another. Sample run on cockroach-catrina-0001 itself:
# Against public address
cockroach@cockroach-catrina-0001:~$ ./gotests -sleep 300s 'postgres://shorty@cockroach-catrina-0001.crdb.io:26257/shorty?sslmode=disable'
2017/03/08 22:21:06 Connecting to postgres://shorty@cockroach-catrina-0001.crdb.io:26257/shorty?sslmode=disable
2017/03/08 22:21:06 Executing SHOW DATABASES
2017/03/08 22:21:06 Sleeping 5m0s
2017/03/08 22:26:06 Executing SHOW DATABASES
2017/03/08 22:26:11 Wedged!!!
# Against private address
cockroach@cockroach-catrina-0001:~$ ./gotests -sleep 300s 'postgres://shorty@cockroach-catrina-0001:26257/shorty?sslmode=disable'
2017/03/08 22:26:55 Connecting to postgres://shorty@cockroach-catrina-0001:26257/shorty?sslmode=disable
2017/03/08 22:26:55 Executing SHOW DATABASES
2017/03/08 22:26:55 Sleeping 5m0s
2017/03/08 22:31:55 Executing SHOW DATABASES
2017/03/08 22:31:55 Not wedged
So something's iffy in their firewall. tcpdump is up next.
well crap! https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-tcp-idle-timeout — the Azure load balancer silently drops idle TCP connections after a (configurable) idle timeout, 4 minutes by default.
I can tweak the idle timeout for a given public IP. eg:
$ azure network public-ip set cockroach-catrina cockroach-catrina-0001 -i 5
(the 5 is the idle timeout in minutes).
However, my wedginess detector (TM) still fails below 5 minutes. Could be a config propagation issue.
It looks like this may trigger a few things:
figure out how to do keep-alives in pgwire
Would TCP keepalives work? That's all I could find in the pgwire docs.
they should, I'm testing that now. mongo ran into the same thing: https://docs.mongodb.com/manual/administration/production-notes/#windows-azure-production-notes
this isn't just the CLI, this is any client whatsoever. there are other bugs here too, see the list a few comments earlier.
ok, changing the kernel keepalive settings ($ sysctl -w net.ipv4.tcp_keepalive_time=120) doesn't help; the connection needs to be set up with keepalive enabled. bleh!
Ah, I see; good point. I think we may need to take this upstream with support for keepalives and friends [0].
[0] https://www.postgresql.org/docs/9.3/static/libpq-connect.html#LIBPQ-PARAMKEYWORDS
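For the record, the C libpq (and thus psql) already exposes these as connection-string keywords, per the docs linked above; Go's lib/pq did not at the time. A sketch, with illustrative values:

```shell
# Ask the kernel to probe after 60s idle, every 10s, giving up after 3
# failed probes (keepalives* keywords per the libpq docs above):
psql 'host=cockroach-catrina-0001.crdb.io port=26257 dbname=shorty keepalives=1 keepalives_idle=60 keepalives_interval=10 keepalives_count=3'
```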
Yup. @mjibson seems to have some pull over there :) in the meantime, I'll add this to my long list of "the cloud sucks!" gripes and ping Azure
Heh, I have commit access over there as well.
@tamird why did you close that keep alive PR? Did it not work?
It wasn't general enough; it doesn't fix other clients, say.
Ok, I went for barebones tcp, and the connection indeed gets dropped on the floor, both from a client and server pov.
on shared-0001 (external address: crdb.io), I ran (at the same time):
# capture all packets to/from tcp port 8888
$ sudo tcpdump -i any 'tcp port 8888' -s 65535 -w server.dump
# Basic "echo" service on tcp port 8888
$ ncat -l 192.168.1.4 8888 -k -e /bin/cat
On my machine, I ran (at the same time):
# capture all packets to/from tcp port 8888
sudo tcpdump -i any 'tcp port 8888' -s 65535 -w client.dump
# send stuff to crdb.io on tcp 8888:
$ cat tcp_sleep.sh
echo "starting..."
sleep 255
echo "done..."
$ ./tcp_sleep.sh | ncat crdb.io 8888 -v -v -v
Ncat: Version 7.01 ( https://nmap.org/ncat )
NCAT DEBUG: Using system default trusted CA certificates and those in /etc/ssl/certs/ca-certificates.crt.
libnsock nsock_iod_new2(): nsock_iod_new (IOD #1)
libnsock nsock_connect_tcp(): TCP connection requested to 13.77.108.155:8888 (IOD #1) EID 8
libnsock nsock_trace_handler_callback(): Callback: CONNECT SUCCESS for EID 8 [13.77.108.155:8888]
Ncat: Connected to 13.77.108.155:8888.
libnsock nsock_iod_new2(): nsock_iod_new (IOD #2)
libnsock nsock_read(): Read request from IOD #1 [13.77.108.155:8888] (timeout: -1ms) EID 18
libnsock nsock_readbytes(): Read request for 0 bytes from IOD #2 [peer unspecified] EID 26
libnsock nsock_trace_handler_callback(): Callback: READ SUCCESS for EID 26 [peer unspecified] (12 bytes): starting....
libnsock nsock_write(): Write request for 12 bytes to IOD #1 EID 35 [13.77.108.155:8888]
libnsock nsock_trace_handler_callback(): Callback: WRITE SUCCESS for EID 35 [13.77.108.155:8888]
libnsock nsock_readbytes(): Read request for 0 bytes from IOD #2 [peer unspecified] EID 42
libnsock nsock_trace_handler_callback(): Callback: READ SUCCESS for EID 18 [13.77.108.155:8888] (12 bytes): starting....
starting...
libnsock nsock_readbytes(): Read request for 0 bytes from IOD #1 [13.77.108.155:8888] EID 50
libnsock nsock_trace_handler_callback(): Callback: READ SUCCESS for EID 42 [peer unspecified] (8 bytes): done....
libnsock nsock_write(): Write request for 8 bytes to IOD #1 EID 59 [13.77.108.155:8888]
libnsock nsock_trace_handler_callback(): Callback: WRITE SUCCESS for EID 59 [13.77.108.155:8888]
libnsock nsock_readbytes(): Read request for 0 bytes from IOD #2 [peer unspecified] EID 66
libnsock nsock_trace_handler_callback(): Callback: READ EOF for EID 66 [peer unspecified]
# Just hangs.
Server side tcp capture: nothing shows up after the initial connection, not even a RST.
Client side tcp capture: the data sent after the ~4 minute idle period gets no response, but the client keeps retransmitting.
So yeah, Azure doesn't do networking.
Just going to jot this down real quick:
RFC5382: NAT Behavioral Requirements for TCP established connection idle-timeout:
The "established connection idle-timeout" for a NAT is defined as the
minimum time a TCP connection in the established phase must remain
idle before the NAT considers the associated session a candidate for
removal. The "transitory connection idle-timeout" for a NAT is
defined as the minimum time a TCP connection in the partially open or
closing phases must remain idle before the NAT considers the
associated session a candidate for removal. TCP connections in the
TIME_WAIT state are not affected by the "transitory connection idle-
timeout".
REQ-5: If a NAT cannot determine whether the endpoints of a TCP
connection are active, it MAY abandon the session if it has been
idle for some time. In such cases, the value of the "established
connection idle-timeout" MUST NOT be less than 2 hours 4 minutes.
The value of the "transitory connection idle-timeout" MUST NOT be
less than 4 minutes.
a) The value of the NAT idle-timeouts MAY be configurable.
Justification: The intent of this requirement is to minimize the
cases where a NAT abandons session state for a live connection.
While some NATs may choose to abandon sessions reactively in
response to new connection initiations (allowing idle connections
to stay up indefinitely in the absence of new initiations), other
NATs may choose to proactively reap idle sessions. In cases where
the NAT cannot actively determine if the connection is alive, this
requirement ensures that applications can send keep-alive packets
at the default rate (every 2 hours) such that the NAT can
passively determine that the connection is alive. The additional
4 minutes allows time for in-flight packets to cross the NAT.
The RFC leaves behavior for notifying endpoints when abandoning live connections unspecified, but encourages notification unless the NAT is unable to send it:
NAT behavior for notifying endpoints when abandoning live connections
is left unspecified. When a NAT abandons a live connection, for
example due to a timeout expiring, the NAT MAY either send TCP RST
packets to the endpoints or MAY silently abandon the connection.
Sending a RST notification allows endpoint applications to recover
more quickly; however, notifying the endpoints may not always be
possible if, for example, session state is lost due to a power
failure.
So it seems Azure basically took the 4 minute transitory connection idle-timeout and made it the established connection idle-timeout (ballsy when you notice the "MUST NOT be less than 2 hours 4 minutes"). On top of that, they went for "silently abandon the connection", considering it perfectly normal.
RFC7857, which updates 5382, doesn't have much more to say on that particular subject, but does clarify handling of RST packets and re-iterates why you shouldn't drop established connections.
And back to the 15 minute wedge: by default, Linux performs at most 15 retransmits (net.ipv4.tcp_retries2), doubling the retransmission timeout each time up to a cap of 120s (the initial RTO starts around 200ms, but depends on the RTT). That adds up to roughly 925 seconds of retries, matching the ~15 minute (935.X second) wedge observed before a new connection is set up.
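A quick back-of-the-envelope check of that timeout (the 200ms initial RTO, the 120s cap, and the 15-retry default are standard Linux kernel values; actual numbers depend on RTT and sysctls):

```go
package main

import "fmt"

func main() {
	// Sum the waits for the original send plus 15 retransmits
	// (net.ipv4.tcp_retries2), doubling the RTO each time, capped at
	// TCP_RTO_MAX (120s). 200ms initial RTO assumes a low-RTT link.
	rto, total := 0.2, 0.0
	for i := 0; i <= 15; i++ {
		total += rto
		rto *= 2
		if rto > 120 {
			rto = 120
		}
	}
	fmt.Printf("give up after ~%.1fs (%.1f min)\n", total, total/60)
	// prints: give up after ~924.6s (15.4 min)
}
```

That's in the right ballpark of the 935.X second stalls logged earlier.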
So on top of the keep-alive, it would be nice to be able to set the timeout through libpq. Of course, even if both of those existed, the defaults would most likely remain (no keep-alive, long timeouts), and every user would have to set them properly. That also assumes all postgres libraries support those settings. We can definitely recommend them; keep-alives certainly don't hurt.
Let's start with the keep-alive. There's already a request against libpq to support it: https://github.com/lib/pq/issues/360
Running a large RESTORE on a real cluster (on GCE at the time) would wedge and fail to complete for an unknown reason. This could be it. I will work on the linked lib/pq issue.
Azure's response: this is by design and will not be changed. I've been politely asked to file it as a suggestion. Done: https://feedback.azure.com/forums/34192--general-feedback/suggestions/18574540-send-rst-when-dropping-established-connections-aft
We can set tcp keep-alive on the server-side, as in #14063. Testing with/without keep-alive enabled, I get the following using an insecure cockroach node and connecting to it through the public IP:
Without:
cockroach@shared-0001:~$ ./gotests --sleep=300s 'postgres://root@crdb.io:26257/?sslmode=disable'
2017/03/10 12:16:10 Connecting to postgres://root@crdb.io:26257/?sslmode=disable
2017/03/10 12:16:10 Executing SHOW DATABASES
2017/03/10 12:16:10 Sleeping 5m0s
2017/03/10 12:21:10 Executing SHOW DATABASES
2017/03/10 12:21:15 Wedged!!!
TCP dump:
With keep-alive enabled, we can clearly see the keep-alive packets every minute. The final show databases after 5 minutes goes through just fine:
$ ./gotests --sleep=300s 'postgres://root@crdb.io:26257/?sslmode=disable'
2017/03/10 12:23:39 Connecting to postgres://root@crdb.io:26257/?sslmode=disable
2017/03/10 12:23:39 Executing SHOW DATABASES
2017/03/10 12:23:39 Sleeping 5m0s
2017/03/10 12:28:39 Executing SHOW DATABASES
2017/03/10 12:28:39 Not wedged
TCP dump:
Postgres supports keepalive initiated by the server (https://www.postgresql.org/docs/current/static/runtime-config-connection.html) and/or the psql client. It may be better to implement this on the cockroach side instead of lib/pq. Thoughts?
https://github.com/cockroachdb/cockroach/pull/14063
Things left to do:
I think we're ok for now. pg-level keep-alive settings probably wouldn't help much, and tcp transmission timeouts are kernel settings, so at best they're "production recommendations".
This could be anything from libpq, TLS connection code, or the server side.
cockroach sha: 019eeb7b64386a9d751d2055d2af52f24a93ab3d (beta-20170216)
libpq sha: ba5d4f7a35561e22fbdf7a39aa0070f4d460cfc0 (latest as of now)
The shorturl example creates a sql.DB object at startup and reuses it forever. The load is incredibly rare: some at startup (create table if not exists and a few inserts/updates), then only upon requests to https://crdb.io. When no requests have been issued for a while (seemingly just a few minutes), the first request issued will block for ~15 minutes, then finally return successfully, and with the right statement results.
The client goroutine profile shows it sitting in:
oauth_proxy log:
Showing 935s before a successful response. Strangely enough, slow responses always seem to take 935.X seconds; no other durations were found.
Chrome shows:
During that time, netstat shows a single active tcp connection from the client to the cockroach cluster. However, that one cockroach node (cockroach-catrina-0003) shows many. No particular messages in the cockroach node logs.
/debug/events shows nothing out of the ordinary; the time spent in sql is reasonable. However, the number of sql connections reported by cockroach is continuously increasing, even long after the client has been restarted.