mberhault closed this 7 years ago
@mjibson: you're probably the one most familiar with libpq. It could be the culprit, but could just as easily be server-side, or deep in the guts of TLS.
Which version of Go?
go1.7.3 linux/amd64 for the client. The server is at:
Build Tag: beta-20170216
Build Time: 2017/02/16 13:45:12
Distribution: CCL
Platform: linux amd64
Go Version: go1.7.5
C Compiler: gcc 4.9.3
Build SHA-1: 019eeb7b64386a9d751d2055d2af52f24a93ab3d
Build Type: release
There were lots of changes in database/sql in 1.8. Is it easy to test with that to see if we get a repro? I looked briefly at lib/pq and this stack trace but nothing obvious came up.
will do.
Tried again with a client built using go 1.8, same behavior. Additionally, I went and upgraded the cockroach nodes to last week's beta while the client was stuck in the state listed above. It remained stuck while all nodes were down (makes sense, we're not specifying a timeout). After bringing the nodes back up, the client is still stuck.
I should mention that while testing things on my local machine, I encountered no issues. The differences were: single node osx binary, insecure mode, and no oauth-proxy (I don't see how oauth-proxy would make a difference, we're already in the lib/pq code). But insecure might change things.
no issues when trying it locally in secure mode. Also tried using a single node's address in the connection string as opposed to the A record that resolves to all of them. No change.
another thing to note (that one definitely on the server side): the sql connections remain open long after the client has retried with another connection, or even restarted. Not sure who to ping about that. @asubiotto maybe?
Will look more into this ASAP.
so far, I've got it down to this test:
cockroach@shared-0001:~$ ./cockroach sql --url 'postgres://shorty@cockroach-catrina-0001.crdb.io:26257/shorty?sslmode=verify-ca&sslrootcert=certs/ca.crt&sslcert=certs/shorty.client.crt&sslkey=certs/shorty.client.key'
shorty@cockroach-catrina-0001.crdb.io:26257/shorty> SHOW TABLES;
+----------+
| Table |
+----------+
| counters |
| urls |
+----------+
(2 rows)
shorty@cockroach-catrina-0001.crdb.io:26257/shorty>
### Wait 5 minutes.
shorty@cockroach-catrina-0001.crdb.io:26257/shorty> SHOW TABLES;
### Hangs!
Seems to be after ~250s. The closest bracket I have is "not wedged" at 245s, "wedged" at 255s. After that, I'll try with the cluster in insecure mode.
oops. this happens in insecure mode too, same amount of time.
A quick timeline of events, specifically established tcp connections:
Client (public IP 13.77.108.155, local port 47418), shortly after connecting:
Proto Recv-Q Send-Q Local Address      Foreign Address       State
tcp        0      0 192.168.1.4:47418  40.70.216.136:26257   ESTABLISHED
Server:
Proto Recv-Q Send-Q Local Address      Foreign Address       State
tcp6       0      0 192.168.1.4:26257  13.77.108.155:47418   ESTABLISHED
While wedged, client (note the 131 bytes stuck in Send-Q):
Proto Recv-Q Send-Q Local Address      Foreign Address       State
tcp        0    131 192.168.1.4:47418  40.70.216.136:26257   ESTABLISHED
Server:
Proto Recv-Q Send-Q Local Address      Foreign Address       State
tcp6       0      0 192.168.1.4:26257  13.77.108.155:47418   ESTABLISHED
This seems to indicate bad communication between client and server; specifically, the server never ACKs the client's data (hence the client's non-empty Send-Q).
My next step is to try from within the same Azure private network (the client and nodes are on two separate networks, but obviously the firewall allows communication).
client goroutine while the second query is wedged:
goroutine 26 [IO wait]:
net.runtime_pollWait(0x7faee0a97e78, 0x72, 0x3)
/home/marc/go/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xc4201367d8, 0x72, 0x9cf0a0, 0x9cb628)
/home/marc/go/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitRead(0xc4201367d8, 0xc420077000, 0x1000)
/home/marc/go/src/net/fd_poll_runtime.go:80 +0x34
net.(*netFD).Read(0xc420136770, 0xc420077000, 0x1000, 0x1000, 0x0, 0x9cf0a0, 0x9cb628)
/home/marc/go/src/net/fd_unix.go:250 +0x1b7
net.(*conn).Read(0xc42000e0b0, 0xc420077000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/home/marc/go/src/net/net.go:181 +0x70
bufio.(*Reader).Read(0xc42005ab40, 0xc420154020, 0x5, 0x200, 0xc420017300, 0x1, 0x3)
/home/marc/go/src/bufio/bufio.go:213 +0x312
io.ReadAtLeast(0x9ccd20, 0xc42005ab40, 0xc420154020, 0x5, 0x200, 0x5, 0x4110a2, 0xc42011e5a0, 0x20)
/home/marc/go/src/io/io.go:307 +0xa9
io.ReadFull(0x9ccd20, 0xc42005ab40, 0xc420154020, 0x5, 0x200, 0x0, 0xc420026c00, 0x0)
/home/marc/go/src/io/io.go:325 +0x58
github.com/lib/pq.(*conn).recvMessage(0xc420154000, 0xc42011e5a0, 0x7939a0, 0x1, 0xa20e40)
/home/marc/cockroach/src/github.com/lib/pq/conn.go:957 +0x13e
github.com/lib/pq.(*conn).recv1Buf(0xc420154000, 0xc42011e5a0, 0xc420050230)
/home/marc/cockroach/src/github.com/lib/pq/conn.go:1007 +0x39
github.com/lib/pq.(*conn).recv1(0xc420154000, 0x60096f, 0xc42000e0b0)
/home/marc/cockroach/src/github.com/lib/pq/conn.go:1028 +0x85
github.com/lib/pq.(*conn).readParseResponse(0xc420154000)
/home/marc/cockroach/src/github.com/lib/pq/conn.go:1574 +0x2f
github.com/lib/pq.(*conn).prepareTo(0xc420154000, 0x837f00, 0x6e, 0x0, 0x0, 0xc)
/home/marc/cockroach/src/github.com/lib/pq/conn.go:784 +0x5d7
github.com/lib/pq.(*conn).query(0xc420154000, 0x837f00, 0x6e, 0xc420120510, 0x1, 0x1, 0x0, 0x0, 0x0)
/home/marc/cockroach/src/github.com/lib/pq/conn.go:855 +0x34b
github.com/lib/pq.(*conn).QueryContext(0xc420154000, 0x9d2a20, 0xc42001c188, 0x837f00, 0x6e, 0xc4201625d0, 0x1, 0x1, 0x7faee0aa49c0, 0x454d60, ...)
/home/marc/cockroach/src/github.com/lib/pq/conn_go18.go:21 +0x1f1
database/sql.ctxDriverQuery(0x9d2a20, 0xc42001c188, 0x7faee0aa49c0, 0xc420154000, 0x837f00, 0x6e, 0xc4201625d0, 0x1, 0x1, 0x42afce, ...)
/home/marc/go/src/database/sql/ctxutil.go:48 +0x28d
database/sql.(*DB).queryConn.func1()
/home/marc/go/src/database/sql/sql.go:1264 +0x99
database/sql.withLock(0x9cfb20, 0xc420116380, 0xc420045850)
/home/marc/go/src/database/sql/sql.go:2545 +0x65
database/sql.(*DB).queryConn(0xc420138280, 0x9d2a20, 0xc42001c188, 0xc420116380, 0xc420120470, 0x837f00, 0x6e, 0xc420045ac8, 0x1, 0x1, ...)
/home/marc/go/src/database/sql/sql.go:1265 +0x671
database/sql.(*DB).query(0xc420138280, 0x9d2a20, 0xc42001c188, 0x837f00, 0x6e, 0xc420045ac8, 0x1, 0x1, 0x1, 0xc420026c00, ...)
/home/marc/go/src/database/sql/sql.go:1250 +0x12f
database/sql.(*DB).QueryContext(0xc420138280, 0x9d2a20, 0xc42001c188, 0x837f00, 0x6e, 0xc420045ac8, 0x1, 0x1, 0xc420120450, 0xc420045ab8, ...)
/home/marc/go/src/database/sql/sql.go:1227 +0xb8
database/sql.(*DB).Query(0xc420138280, 0x837f00, 0x6e, 0xc420045ac8, 0x1, 0x1, 0x1, 0x3700000000000000, 0x4)
/home/marc/go/src/database/sql/sql.go:1241 +0x82
main.getShortysByOwner(0xc42011e2e0, 0x16, 0x0, 0x0, 0x0, 0x0, 0x0)
/home/marc/cockroach/src/github.com/cockroachdb/examples-go/shorturl/db.go:148 +0x158
main.handleNew(0x9d22a0, 0xc4201600e0, 0xc42000a800)
/home/marc/cockroach/src/github.com/cockroachdb/examples-go/shorturl/handlers.go:92 +0xac
main.handleSettings(0x9d22a0, 0xc4201600e0, 0xc42000a800)
/home/marc/cockroach/src/github.com/cockroachdb/examples-go/shorturl/handlers.go:73 +0x1f2
main.handleRoot(0x9d22a0, 0xc4201600e0, 0xc42000a800)
/home/marc/cockroach/src/github.com/cockroachdb/examples-go/shorturl/handlers.go:41 +0xb7
net/http.HandlerFunc.ServeHTTP(0x8398b0, 0x9d22a0, 0xc4201600e0, 0xc42000a800)
/home/marc/go/src/net/http/server.go:1942 +0x44
net/http.(*ServeMux).ServeHTTP(0xa046c0, 0x9d22a0, 0xc4201600e0, 0xc42000a800)
/home/marc/go/src/net/http/server.go:2238 +0x130
net/http.serverHandler.ServeHTTP(0xc42009ed10, 0x9d22a0, 0xc4201600e0, 0xc42000a800)
/home/marc/go/src/net/http/server.go:2568 +0x92
net/http.(*conn).serve(0xc420138460, 0x9d29e0, 0xc42012ca80)
/home/marc/go/src/net/http/server.go:1825 +0x612
created by net/http.(*Server).Serve
/home/marc/go/src/net/http/server.go:2668 +0x2ce
This only occurs when talking to a node using its public address. E.g. cockroach-catrina-0001.crdb.io resolves to the public address, while cockroach-catrina-0001, when on the same private network (Azure terminology), resolves to the private address. Any attempt to talk to cockroach-catrina-0001.crdb.io fails, be it from outside the private network, within the private network, or even on the same machine. However, using cockroach-catrina-0001 (restricted to the private network) works, both on the same machine and from another. Sample run on cockroach-catrina-0001 itself:
# Against public address
cockroach@cockroach-catrina-0001:~$ ./gotests -sleep 300s 'postgres://shorty@cockroach-catrina-0001.crdb.io:26257/shorty?sslmode=disable'
2017/03/08 22:21:06 Connecting to postgres://shorty@cockroach-catrina-0001.crdb.io:26257/shorty?sslmode=disable
2017/03/08 22:21:06 Executing SHOW DATABASES
2017/03/08 22:21:06 Sleeping 5m0s
2017/03/08 22:26:06 Executing SHOW DATABASES
2017/03/08 22:26:11 Wedged!!!
# Against private address
cockroach@cockroach-catrina-0001:~$ ./gotests -sleep 300s 'postgres://shorty@cockroach-catrina-0001:26257/shorty?sslmode=disable'
2017/03/08 22:26:55 Connecting to postgres://shorty@cockroach-catrina-0001:26257/shorty?sslmode=disable
2017/03/08 22:26:55 Executing SHOW DATABASES
2017/03/08 22:26:55 Sleeping 5m0s
2017/03/08 22:31:55 Executing SHOW DATABASES
2017/03/08 22:31:55 Not wedged
So something's iffy in their firewall. tcpdump is up next.
well crap! https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-tcp-idle-timeout — the Azure load balancer silently drops idle TCP connections after a (configurable) idle timeout, 4 minutes by default.
I can tweak the idle timeout for a given public IP. eg:
$ azure network public-ip set cockroach-catrina cockroach-catrina-0001 -i 5
(the 5 is the idle timeout in minutes).
However, my wedginess detector (TM) still fails below 5 minutes. Could be a config propagation issue.
It looks like this may trigger a few things:
figure out how to do keep-alives in pgwire
Would TCP keepalives work? That's all I could find in the pgwire docs.
they should, I'm testing that now. mongo ran into the same thing: https://docs.mongodb.com/manual/administration/production-notes/#windows-azure-production-notes
this isn't just the CLI, this is any client whatsoever. there are other bugs here too, see the list a few comments earlier.
ok, changing the kernel keepalive settings ($ sysctl -w net.ipv4.tcp_keepalive_time=120) doesn't help; the connection needs to be set up with keepalive enabled. bleh!
Ah, I see; good point. I think we may need to take this upstream with support for keepalives and friends [0].
[0] https://www.postgresql.org/docs/9.3/static/libpq-connect.html#LIBPQ-PARAMKEYWORDS
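For the record, the C libpq (and thus psql) already exposes these as connection-string keywords, per the docs linked above; Go's lib/pq did not at the time. A sketch, with illustrative values:

```shell
# Ask the kernel to probe after 60s idle, every 10s, giving up after 3
# failed probes (keepalives* keywords per the libpq docs above):
psql 'host=cockroach-catrina-0001.crdb.io port=26257 dbname=shorty keepalives=1 keepalives_idle=60 keepalives_interval=10 keepalives_count=3'
```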
Yup. @mjibson seems to have some pull over there :) in the meantime, I'll add this to my long list of "the cloud sucks!" gripes and ping Azure
Heh, I have commit access over there as well.
@tamird why did you close that keep alive PR? Did it not work?
It wasn't general enough; it doesn't fix other clients, say.
Ok, I went for barebones tcp, and the connection indeed gets dropped on the floor, both from a client and server pov.
on shared-0001 (external address: crdb.io), I ran (at the same time):
# capture all packets to/from tcp port 8888
$ sudo tcpdump -i any 'tcp port 8888' -s 65535 -w server.dump
# Basic "echo" service on tcp port 8888
$ ncat -l 192.168.1.4 8888 -k -e /bin/cat
On my machine, I ran (at the same time):
# capture all packets to/from tcp port 8888
sudo tcpdump -i any 'tcp port 8888' -s 65535 -w client.dump
# send stuff to crdb.io on tcp 8888:
$ cat tcp_sleep.sh
echo "starting..."
sleep 255
echo "done..."
$ ./tcp_sleep.sh | ncat crdb.io 8888 -v -v -v
Ncat: Version 7.01 ( https://nmap.org/ncat )
NCAT DEBUG: Using system default trusted CA certificates and those in /etc/ssl/certs/ca-certificates.crt.
libnsock nsock_iod_new2(): nsock_iod_new (IOD #1)
libnsock nsock_connect_tcp(): TCP connection requested to 13.77.108.155:8888 (IOD #1) EID 8
libnsock nsock_trace_handler_callback(): Callback: CONNECT SUCCESS for EID 8 [13.77.108.155:8888]
Ncat: Connected to 13.77.108.155:8888.
libnsock nsock_iod_new2(): nsock_iod_new (IOD #2)
libnsock nsock_read(): Read request from IOD #1 [13.77.108.155:8888] (timeout: -1ms) EID 18
libnsock nsock_readbytes(): Read request for 0 bytes from IOD #2 [peer unspecified] EID 26
libnsock nsock_trace_handler_callback(): Callback: READ SUCCESS for EID 26 [peer unspecified] (12 bytes): starting....
libnsock nsock_write(): Write request for 12 bytes to IOD #1 EID 35 [13.77.108.155:8888]
libnsock nsock_trace_handler_callback(): Callback: WRITE SUCCESS for EID 35 [13.77.108.155:8888]
libnsock nsock_readbytes(): Read request for 0 bytes from IOD #2 [peer unspecified] EID 42
libnsock nsock_trace_handler_callback(): Callback: READ SUCCESS for EID 18 [13.77.108.155:8888] (12 bytes): starting....
starting...
libnsock nsock_readbytes(): Read request for 0 bytes from IOD #1 [13.77.108.155:8888] EID 50
libnsock nsock_trace_handler_callback(): Callback: READ SUCCESS for EID 42 [peer unspecified] (8 bytes): done....
libnsock nsock_write(): Write request for 8 bytes to IOD #1 EID 59 [13.77.108.155:8888]
libnsock nsock_trace_handler_callback(): Callback: WRITE SUCCESS for EID 59 [13.77.108.155:8888]
libnsock nsock_readbytes(): Read request for 0 bytes from IOD #2 [peer unspecified] EID 66
libnsock nsock_trace_handler_callback(): Callback: READ EOF for EID 66 [peer unspecified]
# Just hangs.
Server side tcp capture: nothing shows up after the initial connection, not even a RST.
Client side tcp capture: the data sent after the ~4 minute idle period gets no response, but the client keeps retransmitting.
So yeah, Azure doesn't do networking.
Just going to jot this down real quick:
RFC5382: NAT Behavioral Requirements for TCP established connection idle-timeout:
The "established connection idle-timeout" for a NAT is defined as the
minimum time a TCP connection in the established phase must remain
idle before the NAT considers the associated session a candidate for
removal. The "transitory connection idle-timeout" for a NAT is
defined as the minimum time a TCP connection in the partially open or
closing phases must remain idle before the NAT considers the
associated session a candidate for removal. TCP connections in the
TIME_WAIT state are not affected by the "transitory connection idle-
timeout".
REQ-5: If a NAT cannot determine whether the endpoints of a TCP
connection are active, it MAY abandon the session if it has been
idle for some time. In such cases, the value of the "established
connection idle-timeout" MUST NOT be less than 2 hours 4 minutes.
The value of the "transitory connection idle-timeout" MUST NOT be
less than 4 minutes.
a) The value of the NAT idle-timeouts MAY be configurable.
Justification: The intent of this requirement is to minimize the
cases where a NAT abandons session state for a live connection.
While some NATs may choose to abandon sessions reactively in
response to new connection initiations (allowing idle connections
to stay up indefinitely in the absence of new initiations), other
NATs may choose to proactively reap idle sessions. In cases where
the NAT cannot actively determine if the connection is alive, this
requirement ensures that applications can send keep-alive packets
at the default rate (every 2 hours) such that the NAT can
passively determine that the connection is alive. The additional
4 minutes allows time for in-flight packets to cross the NAT.
The RFC leaves behavior for notifying endpoints when abandoning live connections unspecified, but encourages notification unless the NAT is unable to send it:
NAT behavior for notifying endpoints when abandoning live connections
is left unspecified. When a NAT abandons a live connection, for
example due to a timeout expiring, the NAT MAY either send TCP RST
packets to the endpoints or MAY silently abandon the connection.
Sending a RST notification allows endpoint applications to recover
more quickly; however, notifying the endpoints may not always be
possible if, for example, session state is lost due to a power
failure.
So it seems Azure basically took the 4 minute transitory connection idle-timeout and made it the established connection idle-timeout (ballsy when you notice the "MUST NOT be less than 2 hours 4 minutes"). On top of that, they went for "silently abandon the connection", considering it perfectly normal.
RFC7857, which updates 5382, doesn't have much more to say on that particular subject, but does clarify handling of RST packets and re-iterates why you shouldn't drop established connections.
And back to the 15 minute wedge: by default, Linux performs at most 15 retransmits (net.ipv4.tcp_retries2), doubling the retransmission timeout each time up to a cap of 120s (the initial RTO starts around 200ms, but depends on the RTT). That adds up to roughly 925 seconds of retries, matching the ~15 minute (935.X second) wedge observed before a new connection is set up.
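A quick back-of-the-envelope check of that timeout (the 200ms initial RTO, the 120s cap, and the 15-retry default are standard Linux kernel values; actual numbers depend on RTT and sysctls):

```go
package main

import "fmt"

func main() {
	// Sum the waits for the original send plus 15 retransmits
	// (net.ipv4.tcp_retries2), doubling the RTO each time, capped at
	// TCP_RTO_MAX (120s). 200ms initial RTO assumes a low-RTT link.
	rto, total := 0.2, 0.0
	for i := 0; i <= 15; i++ {
		total += rto
		rto *= 2
		if rto > 120 {
			rto = 120
		}
	}
	fmt.Printf("give up after ~%.1fs (%.1f min)\n", total, total/60)
	// prints: give up after ~924.6s (15.4 min)
}
```

That's in the right ballpark of the 935.X second stalls logged earlier.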
So on top of the keep-alive, it would be nice to be able to set the timeout through libpq. Of course, even if both of those existed, the defaults would most likely remain (no keep-alive, long timeouts), and every user would have to set them properly. That also assumes all postgres libraries support those settings. We can definitely recommend them; keep-alives certainly don't hurt.
Let's start with the keep-alive. There's already a request against libpq to support it: https://github.com/lib/pq/issues/360
Running a large RESTORE on a real cluster (on GCE at the time) would wedge and fail to complete for an unknown reason. This could be it. I will work on the linked lib/pq issue.
Azure's response: this is by design and will not be changed. I've been politely asked to file it as a suggestion. Done: https://feedback.azure.com/forums/34192--general-feedback/suggestions/18574540-send-rst-when-dropping-established-connections-aft
We can set tcp keep-alive on the server-side, as in #14063. Testing with/without keep-alive enabled, I get the following using an insecure cockroach node and connecting to it through the public IP:
Without:
cockroach@shared-0001:~$ ./gotests --sleep=300s 'postgres://root@crdb.io:26257/?sslmode=disable'
2017/03/10 12:16:10 Connecting to postgres://root@crdb.io:26257/?sslmode=disable
2017/03/10 12:16:10 Executing SHOW DATABASES
2017/03/10 12:16:10 Sleeping 5m0s
2017/03/10 12:21:10 Executing SHOW DATABASES
2017/03/10 12:21:15 Wedged!!!
TCP dump:
With keep-alive enabled, we can clearly see the keep-alive packets every minute. The final show databases after 5 minutes goes through just fine:
$ ./gotests --sleep=300s 'postgres://root@crdb.io:26257/?sslmode=disable'
2017/03/10 12:23:39 Connecting to postgres://root@crdb.io:26257/?sslmode=disable
2017/03/10 12:23:39 Executing SHOW DATABASES
2017/03/10 12:23:39 Sleeping 5m0s
2017/03/10 12:28:39 Executing SHOW DATABASES
2017/03/10 12:28:39 Not wedged
TCP dump:
Postgres supports keepalive initiated by the server (https://www.postgresql.org/docs/current/static/runtime-config-connection.html) and/or the psql client. It may be better to implement this on the cockroach side instead of lib/pq. Thoughts?
https://github.com/cockroachdb/cockroach/pull/14063
Things left to do:
I think we're ok for now. pg-level keep-alive settings probably wouldn't help much, and tcp transmission timeouts are kernel settings, so at best they're "production recommendations".
This could be anything from libpq, TLS connection code, or the server side.
cockroach sha: 019eeb7b64386a9d751d2055d2af52f24a93ab3d (beta-20170216)
libpq sha: ba5d4f7a35561e22fbdf7a39aa0070f4d460cfc0 (latest as of now)
The shorturl example creates a sql.DB object at startup and reuses it forever. The load is incredibly rare: some at startup (create table if not exists and a few inserts/updates), then only upon requests to https://crdb.io. When no requests have been issued for a while (seemingly just a few minutes), the first request issued will block for ~15 minutes, then finally return successfully, and with the right statement results.
The client goroutine profile shows it sitting in:
oauth_proxy log:
Showing 935s before a successful response. Strangely enough, slow responses always seem to take 935.X seconds; no other durations were found.
Chrome shows:
During that time, netstat shows a single active tcp connection from the client to the cockroach cluster. However, that one cockroach node (cockroach-catrina-0003) shows many. No particular messages in the cockroach node logs.
/debug/events shows nothing out of the ordinary; the time spent in sql is reasonable. However, the number of sql connections reported by cockroach is continuously increasing, even long after the client has been restarted.