Closed joshimhoff closed 2 years ago
I have various Qs about this still:
Though it appears to be possible to check if a TCP connection is closed without sending / receiving over the network (https://stackoverflow.com/questions/12741386/how-to-know-tcp-connection-is-closed-in-net-package), some connection pool implementations (possibly/probably (?) many/all but at least hikari) don't do this.
Do any connection pool implementations avoid this failure mode by checking for a closed connection (which IIUC doesn't require sending / receiving over the network so could be fast) & retrying without bubbling up an error to the user if one is found? That seems possible to me still & could explain why we don't see evidence of this happening widely, despite many customers running in production.
Post hitting this error, connection pool implementations stop using the closed connection, but the above issues do bubble up the application layer, so there is some impact on availability.
I don't really understand what hirari & other connection pool implementations do here. When they hit I/O timeout errors specifically, they then check health of connection & if that second check fails, then they recycle the connection? In a way this ties into my first Q. I wouldn't mind links to the code to answer this second Q.
Perhaps @rafiss has thoughts (tho no rush Rafi... I know you are busy).
Do any connection pool implementations avoid this failure mode by checking for a closed connection (which IIUC doesn't require sending / receiving over the network so could be fast) & retrying without bubbling up an error to the user if one is found?
One thing missing from the issue summary here is that connection pools already check for the health of a connection before using it. Specifically, here is the order of events when the errors happen.
Here is the getConnection code in Hikari. Note that it calls isConnectionAlive
. And here you can see the isConnectionAlive implementation.
After the connection pool gives the connection to the application, then the connection pool can no longer perform the check you are proposing. The application itself could perform the check each time it tries to use the connection (a connection which was already deemed "healthy"), but that would be incredibly onerous in the code.
But that brings me to the next point -- how would doing the check even help? The client will end up in the same situation:
The important point is that there's no general retry logic that can be used. If the application is in the middle of a transaction when it gets the error, then the connection pool wouldn't know that it has to retry all the previous statements from that transaction. Or if the application had already finished one transaction, but got the error when doing a second, the connection pool wouldn't know what's already succeeded and what hasn't.
I don't really understand what hirari & other connection pool implementations do here. When they hit I/O timeout errors specifically, they then check health of connection & if that second check fails, then they recycle the connection? In a way this ties into my first Q. I wouldn't mind links to the code to answer this second Q.
The pool always is checking the health of a connection when it gives it to the application. Specifically in Hikari, this block will return false for isConnectionAlive. Also, more to your question, if an error happens while the connection is in use, then the logic of Hikari's ProxyConnection will evict that connection from the pool.
It would be helpful to understand what we want out of this issue beyond what is described in https://github.com/cockroachdb/cockroach/issues/66319
It would be helpful to understand what we want out of this issue beyond what is described in #66319
I think we want a shared mental model of the problem, rather than a feature request that may improve the situation.
If the application is in the middle of a transaction when it gets the error, then the connection pool wouldn't know that it has to retry all the previous statements from that transaction.
The fact that the need to close a conn in the middle of a txn leads to an unavailability blip makes sense to me. Customers need to not have overly long running queries if they want zero downtime. We can make that recommendation, that is, txns must finish within query_wait
seconds (or something like that).
I am thinking only about idle conns. The closing of an idle conn can currently lead to an availability blip, and I don't understand why this is necessary.
Or if the application had already finished one transaction, but got the error when doing a second, the connection pool wouldn't know what's already succeeded and what hasn't.
Perhaps this is the answer to my Q but I don't understand it.
This connection is given to the application.
Or this? Is a specific connection literally returned to the application always in hikari land? Let's look at https://golang.org/pkg/database/sql/#DB. You can ask for a specific conn via https://golang.org/pkg/database/sql/#DB.Conn but why would you do that? That opens you up this failure mode & also failures when nodes dies randomly. But if you query always via *DB
(e.g. https://golang.org/src/database/sql/sql.go?s=42508:42577#L1538), then you are not committing to sending over a specific conn. In this case, if the someone begins executing a SQL query and that query immediately fails in a way that indicates the connection has been closed, I don't see why the database/sql
can't retry without bubbling the error to the application layer, simply by choosing another conn. Do you see what I mean? Or am I getting confused?
Trying to understand whether this issue could hit a user of golang database/sql
& why. This is interesting:
// ExecContext executes a query without returning any rows.
// The args are for any placeholder parameters in the query.
func (db *DB) ExecContext(ctx context.Context, query string, args ...interface{}) (Result, error) {
var res Result
var err error
for i := 0; i < maxBadConnRetries; i++ {
res, err = db.exec(ctx, query, args, cachedOrNewConn)
if err != driver.ErrBadConn {
break
}
}
if err == driver.ErrBadConn {
return db.exec(ctx, query, args, alwaysNewConn)
}
return res, err
}
Drilling deeper:
func (db *DB) exec(ctx context.Context, query string, args []interface{}, strategy connReuseStrategy) (Result, error) {
dc, err := db.conn(ctx, strategy)
if err != nil {
return nil, err
}
return db.execDC(ctx, dc, dc.releaseConn, query, args)
}
Also this:
// maxBadConnRetries is the number of maximum retries if the driver returns
// driver.ErrBadConn to signal a broken connection before forcing a new
// connection to be opened.
const maxBadConnRetries = 2
There is a retry loop for getting a usable connection.
From https://golang.org/src/database/sql/sql.go.
What does driver.ErrBadConn
mean exactly?
// conn returns a newly-opened or cached *driverConn.
func (db *DB) conn(ctx context.Context, strategy connReuseStrategy) (*driverConn, error) {
db.mu.Lock()
if db.closed {
db.mu.Unlock()
return nil, errDBClosed
}
// Check if the context is expired.
select {
default:
case <-ctx.Done():
db.mu.Unlock()
return nil, ctx.Err()
}
lifetime := db.maxLifetime
// Prefer a free connection, if possible.
numFree := len(db.freeConn)
if strategy == cachedOrNewConn && numFree > 0 {
conn := db.freeConn[0]
copy(db.freeConn, db.freeConn[1:])
db.freeConn = db.freeConn[:numFree-1]
conn.inUse = true
if conn.expired(lifetime) {
db.maxLifetimeClosed++
db.mu.Unlock()
conn.Close()
return nil, driver.ErrBadConn
}
db.mu.Unlock()
// Reset the session if required.
if err := conn.resetSession(ctx); err == driver.ErrBadConn {
conn.Close()
return nil, driver.ErrBadConn
}
return conn, nil
}
// Out of free connections or we were asked not to use one. If we're not
// allowed to open any more connections, make a request and wait.
if db.maxOpen > 0 && db.numOpen >= db.maxOpen {
// Make the connRequest channel. It's buffered so that the
// connectionOpener doesn't block while waiting for the req to be read.
req := make(chan connRequest, 1)
reqKey := db.nextRequestKeyLocked()
db.connRequests[reqKey] = req
db.waitCount++
db.mu.Unlock()
waitStart := nowFunc()
// Timeout the connection request with the context.
select {
case <-ctx.Done():
// Remove the connection request and ensure no value has been sent
// on it after removing.
db.mu.Lock()
delete(db.connRequests, reqKey)
db.mu.Unlock()
atomic.AddInt64(&db.waitDuration, int64(time.Since(waitStart)))
select {
default:
case ret, ok := <-req:
if ok && ret.conn != nil {
db.putConn(ret.conn, ret.err, false)
}
}
return nil, ctx.Err()
case ret, ok := <-req:
atomic.AddInt64(&db.waitDuration, int64(time.Since(waitStart)))
if !ok {
return nil, errDBClosed
}
// Only check if the connection is expired if the strategy is cachedOrNewConns.
// If we require a new connection, just re-use the connection without looking
// at the expiry time. If it is expired, it will be checked when it is placed
// back into the connection pool.
// This prioritizes giving a valid connection to a client over the exact connection
// lifetime, which could expire exactly after this point anyway.
if strategy == cachedOrNewConn && ret.err == nil && ret.conn.expired(lifetime) {
db.mu.Lock()
db.maxLifetimeClosed++
db.mu.Unlock()
ret.conn.Close()
return nil, driver.ErrBadConn
}
if ret.conn == nil {
return nil, ret.err
}
// Reset the session if required.
if err := ret.conn.resetSession(ctx); err == driver.ErrBadConn {
ret.conn.Close()
return nil, driver.ErrBadConn
}
return ret.conn, ret.err
}
}
db.numOpen++ // optimistically
db.mu.Unlock()
ci, err := db.connector.Connect(ctx)
if err != nil {
db.mu.Lock()
db.numOpen-- // correct for earlier optimism
db.maybeOpenNewConnections()
db.mu.Unlock()
return nil, err
}
db.mu.Lock()
dc := &driverConn{
db: db,
createdAt: nowFunc(),
returnedAt: nowFunc(),
ci: ci,
inUse: true,
}
db.addDepLocked(dc, dc)
db.mu.Unlock()
return dc, nil
}
Focusing on the use an existing free conn bit:
// Prefer a free connection, if possible.
numFree := len(db.freeConn)
if strategy == cachedOrNewConn && numFree > 0 {
conn := db.freeConn[0]
copy(db.freeConn, db.freeConn[1:])
db.freeConn = db.freeConn[:numFree-1]
conn.inUse = true
if conn.expired(lifetime) {
db.maxLifetimeClosed++
db.mu.Unlock()
conn.Close()
return nil, driver.ErrBadConn
}
db.mu.Unlock()
// Reset the session if required.
if err := conn.resetSession(ctx); err == driver.ErrBadConn {
conn.Close()
return nil, driver.ErrBadConn
}
return conn, nil
}
Return driver.ErrBadConn
if conn is too old or if "resetting the session" returns driver.ErrBadConn
. Maybe calling resetSession
fails with driver.ErrBadConn
in case CRDB has closed the conn on the server side already? In which case only way to hit this issue when using *DB
's ExecContext
is a race where resetSession
runs right before CRBD closes all conns.
Looks about right:
// SessionResetter may be implemented by Conn to allow drivers to reset the
// session state associated with the connection and to signal a bad connection.
type SessionResetter interface {
// ResetSession is called prior to executing a query on the connection
// if the connection has been used before. If the driver returns ErrBadConn
// the connection is discarded.
ResetSession(ctx context.Context) error
}
Specifically this part in the docs: "... to signal a bad connection".
From https://golang.org/src/database/sql/driver/driver.go
In which case only way to hit this issue when using
*DB
'sExecContext
is a race whereresetSession
runs right before CRBD closes all conns.
So a user of database/sql
can be affected. But perhaps not very likely?
OTOH, if you call Connection
(https://golang.org/pkg/database/sql/#DB.Conn) & hold onto it for some time, you can increase the likelihood of being affected by above bug quite a lot, as there is more time for conn to be closed between resetSession
call and the using of the conn. Minimizing how long you hold onto an explicit connection -- or actually ideally avoiding it all together (use *DB
instead) -- seems very wise to me. Possibly some of our customers don't do that.
Not sure I have this right but I am learning things at least...
Also, @rafiss, after reading database/sql
, I see what you mean when you say:
One thing missing from the issue summary here is that connection pools already check for the health of a connection before using it.
I'll update the issue summary.
Wondering if you have any thoughts about the database/sql
analysis above. In particular this bit as I could see it contributing to the customer issue we are looking at:
OTOH, if you call Connection (https://golang.org/pkg/database/sql/#DB.Conn) & hold onto it for some time, you can increase the likelihood of being affected by above bug quite a lot, as there is more time for conn to be closed between resetSession call and the using of the conn. Minimizing how long you hold onto an explicit connection -- or actually ideally avoiding it all together (use
*DB
instead) -- seems very wise to me. Possibly some of our customers don't do that.
Now I am looking at the pgx
driver for database/sql
and I found something very interesting!! https://github.com/jackc/pgx
Look at this beautiful thing:
func (c *Conn) ExecContext(ctx context.Context, query string, argsV []driver.NamedValue) (driver.Result, error) {
if c.conn.IsClosed() {
return nil, driver.ErrBadConn
}
args := namedValueToInterface(argsV)
commandTag, err := c.conn.Exec(ctx, query, args...)
// if we got a network error before we had a chance to send the query, retry
if err != nil {
if pgconn.SafeToRetry(err) {
return nil, driver.ErrBadConn
}
}
return driver.RowsAffected(commandTag.RowsAffected()), err
}
Combine above with below from database/sql
:
// ExecContext executes a query without returning any rows.
// The args are for any placeholder parameters in the query.
func (db *DB) ExecContext(ctx context.Context, query string, args ...interface{}) (Result, error) {
var res Result
var err error
for i := 0; i < maxBadConnRetries; i++ {
res, err = db.exec(ctx, query, args, cachedOrNewConn)
if err != driver.ErrBadConn {
break
}
}
if err == driver.ErrBadConn {
return db.exec(ctx, query, args, alwaysNewConn)
}
return res, err
}
I think if you use database/sql
, pgx
as a driver, and query via DB.ExecContext
(query against *DB
& don't open a txn), you will NOT ever hit the bug described in this ticket. If a conn is closed by CRBD, pgx
will return driver.ErrBadConn
, leading to a retry. This is what I was hoping for when I wrote:
Do any connection pool implementations avoid this failure mode by checking for a closed connection (which IIUC doesn't require sending / receiving over the network so could be fast) & retrying without bubbling up an error to the user if one is found?
(Maybe there is a race if query happens right as conn is being closed meaning conn is not fully closed yet tho. Sounds like it as per "if we got a network error before we had a chance to send the query".)
Interestingly, I think you CAN hit the failure mode when trying to open a txn, as there is no returning of driver.ErrBadConn
(except when you already know the conn is closed without sending over the network (that is, the client closed the conn, not the server)):
func (c *Conn) BeginTx(ctx context.Context, opts driver.TxOptions) (driver.Tx, error) {
if c.conn.IsClosed() {
return nil, driver.ErrBadConn
}
if pconn, ok := ctx.Value(ctxKeyFakeTx).(**pgx.Conn); ok {
*pconn = c.conn
return fakeTx{}, nil
}
var pgxOpts pgx.TxOptions
switch sql.IsolationLevel(opts.Isolation) {
case sql.LevelDefault:
case sql.LevelReadUncommitted:
pgxOpts.IsoLevel = pgx.ReadUncommitted
case sql.LevelReadCommitted:
pgxOpts.IsoLevel = pgx.ReadCommitted
case sql.LevelRepeatableRead, sql.LevelSnapshot:
pgxOpts.IsoLevel = pgx.RepeatableRead
case sql.LevelSerializable:
pgxOpts.IsoLevel = pgx.Serializable
default:
return nil, fmt.Errorf("unsupported isolation: %v", opts.Isolation)
}
if opts.ReadOnly {
pgxOpts.AccessMode = pgx.ReadOnly
}
tx, err := c.conn.BeginTx(ctx, pgxOpts)
if err != nil {
return nil, err
}
return wrapTx{ctx: ctx, tx: tx}, nil
}
I updated the issue summary as per feedback from Rafi!
A very long term idea: we could think about implementing live TCP migrations. There are systems that allow you to move a TCP connection from one node to another, with no interruption to the other end.
https://twitter.com/colmmacc/status/1407820679962054662?s=20
Soft of relevant database/sql
issue posted by @JeffSwenson: https://github.com/golang/go/issues/11978
(Maybe there is a race if query happens right as conn is being closed meaning conn is not fully closed yet tho. Sounds like it as per "if we got a network error before we had a chance to send the query".)
In other words, yes, there is a race.
@rafiss @joshimhoff for my own education, can you tell me how this issue and #66319 differ from each other?
I think #66319 proposes a modification to CRBD (see title especially) that may help with the issue outlined in this ticket. If we want to merge things somehow, that SGTM, tho I think it's reasonable to not given above. I feel even more that the details in this ticket re: the nature of the problem should be stored somewhere, as it took a long time for me, Rafi, & Olvier get on the same page about the problem, as it is rather subtle.
I wrote a toy test program, and found that using the pgx
driver with database/sql
reproduces the server is shutting down error. I ran the code below against a local 3-node cluster behind HAProxy, then started a graceful shutdown of one of the nodes.
Note the error does not happen when using database/sql
with lib/pq
. (In the above code replace sql.Open("pgx", url)
with sql.Open("postgres", url)
.) This is because the error handling at https://github.com/lib/pq/blob/9e747ca50601fcb6c958dd89f4cb8aea3e067767/error.go#L481 seems to return ErrBadConn
when CRDB is shutting down, which allows the retry logic that @joshimhoff described above to work.
The error does happen also when using pgx
with pgxpool
(see next snippet for code). So it looks like the pgx
detection for "safe to retry" doesn't handle the way CRDB shuts down. My reading of https://github.com/jackc/pgconn/blob/53f5fed36c570f0b5c98d6ec2415658c7b9bd11c/pgconn.go#L949 is that pgx
only marks an error as safe to retry if 0 bytes have been written to the socket.
NICE REPRO!
found that using the pgx driver with database/sql reproduces the server is shutting down error
Does it repro both ERROR: server is shutting down
AND I/O timeouts post that due to conn closure (as happened with the earlier CC customer)? Otherwise the issue could be more:
But returning ERROR: server is shutting down first means a bit of unavailability ahead of the conn being closed, as database/sql can’t retry around that error. Thoughts, Rafi?
I believe the thinking is that since this is an “admin shutdown” error (see 57P01 on Appendix A. PostgreSQL Error Codes ), the client should be able to reliably retry around this error. But of course that depends on the driver, and I don’t know which drivers are smart about 57P01 and which are not.
That is, perhaps the pgx driver with database/sql doesn't retry around ERROR: server is shutting down
but does around trying to send to a closed conn?
We could experiment by patching them to a special build that doesn't ever return ERROR: server is shutting down
. Or actually we could just use your repro!!
closing as part of Epic CRDB-10458
Describe the problem
As discovered by @rafiss & @otan, draining SQL connections, as is done during CRDB graceful updates, leads to short blips in availability when using common conn pool setups, due to the closing of SQL connections by CRDB.
Our understanding of the mechanics is as follows:
drain_wait
period (default 0s), where connections can no longer be established but existing connections motor on. /healthz will return unavailable, so LBs can mark the node as unavailable.query_wait
(default 10s), where CRDB attempts to close connections that go from non-idle to idle during query execution.query_wait
, CRDB closes open SQL connections - idle or not.Note there are some mitigating factors, which can sort of be understood as a way fo communicating a conn is not be used to the client, but there are races and so some availability blip can still happen:
aliveBypassWindowMs
time has passed since last health check of conn, a health check is run. This health check will fail if the conn is already closed. Then no availability blip. The issue is there is a race: The health check may pass then after that CRDB may close the conn. Then there is an availability blip.database/sql
withpgx
driver: Not sure if there a blackbox health check conn before use when grabbing an idle already used conn like with hikari. OTOH, see https://github.com/cockroachdb/cockroach/issues/67071#issuecomment-872212976 for details on how if errors before sending over the network (presumably if conn is gracefully closed this happens), retries are made bydatabase/sql
in a way that is transparent to the client. But there is also a race, which is sort of discussed here: https://github.com/golang/go/issues/11978. If the conn is closed post sending on the socket, it's not possible to retry. Thanks, @JeffSwenson, for posting this issue.In summary, this failure mode is prob present in all or almost all conn pool setups, but magnitude of unavailability blip prob depends on conn pool details & workload details.
Without additional changes to CRDB (which are discussed in the below doc):
drain_wait
+ estimate of max txn time + estimate of how long takes to remove nodes from the LB's list of healthy nodes, so that by the timedrain_wait
seconds pass, the client has closed all connections to the draining node itself. This means a mix of longer drain time & shorter connection lifetimes; the latter means higher CPU usage esp. if you do password authentication; the former has disadvantages including greater time in a mixed version state.There is a longer form doc about this from @otan and @rafiss: https://docs.google.com/document/d/1oOQ4pnTxEdUPA69qlK5NLCk0C-RldIJeXdlcY1HE3nY/edit#
To Reproduce
From @rafiss re: hikari: https://docs.google.com/document/d/1b7ENgM36xYepaQ-PPG5PESl133m7YcPPatayRGGuAkM/edit
Expected behavior CRBD markets itself as having a zero downtime upgrade process & this ticket points out that to achieve this goal, connection pool settings + CRDB settings need to be set carefully, and also in a way that has definite downsides.
Additional context This is affecting a CC production customer.
gz#8424
Epic CRDB-10458
Jira issue: CRDB-8347