cockroachdb / cockroach

CockroachDB - the open source, cloud-native distributed SQL database.
https://www.cockroachlabs.com

Draining SQL connections, as is done during CRDB graceful updates, can lead to short blips in availability when using common conn pool setups, due to the closing of SQL connections by CRDB #67071

Closed joshimhoff closed 2 years ago

joshimhoff commented 3 years ago

Describe the problem

As discovered by @rafiss & @otan, draining SQL connections, as is done during CRDB graceful updates, leads to short blips in availability when using common conn pool setups, due to the closing of SQL connections by CRDB.

Our understanding of the mechanics is as follows:

Note there are some mitigating factors, which can sort of be understood as ways of communicating to the client that a conn should not be used, but there are races, and so some availability blip can still happen:

In summary, this failure mode is probably present in all or almost all conn pool setups, but the magnitude of the unavailability blip probably depends on conn pool details & workload details.

Without additional changes to CRDB (which are discussed in the doc below):

  1. To definitely avoid it: One can set a connection pool max lifetime less than drain_wait + an estimate of max txn time + an estimate of how long it takes to remove nodes from the LB's list of healthy nodes, so that by the time drain_wait seconds pass, the client has closed all connections to the draining node itself. This means a mix of longer drain times & shorter connection lifetimes; the latter means higher CPU usage, esp. if you do password authentication; the former has disadvantages including greater time in a mixed-version state. A sketch of where this knob lives in Go's database/sql follows this list.
  2. One can accept the small blip in availability, as if our understanding is correct, it should be a relatively small blip.
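
For reference, a minimal Go sketch of where the client-side knob from option 1 lives when using database/sql; the pgx driver, the pool sizes, and the 4-minute value are placeholder assumptions, not recommendations, and the lifetime must be derived from your own drain_wait, txn, and LB timings as described above.

```
package connpool

import (
	"database/sql"
	"time"

	_ "github.com/jackc/pgx/v4/stdlib" // pgx registered as a database/sql driver, as elsewhere in this thread
)

func openPool(url string) (*sql.DB, error) {
	db, err := sql.Open("pgx", url)
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(10)
	db.SetMaxIdleConns(10)
	// The knob from option 1: a lifetime derived from the server-side drain
	// timing plus estimates of max txn time and LB health-check removal time,
	// so every conn is recycled before the draining node starts closing conns.
	// The value below is a placeholder, not a recommendation.
	db.SetConnMaxLifetime(4 * time.Minute)
	return db, nil
}
```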

There is a longer form doc about this from @otan and @rafiss: https://docs.google.com/document/d/1oOQ4pnTxEdUPA69qlK5NLCk0C-RldIJeXdlcY1HE3nY/edit#

To Reproduce

From @rafiss re: hikari: https://docs.google.com/document/d/1b7ENgM36xYepaQ-PPG5PESl133m7YcPPatayRGGuAkM/edit

Expected behavior CRDB markets itself as having a zero-downtime upgrade process, & this ticket points out that to achieve this goal, connection pool settings + CRDB settings need to be set carefully, and also in a way that has definite downsides.

Additional context This is affecting a CC production customer.

gz#8424

Epic CRDB-10458

Jira issue: CRDB-8347

joshimhoff commented 3 years ago

I have various Qs about this still:

Though it appears to be possible to check if a TCP connection is closed without sending / receiving over the network (https://stackoverflow.com/questions/12741386/how-to-know-tcp-connection-is-closed-in-net-package), some connection pool implementations (possibly/probably (?) many/all but at least hikari) don't do this.
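
For context, here is a hedged Go sketch of what such a check could look like, along the lines of the Stack Overflow approach linked above; connIsClosed is a hypothetical helper, not something any of these pools actually ship, and a real pool would have to avoid consuming protocol bytes.

```
package connpool

import (
	"errors"
	"io"
	"net"
	"time"
)

// connIsClosed probes an idle net.Conn with a very short read deadline.
// io.EOF means the peer closed the conn; a timeout means it is still open
// (and idle); any other error is treated as the conn being unusable.
// Caveat: if the server has already sent bytes, this probe would consume
// one of them, so a real pool would need to be more careful than this.
func connIsClosed(conn net.Conn) bool {
	if err := conn.SetReadDeadline(time.Now().Add(time.Millisecond)); err != nil {
		return true
	}
	defer conn.SetReadDeadline(time.Time{}) // clear the deadline again
	buf := make([]byte, 1)
	_, err := conn.Read(buf)
	if errors.Is(err, io.EOF) {
		return true
	}
	var nerr net.Error
	if errors.As(err, &nerr) && nerr.Timeout() {
		return false
	}
	return err != nil
}
```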

Do any connection pool implementations avoid this failure mode by checking for a closed connection (which IIUC doesn't require sending / receiving over the network so could be fast) & retrying without bubbling up an error to the user if one is found? That seems possible to me still & could explain why we don't see evidence of this happening widely, despite many customers running in production.

Post hitting this error, connection pool implementations stop using the closed connection, but the above issues do bubble up to the application layer, so there is some impact on availability.

I don't really understand what hikari & other connection pool implementations do here. When they hit I/O timeout errors specifically, do they then check the health of the connection &, if that second check fails, recycle it? In a way this ties into my first Q. I wouldn't mind links to the code to answer this second Q.

Perhaps @rafiss has thoughts (tho no rush Rafi... I know you are busy).

rafiss commented 3 years ago

Do any connection pool implementations avoid this failure mode by checking for a closed connection (which IIUC doesn't require sending / receiving over the network so could be fast) & retrying without bubbling up an error to the user if one is found?

One thing missing from the issue summary here is that connection pools already check for the health of a connection before using it. Specifically, here is the order of events when the errors happen.

Here is the getConnection code in Hikari. Note that it calls isConnectionAlive. And here you can see the isConnectionAlive implementation.

After the connection pool gives the connection to the application, then the connection pool can no longer perform the check you are proposing. The application itself could perform the check each time it tries to use the connection (a connection which was already deemed "healthy"), but that would be incredibly onerous in the code.

But that brings me to the next point -- how would doing the check even help? The client will end up in the same situation:

The important point is that there's no general retry logic that can be used. If the application is in the middle of a transaction when it gets the error, then the connection pool wouldn't know that it has to retry all the previous statements from that transaction. Or if the application had already finished one transaction, but got the error when doing a second, the connection pool wouldn't know what's already succeeded and what hasn't.
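
To make this concrete, here is a hedged Go sketch of the kind of whole-transaction retry that only the application can write, since only it knows every statement in the transaction; runTxn and its retry policy are illustrative assumptions, not an existing API.

```
package connpool

import (
	"context"
	"database/sql"
)

// runTxn re-runs the entire transaction body on failure. A pool or driver
// cannot do this on the application's behalf, because it only sees
// individual statements, never the transaction as a unit.
func runTxn(ctx context.Context, db *sql.DB, fn func(*sql.Tx) error) error {
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		tx, err := db.BeginTx(ctx, nil)
		if err != nil {
			lastErr = err
			continue // e.g. the pooled conn was already closed; try a fresh one
		}
		if err := fn(tx); err != nil {
			_ = tx.Rollback()
			lastErr = err
			continue // only the application knows fn is safe to re-run from scratch
		}
		if err := tx.Commit(); err != nil {
			lastErr = err
			continue
		}
		return nil
	}
	return lastErr
}
```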


I don't really understand what hikari & other connection pool implementations do here. When they hit I/O timeout errors specifically, do they then check the health of the connection &, if that second check fails, recycle it? In a way this ties into my first Q. I wouldn't mind links to the code to answer this second Q.

The pool always checks the health of a connection when it gives it to the application. Specifically in Hikari, this block will return false for isConnectionAlive. Also, more to your question, if an error happens while the connection is in use, then the logic of Hikari's ProxyConnection will evict that connection from the pool.


It would be helpful to understand what we want out of this issue beyond what is described in https://github.com/cockroachdb/cockroach/issues/66319

joshimhoff commented 3 years ago

It would be helpful to understand what we want out of this issue beyond what is described in #66319

I think we want a shared mental model of the problem, rather than a feature request that may improve the situation.

If the application is in the middle of a transaction when it gets the error, then the connection pool wouldn't know that it has to retry all the previous statements from that transaction.

The fact that the need to close a conn in the middle of a txn leads to an unavailability blip makes sense to me. Customers need to not have overly long-running queries if they want zero downtime. We can make that recommendation; that is, txns must finish within query_wait seconds (or something like that).
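
For illustration, a hedged Go sketch of bounding a transaction client-side so it stays within such a budget; boundedTxn, the query, and the 10s figure are assumptions for illustration only.

```
package connpool

import (
	"context"
	"database/sql"
	"time"
)

// boundedTxn keeps the whole transaction well under the server's query_wait
// budget (the 10s figure is a placeholder), so a drain never has to cut it
// off mid-flight.
func boundedTxn(db *sql.DB) error {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx, "UPDATE t SET x = x + 1 WHERE id = $1", 1); err != nil {
		_ = tx.Rollback()
		return err
	}
	return tx.Commit()
}
```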

I am thinking only about idle conns. The closing of an idle conn can currently lead to an availability blip, and I don't understand why this is necessary.

Or if the application had already finished one transaction, but got the error when doing a second, the connection pool wouldn't know what's already succeeded and what hasn't.

Perhaps this is the answer to my Q but I don't understand it.

This connection is given to the application.

Or this? Is a specific connection always literally returned to the application in hikari land? Let's look at https://golang.org/pkg/database/sql/#DB. You can ask for a specific conn via https://golang.org/pkg/database/sql/#DB.Conn but why would you do that? That opens you up to this failure mode & also to failures when nodes die randomly. But if you always query via *DB (e.g. https://golang.org/src/database/sql/sql.go?s=42508:42577#L1538), then you are not committing to sending over a specific conn. In this case, if someone begins executing a SQL query and that query immediately fails in a way that indicates the connection has been closed, I don't see why database/sql can't retry without bubbling the error to the application layer, simply by choosing another conn. Do you see what I mean? Or am I getting confused?
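
For illustration, here is a hedged Go sketch of the two patterns being contrasted; the function names, table, and query are placeholders.

```
package connpool

import (
	"context"
	"database/sql"
)

// Pattern 1: query via *sql.DB. The pool picks any conn, and database/sql
// can retry on driver.ErrBadConn (see the retry loop quoted in the next comment).
func viaDB(ctx context.Context, db *sql.DB, id int) error {
	_, err := db.ExecContext(ctx, "UPDATE t SET x = x + 1 WHERE id = $1", id)
	return err
}

// Pattern 2: pin a specific conn via DB.Conn. If the server closes that conn
// while it is held, the error surfaces to the application instead.
func viaConn(ctx context.Context, db *sql.DB, id int) error {
	conn, err := db.Conn(ctx)
	if err != nil {
		return err
	}
	defer conn.Close() // return the conn to the pool as soon as possible
	_, err = conn.ExecContext(ctx, "UPDATE t SET x = x + 1 WHERE id = $1", id)
	return err
}
```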

joshimhoff commented 3 years ago

Trying to understand whether this issue could hit a user of golang database/sql & why. This is interesting:

// ExecContext executes a query without returning any rows.
// The args are for any placeholder parameters in the query.
func (db *DB) ExecContext(ctx context.Context, query string, args ...interface{}) (Result, error) {
    var res Result
    var err error
    for i := 0; i < maxBadConnRetries; i++ {
        res, err = db.exec(ctx, query, args, cachedOrNewConn)
        if err != driver.ErrBadConn {
            break
        }
    }
    if err == driver.ErrBadConn {
        return db.exec(ctx, query, args, alwaysNewConn)
    }
    return res, err
}

Drilling deeper:

func (db *DB) exec(ctx context.Context, query string, args []interface{}, strategy connReuseStrategy) (Result, error) {
    dc, err := db.conn(ctx, strategy)
    if err != nil {
        return nil, err
    }
    return db.execDC(ctx, dc, dc.releaseConn, query, args)
}

Also this:

// maxBadConnRetries is the number of maximum retries if the driver returns
// driver.ErrBadConn to signal a broken connection before forcing a new
// connection to be opened.
const maxBadConnRetries = 2

There is a retry loop for getting a usable connection.

From https://golang.org/src/database/sql/sql.go.

What does driver.ErrBadConn mean exactly?

// conn returns a newly-opened or cached *driverConn.
func (db *DB) conn(ctx context.Context, strategy connReuseStrategy) (*driverConn, error) {
    db.mu.Lock()
    if db.closed {
        db.mu.Unlock()
        return nil, errDBClosed
    }
    // Check if the context is expired.
    select {
    default:
    case <-ctx.Done():
        db.mu.Unlock()
        return nil, ctx.Err()
    }
    lifetime := db.maxLifetime

    // Prefer a free connection, if possible.
    numFree := len(db.freeConn)
    if strategy == cachedOrNewConn && numFree > 0 {
        conn := db.freeConn[0]
        copy(db.freeConn, db.freeConn[1:])
        db.freeConn = db.freeConn[:numFree-1]
        conn.inUse = true
        if conn.expired(lifetime) {
            db.maxLifetimeClosed++
            db.mu.Unlock()
            conn.Close()
            return nil, driver.ErrBadConn
        }
        db.mu.Unlock()

        // Reset the session if required.
        if err := conn.resetSession(ctx); err == driver.ErrBadConn {
            conn.Close()
            return nil, driver.ErrBadConn
        }

        return conn, nil
    }

    // Out of free connections or we were asked not to use one. If we're not
    // allowed to open any more connections, make a request and wait.
    if db.maxOpen > 0 && db.numOpen >= db.maxOpen {
        // Make the connRequest channel. It's buffered so that the
        // connectionOpener doesn't block while waiting for the req to be read.
        req := make(chan connRequest, 1)
        reqKey := db.nextRequestKeyLocked()
        db.connRequests[reqKey] = req
        db.waitCount++
        db.mu.Unlock()

        waitStart := nowFunc()

        // Timeout the connection request with the context.
        select {
        case <-ctx.Done():
            // Remove the connection request and ensure no value has been sent
            // on it after removing.
            db.mu.Lock()
            delete(db.connRequests, reqKey)
            db.mu.Unlock()

            atomic.AddInt64(&db.waitDuration, int64(time.Since(waitStart)))

            select {
            default:
            case ret, ok := <-req:
                if ok && ret.conn != nil {
                    db.putConn(ret.conn, ret.err, false)
                }
            }
            return nil, ctx.Err()
        case ret, ok := <-req:
            atomic.AddInt64(&db.waitDuration, int64(time.Since(waitStart)))

            if !ok {
                return nil, errDBClosed
            }
            // Only check if the connection is expired if the strategy is cachedOrNewConns.
            // If we require a new connection, just re-use the connection without looking
            // at the expiry time. If it is expired, it will be checked when it is placed
            // back into the connection pool.
            // This prioritizes giving a valid connection to a client over the exact connection
            // lifetime, which could expire exactly after this point anyway.
            if strategy == cachedOrNewConn && ret.err == nil && ret.conn.expired(lifetime) {
                db.mu.Lock()
                db.maxLifetimeClosed++
                db.mu.Unlock()
                ret.conn.Close()
                return nil, driver.ErrBadConn
            }
            if ret.conn == nil {
                return nil, ret.err
            }

            // Reset the session if required.
            if err := ret.conn.resetSession(ctx); err == driver.ErrBadConn {
                ret.conn.Close()
                return nil, driver.ErrBadConn
            }
            return ret.conn, ret.err
        }
    }

    db.numOpen++ // optimistically
    db.mu.Unlock()
    ci, err := db.connector.Connect(ctx)
    if err != nil {
        db.mu.Lock()
        db.numOpen-- // correct for earlier optimism
        db.maybeOpenNewConnections()
        db.mu.Unlock()
        return nil, err
    }
    db.mu.Lock()
    dc := &driverConn{
        db:         db,
        createdAt:  nowFunc(),
        returnedAt: nowFunc(),
        ci:         ci,
        inUse:      true,
    }
    db.addDepLocked(dc, dc)
    db.mu.Unlock()
    return dc, nil
}

Focusing on the bit that uses an existing free conn:

    // Prefer a free connection, if possible.
    numFree := len(db.freeConn)
    if strategy == cachedOrNewConn && numFree > 0 {
        conn := db.freeConn[0]
        copy(db.freeConn, db.freeConn[1:])
        db.freeConn = db.freeConn[:numFree-1]
        conn.inUse = true
        if conn.expired(lifetime) {
            db.maxLifetimeClosed++
            db.mu.Unlock()
            conn.Close()
            return nil, driver.ErrBadConn
        }
        db.mu.Unlock()

        // Reset the session if required.
        if err := conn.resetSession(ctx); err == driver.ErrBadConn {
            conn.Close()
            return nil, driver.ErrBadConn
        }

        return conn, nil
    }

Return driver.ErrBadConn if the conn is too old or if "resetting the session" returns driver.ErrBadConn. Maybe calling resetSession fails with driver.ErrBadConn in case CRDB has closed the conn on the server side already? In which case the only way to hit this issue when using *DB's ExecContext is a race where resetSession runs right before CRDB closes all conns.

Looks about right:

// SessionResetter may be implemented by Conn to allow drivers to reset the
// session state associated with the connection and to signal a bad connection.
type SessionResetter interface {
    // ResetSession is called prior to executing a query on the connection
    // if the connection has been used before. If the driver returns ErrBadConn
    // the connection is discarded.
    ResetSession(ctx context.Context) error
}

Specifically this part in the docs: "... to signal a bad connection".

From https://golang.org/src/database/sql/driver/driver.go

In which case the only way to hit this issue when using *DB's ExecContext is a race where resetSession runs right before CRDB closes all conns.

So a user of database/sql can be affected. But perhaps not very likely?

OTOH, if you call DB.Conn (https://golang.org/pkg/database/sql/#DB.Conn) & hold onto it for some time, you can increase the likelihood of being affected by the above bug quite a lot, as there is more time for the conn to be closed between the resetSession call and the use of the conn. Minimizing how long you hold onto an explicit connection -- or actually ideally avoiding it altogether (use *DB instead) -- seems very wise to me. Possibly some of our customers don't do that.

Not sure I have this right but I am learning things at least...

joshimhoff commented 3 years ago

Also, @rafiss, after reading database/sql, I see what you mean when you say:

One thing missing from the issue summary here is that connection pools already check for the health of a connection before using it.

I'll update the issue summary.

Wondering if you have any thoughts about the database/sql analysis above. In particular this bit as I could see it contributing to the customer issue we are looking at:

OTOH, if you call DB.Conn (https://golang.org/pkg/database/sql/#DB.Conn) & hold onto it for some time, you can increase the likelihood of being affected by the above bug quite a lot, as there is more time for the conn to be closed between the resetSession call and the use of the conn. Minimizing how long you hold onto an explicit connection -- or actually ideally avoiding it altogether (use *DB instead) -- seems very wise to me. Possibly some of our customers don't do that.

joshimhoff commented 3 years ago

Now I am looking at the pgx driver for database/sql and I found something very interesting!! https://github.com/jackc/pgx

Look at this beautiful thing:

func (c *Conn) ExecContext(ctx context.Context, query string, argsV []driver.NamedValue) (driver.Result, error) {
    if c.conn.IsClosed() {
        return nil, driver.ErrBadConn
    }

    args := namedValueToInterface(argsV)

    commandTag, err := c.conn.Exec(ctx, query, args...)
    // if we got a network error before we had a chance to send the query, retry
    if err != nil {
        if pgconn.SafeToRetry(err) {
            return nil, driver.ErrBadConn
        }
    }
    return driver.RowsAffected(commandTag.RowsAffected()), err
}

Combine above with below from database/sql:

// ExecContext executes a query without returning any rows.
// The args are for any placeholder parameters in the query.
func (db *DB) ExecContext(ctx context.Context, query string, args ...interface{}) (Result, error) {
    var res Result
    var err error
    for i := 0; i < maxBadConnRetries; i++ {
        res, err = db.exec(ctx, query, args, cachedOrNewConn)
        if err != driver.ErrBadConn {
            break
        }
    }
    if err == driver.ErrBadConn {
        return db.exec(ctx, query, args, alwaysNewConn)
    }
    return res, err
}

I think if you use database/sql, pgx as a driver, and query via DB.ExecContext (query against *DB & don't open a txn), you will NOT ever hit the bug described in this ticket. If a conn is closed by CRDB, pgx will return driver.ErrBadConn, leading to a retry. This is what I was hoping for when I wrote:

Do any connection pool implementations avoid this failure mode by checking for a closed connection (which IIUC doesn't require sending / receiving over the network so could be fast) & retrying without bubbling up an error to the user if one is found?

(Maybe there is a race if the query happens right as the conn is being closed, meaning the conn is not fully closed yet, though. Sounds like it, as per "if we got a network error before we had a chance to send the query".)

Interestingly, I think you CAN hit the failure mode when trying to open a txn, as driver.ErrBadConn is never returned (except when you already know the conn is closed without sending over the network, that is, when the client closed the conn, not the server):

func (c *Conn) BeginTx(ctx context.Context, opts driver.TxOptions) (driver.Tx, error) {
    if c.conn.IsClosed() {
        return nil, driver.ErrBadConn
    }

    if pconn, ok := ctx.Value(ctxKeyFakeTx).(**pgx.Conn); ok {
        *pconn = c.conn
        return fakeTx{}, nil
    }

    var pgxOpts pgx.TxOptions
    switch sql.IsolationLevel(opts.Isolation) {
    case sql.LevelDefault:
    case sql.LevelReadUncommitted:
        pgxOpts.IsoLevel = pgx.ReadUncommitted
    case sql.LevelReadCommitted:
        pgxOpts.IsoLevel = pgx.ReadCommitted
    case sql.LevelRepeatableRead, sql.LevelSnapshot:
        pgxOpts.IsoLevel = pgx.RepeatableRead
    case sql.LevelSerializable:
        pgxOpts.IsoLevel = pgx.Serializable
    default:
        return nil, fmt.Errorf("unsupported isolation: %v", opts.Isolation)
    }

    if opts.ReadOnly {
        pgxOpts.AccessMode = pgx.ReadOnly
    }

    tx, err := c.conn.BeginTx(ctx, pgxOpts)
    if err != nil {
        return nil, err
    }

    return wrapTx{ctx: ctx, tx: tx}, nil
}

joshimhoff commented 3 years ago

I updated the issue summary as per feedback from Rafi!

jordanlewis commented 3 years ago

A very long-term idea: we could think about implementing live TCP migrations. There are systems that allow you to move a TCP connection from one node to another, with no interruption to the other end.

https://twitter.com/colmmacc/status/1407820679962054662?s=20

https://ipsj.ixsq.nii.ac.jp/ej/?action=repository_action_common_download&item_id=68424&item_no=1&attribute_id=1&file_no=1

joshimhoff commented 3 years ago

Sort of relevant database/sql issue posted by @JeffSwenson: https://github.com/golang/go/issues/11978

(Maybe there is a race if query happens right as conn is being closed meaning conn is not fully closed yet tho. Sounds like it as per "if we got a network error before we had a chance to send the query".)

In other words, yes, there is a race.

knz commented 2 years ago

@rafiss @joshimhoff for my own education, can you tell me how this issue and #66319 differ from each other?

joshimhoff commented 2 years ago

I think #66319 proposes a modification to CRDB (see title especially) that may help with the issue outlined in this ticket. If we want to merge things somehow, that SGTM, tho I think it's reasonable not to, given the above. I feel even more strongly that the details in this ticket re: the nature of the problem should be stored somewhere, as it took a long time for me, Rafi, & Oliver to get on the same page about the problem, as it is rather subtle.

rafiss commented 2 years ago

I wrote a toy test program, and found that using the pgx driver with database/sql reproduces the server is shutting down error. I ran the code below against a local 3-node cluster behind HAProxy, then started a graceful shutdown of one of the nodes.

The `pgx` with `database/sql` code:

```
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/jackc/pgx/v4/stdlib"
	_ "github.com/lib/pq"
	"github.com/shettyh/threadpool"
)

func main() {
	url := "postgresql://root@localhost:26257/defaultdb?sslmode=disable"
	db, err := sql.Open("pgx", url)
	if err != nil {
		panic(err)
	}
	db.SetMaxOpenConns(10)
	db.SetMaxIdleConns(10)
	pool := threadpool.NewThreadPool(10, 100000)
	var results []*threadpool.Future
	for i := 0; i < 1250; i++ {
		t := &theTask{db}
		res, err := pool.ExecuteFuture(t)
		if err != nil {
			panic(err)
		}
		results = append(results, res)
	}
	for i, resF := range results {
		res := resF.Get().(bool)
		activeConns := db.Stats().InUse
		totalConns := db.Stats().OpenConnections
		waitDur := db.Stats().WaitDuration
		maxLifetimeClosed := db.Stats().MaxLifetimeClosed
		fmt.Printf("%t %d activeConns=%d totalConns=%d waitDur=%d maxLifetimeClosed=%d\n",
			res, i, activeConns, totalConns, waitDur.Milliseconds(), maxLifetimeClosed)
	}
}

type theTask struct {
	db *sql.DB
}

func (t *theTask) Call() interface{} {
	var b bool
	row := t.db.QueryRow("SELECT pg_sleep(0.5)")
	if err := row.Scan(&b); err != nil {
		fmt.Printf("error! %v\n", err)
		return false
	}
	if !b {
		panic("not possible")
	}
	return b
}
```

Note the error does not happen when using database/sql with lib/pq. (In the above code replace sql.Open("pgx", url) with sql.Open("postgres", url).) This is because the error handling at https://github.com/lib/pq/blob/9e747ca50601fcb6c958dd89f4cb8aea3e067767/error.go#L481 seems to return ErrBadConn when CRDB is shutting down, which allows the retry logic that @joshimhoff described above to work.
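
For intuition, here is a hedged paraphrase of that effect, not lib/pq's actual code: translate the admin_shutdown SQLSTATE into driver.ErrBadConn so that database/sql's retry loop kicks in.

```
package connpool

import (
	"database/sql/driver"
	"errors"

	"github.com/lib/pq"
)

// maybeBadConn is NOT lib/pq's actual code; it is a hedged paraphrase of the
// effect described above: translate the admin_shutdown SQLSTATE (57P01),
// which surfaces as "server is shutting down", into driver.ErrBadConn so
// that database/sql discards the conn and retries the statement on another.
func maybeBadConn(err error) error {
	var pqErr *pq.Error
	if errors.As(err, &pqErr) && pqErr.Code == "57P01" { // admin_shutdown
		return driver.ErrBadConn
	}
	return err
}
```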

The error also happens when using pgx with pgxpool (see the next snippet for the code). So it looks like the pgx detection for "safe to retry" doesn't handle the way CRDB shuts down. My reading of https://github.com/jackc/pgconn/blob/53f5fed36c570f0b5c98d6ec2415658c7b9bd11c/pgconn.go#L949 is that pgx only marks an error as safe to retry if 0 bytes have been written to the socket. A hedged sketch of consulting that signal from application code follows the snippet below.

The `pgx` with `pgxpool` code:

```
package main

import (
	"context"
	"fmt"

	"github.com/jackc/pgx/v4/pgxpool"
	_ "github.com/jackc/pgx/v4/stdlib"
	_ "github.com/lib/pq"
	"github.com/shettyh/threadpool"
)

func main() {
	ctx := context.Background()
	url := "postgresql://root@localhost:26257/defaultdb?sslmode=disable"
	config, _ := pgxpool.ParseConfig(url)
	config.MaxConns = 10
	config.MinConns = 5
	db, err := pgxpool.ConnectConfig(ctx, config)
	if err != nil {
		panic(err)
	}
	pool := threadpool.NewThreadPool(10, 100000)
	var results []*threadpool.Future
	for i := 0; i < 1250; i++ {
		t := &theTask{db}
		res, err := pool.ExecuteFuture(t)
		if err != nil {
			panic(err)
		}
		results = append(results, res)
	}
	for i, resF := range results {
		res := resF.Get().(bool)
		activeConns := db.Stat().AcquiredConns()
		totalConns := db.Stat().IdleConns() + activeConns
		acquireDur := db.Stat().AcquireDuration()
		emptyAcquireCount := db.Stat().EmptyAcquireCount()
		fmt.Printf("%t %d activeConns=%d totalConns=%d acquireDur=%d emptyAcquireCount=%d\n",
			res, i, activeConns, totalConns, acquireDur.Milliseconds(), emptyAcquireCount)
	}
}

type theTask struct {
	db *pgxpool.Pool
}

func (t *theTask) Call() interface{} {
	var b bool
	row := t.db.QueryRow(context.Background(), "SELECT pg_sleep(0.5)")
	if err := row.Scan(&b); err != nil {
		fmt.Printf("error! %v\n", err)
		return false
	}
	if !b {
		panic("not possible")
	}
	return b
}
```
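
Relatedly, a hedged Go sketch of what consulting pgx's retry signal from application code could look like; pgconn.SafeToRetry is the real function referenced above, while execWithOneRetry and its retry policy are assumptions for illustration.

```
package connpool

import (
	"context"

	"github.com/jackc/pgconn"
	"github.com/jackc/pgx/v4"
)

// execWithOneRetry is an illustrative policy, not pgx API: only re-issue the
// statement when pgconn.SafeToRetry reports that nothing was written to the
// socket, which is the condition discussed above.
func execWithOneRetry(ctx context.Context, conn *pgx.Conn, query string) error {
	_, err := conn.Exec(ctx, query)
	if err != nil && pgconn.SafeToRetry(err) {
		// Nothing reached the server, so re-running the statement is safe.
		_, err = conn.Exec(ctx, query)
	}
	return err
}
```
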
joshimhoff commented 2 years ago

NICE REPRO!

joshimhoff commented 2 years ago

found that using the pgx driver with database/sql reproduces the server is shutting down error

Does it repro both ERROR: server is shutting down AND I/O timeouts after that due to conn closure (as happened with the earlier CC customer)? Otherwise the issue could be more:

But returning ERROR: server is shutting down first means a bit of unavailability ahead of the conn being closed, as database/sql can’t retry around that error. Thoughts, Rafi?

I believe the thinking is that since this is an "admin shutdown" error (see 57P01 in Appendix A. PostgreSQL Error Codes), the client should be able to reliably retry around this error. But of course that depends on the driver, and I don't know which drivers are smart about 57P01 and which are not.

That is, perhaps the pgx driver with database/sql doesn't retry around ERROR: server is shutting down but does around trying to send to a closed conn?

We could experiment by patching them to a special build that doesn't ever return ERROR: server is shutting down. Or actually we could just use your repro!!

rafiss commented 2 years ago

closing as part of Epic CRDB-10458