aws / aws-xray-sdk-go

AWS X-Ray SDK for the Go programming language.
Apache License 2.0
276 stars 117 forks source link

Using xray.SQLContext with SetMaxOpenConns/SetConnMaxLifetime sometimes "failed to begin subsegment". #391

Open vbqz opened 1 year ago

vbqz commented 1 year ago

Hi,

My team bumped into failed to begin subsegment panics when using SQLContext together with setting limits on connection numbers and durations.

With aws-xray-sdk-go v1.8.0 the problem still occurs, but logs instead of panics (which is great!). But I wanted to check if we are misusing the SDK in anyway, and give you a heads up if there is some kind of race condition here.

Issue

When using a database connection created like:

instance, err = xray.SQLContext("postgres", url)
...
instance.SetMaxOpenConns(10)
instance.SetConnMaxLifetime(30 * time.Second)

Executing queries with instrumented contexts occasionally caused panics like:

"LOG panic: failed to begin subsegment named 'xyz@localhost': segment cannot be found."

From what I can tell, there is a small chance that database/sql will try to open a fresh conn using context.Background(), instead of using the caller context, or re-using an existing connection.

Reprod

I managed to come up with this slightly contrived program (tons of connections and very low lifetimes) that reliably triggers the failure on my machine:

package main

import (
    "context"
    "database/sql"
    "log"
    "net"
    "time"

    "github.com/aws/aws-xray-sdk-go/xray"

    // "sqlite" database driver.
    _ "modernc.org/sqlite"
)

func initDB() (*sql.DB, error) {
    db, err := xray.SQLContext("sqlite", "tmp.sqlite")
    if err != nil {
        return nil, err
    }

    db.SetMaxOpenConns(10)
    db.SetConnMaxLifetime(time.Millisecond)

    return db, nil

}

func oneRequest(db *sql.DB) error {
    var err error

    ctx, seg := xray.BeginSegment(context.Background(), "main")
    defer seg.Close(err)

    _, err = db.ExecContext(ctx, "select sqlite_version();")
    if err != nil {
        return err
    }

    return nil
}

func xraySink() {
    pc, err := net.ListenPacket("udp", ":2000")
    if err != nil {
        log.Fatalf("listen: %s", err)
    }
    defer pc.Close()

    buf := make([]byte, 1024)
    for {
        pc.ReadFrom(buf)
    }
}

func main() {

    // Give X-ray something to log to.
    go xraySink()

    db, err := initDB()
    if err != nil {
        log.Fatalf("open DB: %s", err)
    }

    start := make(chan struct{})

    for i := 0; i < 10_000; i++ {
        go func(i int) {
            <-start
            err := oneRequest(db)
            if err != nil {
                log.Fatalf("oneRequest failed: %s", err)
            }
        }(i)
    }
    close(start)
    time.Sleep(time.Second)
}

The stacktrace at the failed subsegment is:

0  0x00000001029a3e90 in github.com/aws/aws-xray-sdk-go/xray.BeginSubsegment
   at /Users/viktor/go/pkg/mod/github.com/aws/aws-xray-sdk-go@v1.8.0/xray/segment.go:283
1  0x000000010299db84 in github.com/aws/aws-xray-sdk-go/xray.Capture
   at /Users/viktor/go/pkg/mod/github.com/aws/aws-xray-sdk-go@v1.8.0/xray/capture.go:19
2  0x00000001029ae85c in github.com/aws/aws-xray-sdk-go/xray.(*driverConnector).Connect
   at /Users/viktor/go/pkg/mod/github.com/aws/aws-xray-sdk-go@v1.8.0/xray/sql_go110.go:59
3  0x0000000102267f58 in database/sql.(*DB).openNewConnection
   at /Users/viktor/local/go/src/database/sql/sql.go:1238
4  0x0000000102267ec0 in database/sql.(*DB).connectionOpener
   at /Users/viktor/local/go/src/database/sql/sql.go:1228
5  0x0000000102265a00 in database/sql.OpenDB.func1
   at /Users/viktor/local/go/src/database/sql/sql.go:792
6  0x00000001021a47e4 in runtime.goexit
   at /Users/viktor/local/go/src/runtime/asm_arm64.s:1172

So its the connectionOpener goroutine that gets a request for another connection, and tries to do it with context.Background().

wangzlei commented 1 year ago
_, err = db.ExecContext(ctx, "select sqlite_version();")

If db.ExecContext is an async method, xray sdk would create subsegment after the parent segment closed, then cause xray context missing panic.

Alternatively we recommend user move to opentelemetry, that SDK supports asynchronous better, has no context missing error.