databricks / databricks-sql-go

Golang database/sql driver for Databricks SQL.
Apache License 2.0
37 stars 41 forks source link

Data race of reading row value of timestamp data type #179

Open 7phs opened 1 year ago

7phs commented 1 year ago

Hello,

I found a data race with reading data from Databricks simultaneously. For example, from different tables or reading catalogs metadata simultaneously.

A code to reproduce data race is

import (
    "context"
    "database/sql"
    "fmt"
    "sync"
    "testing"

    dbsql "github.com/databricks/databricks-sql-go"
)

func TestDatabricksDataRace(t *testing.T) {
    var (
        wg         sync.WaitGroup
        ctx        = context.Background()
        w          = make(chan bool)
        routineNum = 10
    )

    wg.Add(routineNum)

    for i := 0; i < routineNum; i++ {
        go func() {
            defer wg.Done()

            <-w

            if err := listCatalogs(ctx); err != nil {
                fmt.Println("ERROR:", err)
            }
        }()
    }

    close(w)

    wg.Wait()
}

func listCatalogs(ctx context.Context) error {
    connector, err := dbsql.NewConnector(
        dbsql.WithServerHostname(cfg.Host),
        dbsql.WithPort(int(cfg.Port)),
        dbsql.WithAccessToken(cfg.AccessToken),
        dbsql.WithHTTPPath(cfg.HTTPPath),
    )
    if err != nil {
        return err
    }

    db := sql.OpenDB(connector)
    defer db.Close()

    for i := 0; i < 10; i++ {
        err := func() error {
            r, err := db.QueryContext(ctx, "SHOW CATALOGS;")
            if err != nil {
                return err
            }
            defer r.Close()

            for r.Next() {
                var s string
                if err := r.Scan(&s); err != nil {
                    return err
                }
            }

            return nil
        }()

        if err != nil {
            return err
        }
    }

    return nil
}

A command to run this test with data race detector:

go test -race -run TestDatabricksDataRace .

A root cause of data race is not initialised field loc of arrow.TimestampType. It initialised in the first call of function arrow.TimestampType :: GetZone().

A workaround of data race is:

func init() {
   // init `arrow.TimestampType` before use it.
    _, _ = arrow.FixedWidthTypes.Timestamp_us.(*arrow.TimestampType).GetToTimeFunc()
}

Environment:

7phs commented 1 year ago

Related issue of Apache Arrow - https://github.com/apache/arrow/issues/38795

7phs commented 5 months ago

databricks-sql-go updated and uses Apache Arrow Go v16 with fixed data race.

This bug is fixed now.

7phs commented 5 months ago

A bug of a data race still exists with reverting the Apache Arrow version to v12.

nverkhachoyan commented 6 days ago

Why was the Apache Arrow version reverted back to v12? It uses golang.org/x/net v0.7.0 which has several vulnerabilities.

7phs commented 4 days ago

@nverkhachoyan The reason for reverting was to maintain backward compatibility with some end-user solutions.

The databricks-sql-go library has been updated to use golang.org/x/net v0.17.0, which addresses all known vulnerabilities. All dependencies adhere to the same versioning level (minor, in this case) as per Go module rules.

However, the data race bug in Apache Arrow v12 still persists.

You can use the following workaround:

func init() {
    _, _ = arrow.FixedWidthTypes.Timestamp_s.(*arrow.TimestampType).GetToTimeFunc()
    _, _ = arrow.FixedWidthTypes.Timestamp_ms.(*arrow.TimestampType).GetToTimeFunc()
    _, _ = arrow.FixedWidthTypes.Timestamp_us.(*arrow.TimestampType).GetToTimeFunc()
    _, _ = arrow.FixedWidthTypes.Timestamp_ns.(*arrow.TimestampType).GetToTimeFunc()
}

Insert this snippet into any appropriate file that uses Apache Arrow.