alexbrainman / odbc

odbc driver written in go

Panic when binding `varchar`/`string` types: slice bounds out of range [:668] with capacity 256 #165

Closed hawkaa closed 2 years ago

hawkaa commented 2 years ago

Dear @alexbrainman ,

Thank you for an excellent library! I can really see how much hard work has been put into making ODBC available to Go users. I was wondering if you could help us out a little with reading variable-length strings from Spark/Databricks.

We currently run our backend services in Go. While we previously used PostgreSQL and pq, we now need to connect to a Spark SQL endpoint (via Databricks). They offer both JDBC and ODBC drivers, and the latter is indeed what we need. We have been successful in connecting to and querying our Spark SQL endpoints with unixodbc, the Simba ODBC driver, and this package.

However, I have run into an issue reading columns with the STRING type. Here's the code to reproduce:

    log.Println("connecting to spark endpoint")
    dsn := fmt.Sprintf("Driver=%s;HOST=...;SparkServerType=3;PORT=443;ssl=1;AuthMech=3;ThriftTransport=2;transportMode=http;httpPath=...;Schema=default;UID=...;PWD=%s;", driver, pwd)
    db, err := sql.Open("odbc", dsn)
    if err != nil {
        return errors.Wrap(err, "unable to connect to spark endpoint")
    }

    log.Println("selecting first row of transaction table")
    rows, err := db.QueryContext(ctx, "select data from default.transactions limit 1")
    if err != nil {
        return errors.Wrap(err, "unable to query first row from the transactions table")
    }
    log.Println("calling next on rows")
    rows.Next()
    log.Println("BOOM!")

This panics with the following error message:

panic: runtime error: slice bounds out of range [:668] with capacity 256

goroutine 1 [running]:
github.com/alexbrainman/odbc.(*BindableColumn).Value(0xc00009a000, 0xc000129d70, 0x4049952)
    /Users/hakon/go/pkg/mod/github.com/alexbrainman/odbc@v0.0.0-20210605012845-39f8520b0d5f/column.go:259 +0x1b5
github.com/alexbrainman/odbc.(*Rows).Next(0xc00009c000, {0xc000092020, 0x1, 0xc00013e000})
    /Users/hakon/go/pkg/mod/github.com/alexbrainman/odbc@v0.0.0-20210605012845-39f8520b0d5f/rows.go:35 +0xef
database/sql.(*Rows).nextLocked(0xc00009e000)
    /usr/local/go/src/database/sql/sql.go:2967 +0x111
database/sql.(*Rows).Next.func1()
    /usr/local/go/src/database/sql/sql.go:2945 +0x2f
database/sql.withLock({0x4120290, 0xc00009e030}, 0xc000129e68)
    /usr/local/go/src/database/sql/sql.go:3396 +0x8c
database/sql.(*Rows).Next(0xc00009e000)
    /usr/local/go/src/database/sql/sql.go:2944 +0x6f
github.com/duneanalytics/hakon-experimenting/code/004-odbc-spark-debugging/internal/testbench.Run({0x4120d50, 0xc0000180b0}, {0x7ffeefbff9cd, 0x2f}, {0x7ffeefbff9ad, 0x1c})
    /Users/hakon/code/hakon-experimenting/code/004-odbc-spark-debugging/internal/testbench/testbench.go:34 +0x254
main.main()
    /Users/hakon/code/hakon-experimenting/code/004-odbc-spark-debugging/cmd/main.go:24 +0xcc
exit status 2

The default.transactions table contains a data column of type string; that is what describe default.transactions returns. string is the only valid data type for strings in Spark.

I have seen other issues mention this in the past. #98 is particularly relevant: @joshuasprow mentions there that VARCHAR columns are assigned 256 bytes by default, while for some databases the length can be much greater than that. I believe that is the issue we are running into as well.

I think I have narrowed the problem down a little.

When binding columns, a call to [C.SQLDescribeColW](https://github.com/alexbrainman/odbc/blob/39f8520b0d5f7ee720424b441e026a1892f96f5e/api/zapi_unix.go#L44) is made. It takes a statement handle and a column number and returns metadata about the column, including its length. The length returned for the data column is 256. This length is then passed on to NewVariableWidthColumn, which rightfully creates a buffer of size 256 for that column.

When calling rows.Next(), the driver reads the first result and tries to fit my 668-character string into the 256-byte buffer, which makes the program panic.
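
To make the mechanism concrete, here is a minimal stand-alone sketch (buf and indicator are stand-in variables, not the package's actual code): the buffer is sized from the length SQLDescribeColW reported, the fetch then claims a larger actual length, and slicing past the buffer's capacity panics with exactly the message above.

    // buf stands in for the bound column buffer, indicator for the
    // length the fetch reports for the cell; both are illustrative.
    buf := make([]byte, 256) // sized from SQLDescribeColW's reported column size
    indicator := 668         // actual byte length of the fetched value
    _ = buf[:indicator]      // panic: slice bounds out of range [:668] with capacity 256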

I have tried this on both macOS and Linux (with unixodbc) and ran into the same issue. Using unixodbc and the same driver with pyodbc works as expected.

Now my question to you is whether you have some pointers on how to solve my problem:

  1. Have I configured my driver wrong?
  2. Are there any input parameters I can use to override the default behaviour for strings without a length?
  3. Is this something that can be fixed in this repository? If so, I would be happy to open a PR to fix it, but I would need some help and pointers to do so.

Thank you!

Håkon

hawkaa commented 2 years ago

I actually fixed it with a small hack: https://github.com/hawkaa/odbc/commit/ae0055c5dfcbf3084f708acd7d6676085826a0db

But I am unsure whether this is a good idea and whether this is something we can make generic.

alexbrainman commented 2 years ago

Hello @hawkaa

When binding columns, a call to [C.SQLDescribeColW](https://github.com/alexbrainman/odbc/blob/39f8520b0d5f7ee720424b441e026a1892f96f5e/api/zapi_unix.go#L44) is made. It takes a statement handle and a column number and returns metadata about the column, including its length. The length returned for the data column is 256. This length is then passed on to NewVariableWidthColumn, which rightfully creates a buffer of size 256 for that column. When calling rows.Next(), the driver reads the first result and tries to fit my 668-character string into the 256-byte buffer, which makes the program panic.

Your explanation sounds reasonable. Your ODBC driver uses variable-size buffers for SQL_VARCHAR columns, so the SQLBindCol ODBC API cannot be used for such columns. SQLBindCol is memory efficient because it does not require allocating buffers every time you read a column result.
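
For columns like this, the usual ODBC alternative is to fetch the value in pieces with SQLGetData and grow the destination as you go. Here is a minimal sketch of that loop, assuming a hypothetical getData wrapper; it is not this package's API:

    // readVariableColumn sketches the SQLGetData-style loop: a fixed
    // scratch buffer is refilled by repeated driver calls until the
    // value is exhausted. getData stands in for a wrapper around the
    // SQLGetData ODBC call; it reports bytes written and whether more
    // data remains. This is illustrative, not this package's code.
    func readVariableColumn(getData func(buf []byte) (n int, more bool)) []byte {
        var out []byte
        chunk := make([]byte, 256) // reused for every call
        for {
            n, more := getData(chunk)
            out = append(out, chunk[:n]...)
            if !more {
                return out
            }
        }
    }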

Your change https://github.com/hawkaa/odbc/commit/ae0055c5dfcbf3084f708acd7d6676085826a0db looks reasonable in your situation. Your approach is sound; it just requires more ODBC calls and more memory allocation/deallocation.

I don't want to accept your change into the main repo because I believe your driver's SQLDescribeCol implementation is incorrect: it reports a column size of 256 and then returns more than 256 bytes.

Alex

hawkaa commented 2 years ago

Thank you for a detailed answer, @alexbrainman !

I don't want to accept your change into the main repo because I believe your driver's SQLDescribeCol implementation is incorrect: it reports a column size of 256 and then returns more than 256 bytes.

Of course that change does not belong in this repository; I just wanted to show you how I worked around the problem. Technically, we could have expanded this library's API with some bindVarcharColumns setting or similar, but as you say, there is a bug in the underlying driver. I will make sure to report it.
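
Purely for illustration, such a setting might look something like the sketch below; neither the type nor the field exists in this package, and bindVarcharColumns is just the placeholder name from the paragraph above:

    // Hypothetical sketch only: a knob that would let callers opt out
    // of SQLBindCol for variable-width columns and fall back to
    // SQLGetData-style reads. Nothing like this exists in the package.
    type connParams struct {
        BindVarcharColumns bool // false would force per-call reads for varchar columns
    }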

Thank you again!