go/adbc/driver/flightsql: performance loss in go1.22.5/go1.21.12

lidavidm commented 1 month ago

What happened?

There appears to be an incredible amount of FFI overhead when using a driver built by recent versions of Go.

lidavidm@debian ~/C/arrow-adbc ((adfb70a2))> python tester.py                                                                                                                     (adbc-good) 
Driver: /home/lidavidm/miniforge3/envs/adbc-good/lib/libadbc_driver_flightsql.so
Completed 100 iterations
Completed 200 iterations
Completed 300 iterations
Completed 400 iterations
Completed 500 iterations
Mean:   0.005054534418188268
Median: 0.004982210506568663
Sum:    2.527267209094134

lidavidm@debian ~/C/arrow-adbc ((adfb70a2))> python tester.py                                                                                                                      (adbc-bad) 
Driver: /home/lidavidm/miniforge3/envs/adbc-bad/lib/libadbc_driver_flightsql.so
Completed 100 iterations
Completed 200 iterations
Completed 300 iterations
Completed 400 iterations
Completed 500 iterations
Mean:   0.023634329783672
Median: 0.02316386450547725
Sum:    11.817164891836

Stack Trace

N/A

How can we reproduce the bug?

$ go run github.com/apache/arrow/go/v17/arrow/flight/flightsql/example/cmd/sqlite_flightsql_server@latest -port 8080

import statistics
import time
import sys
import adbc_driver_flightsql.dbapi

print("Driver:", adbc_driver_flightsql._driver_path())

times = []

for i in range(500):
    start = time.monotonic()
    with adbc_driver_flightsql.dbapi.connect("grpc://localhost:8080") as conn:
        with conn.cursor() as cursor:
            cursor.execute("SELECT 1")
            cursor.fetchall()
    end = time.monotonic()
    times.append(end - start)

    if (i + 1) % 100 == 0:
        print(f"Completed {i + 1} iterations")

print("Mean:  ", statistics.mean(times))
print("Median:", statistics.median(times))
print("Sum:   ", sum(times))
sys.exit(0)

Environment/Setup

Linux x86_64

Conda package versions 1.0.0 (built with go1.21.7, IIRC) vs 1.1.0 (go 1.22.5)

lidavidm commented 1 month ago

I eventually bisected this to a change in the Go compiler: https://github.com/golang/go/commit/3560cf0afb

@zeroshade believes it may be mallocator, but a pure-Go reproducer does not exhibit any problems:

lidavidm@debian ~/C/a/g/adbc ((adfb70a2)) [1]> go test -v ./driver/flightsql/... -bench=. -run=nonexistent -benchtime 5s                                                            (adbc-go) 
goos: linux
goarch: amd64
pkg: github.com/apache/arrow-adbc/go/adbc/driver/flightsql
cpu: Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
BenchmarkDefaultAllocator
BenchmarkDefaultAllocator-16            4518       1195807 ns/op
BenchmarkMallocator
BenchmarkMallocator-16                  4952       1237689 ns/op
PASS
ok      github.com/apache/arrow-adbc/go/adbc/driver/flightsql   16.428s
?       github.com/apache/arrow-adbc/go/adbc/driver/flightsql/cmd/testserver    [no test files]

Attempt at reproduction in Go

```go package flightsql import ( "context" "testing" "github.com/apache/arrow/go/v16/arrow/memory" "github.com/apache/arrow/go/v16/arrow/memory/mallocator" ) func BenchmarkDefaultAllocator(b *testing.B) { DoBenchmark(b, memory.DefaultAllocator) } func BenchmarkMallocator(b *testing.B) { DoBenchmark(b, mallocator.NewMallocator()) } func DoBenchmark(b *testing.B, alloc memory.Allocator) { for i := 0; i < b.N; i++ { func() { driver := NewDriver(alloc) db, err := driver.NewDatabase(map[string]string{"uri": "grpc://localhost:8080"}) if err != nil { b.Fatal(err) } defer db.Close() cnxn, err := db.Open(context.Background()) if err != nil { b.Fatal(err) } stmt, err := cnxn.NewStatement() if err != nil { b.Fatal(err) } defer stmt.Close() if err = stmt.SetSqlQuery("SELECT 1"); err != nil { b.Fatal(err) } reader, _, err := stmt.ExecuteQuery(context.Background()) if err != nil { b.Fatal(err) } reader.Release() }() } } ```

zeroshade commented 1 month ago

So running things locally myself, I was using the GOTOOLCHAIN env var in order to build using different versions of go. Using your example and just swapping out the binary shared object which is being loaded, I'm not seeing anywhere near the level of difference that you are seeing comparing go1.22.5 and go1.21.12:

go1.22.5:

Completed 100 iterations
Completed 200 iterations
Completed 300 iterations
Completed 400 iterations
Completed 500 iterations
Mean:   0.012383669883012772
Median: 0.011782394954934716
Sum:    6.191834941506386

go1.21.12:

Completed 100 iterations
Completed 200 iterations
Completed 300 iterations
Completed 400 iterations
Completed 500 iterations
Mean:   0.01193752330634743
Median: 0.011252027237787843
Sum:    5.968761653173715

And if I run the script multiple times, I end up with roughly the same variance between both cases. So i'm not quite sure what's going on and how you're reproducing it.

zeroshade commented 1 month ago

@lidavidm even more interesting is that the change your referring to was only introduced in go1.22.5 and was backported to go1.21.12. Which means both versions you pointed out in this issue have that change in them, so something doesn't add up here.

lidavidm commented 1 month ago

Again, look at the report...It's go1.21.7 for the "reference". I specifically was not comparing 1.21.12.

zeroshade commented 1 month ago

Strap in friends, I've narrowed down the cause and it's a doozy!

In case you don't want to follow along, here's the TL;DR:

https://github.com/golang/go/commit/3560cf0afb introduced a change into go1.22.5 (and was backported to go1.21.12) which forces Go always update the stack boundaries when making a cgo call. This has the effect of causing a significant performance degradation in the scenario of repeated calls to the ADBC driver (in the example @lidavidm provided above we go from ~2s to >~5.5s). For the time being this can be mitigated by building our distributable binaries with go1.22.4 or go1.21.11 which do not have this change. But going forward this is a problem. At the moment, the only likely solution I can think of is to file an issue with the golang/go repo and see if someone can fix this.

Now, if you want the gory details of this let's have some fun!

https://github.com/golang/go/commit/3560cf0afb was introduced to address the issue identified by https://github.com/golang/go/issues/62440 in which the issue was ultimately that in between calls to a Go callback from C, C was changing out the stack leading to a crash when the known bounds of a goroutine didn't have enough space on the stack left to operate. The solution ended up to always have the Goroutine update its stack boundaries before going on to perform a call from C.

So naturally, we assumed that the performance issue was due to the calls into Go from C being more expensive, but we didn't know why this was the case as getting some stack boundaries shouldn't be an expensive operation, right?

After getting the example working with pprof, I did some digging to see if the issue could be in the upstream Arrow library (the cdata package) or in how we package the drivers here. What stuck out to me was this:

Notice that it just says runtime._ExternalCode which is calling libc.so.6 and taking up over 50% of the runtime. This was confusing because according to this, the slowdown is in libc, not in the Go runtime? How could that be?

For comparison, if I compile with go1.22.4 instead, I get the following flamegraph:

Note how small the runtime._ExternalCode entry there is.

Important note: I had to disable all our manual calls to runtime.GC() in the driver.go file otherwise all of the GC calls, mark and sweep, and all of that ends up dominating the profiles making it much more difficult to pin down the causes.

So I dd a bunch of research and found out that runtime._ExternalCode is just what pprof puts when it has no insight into the profiling. Which means it's native code (C or assembly) which pprof can't report the profiling for. So now what? Well, there is the handy dandy perf tool on linux!

After reading a bunch of tutorials and figuring out how it all works, I gave it a shot and got the following by using perf record -g -- python repro.py...

When compiling with go1.22.4 I got the following results from perf:

This breaks down to roughtly:

66.59% -> blas_thread_server (i.e. numpy doing stuff! not sure why, but weird!)
13.60% -> runtime.goexit.abi0 (grpc cleanup callbacks)
the next several ones collectively add up to ~38% -> cgo callbacks (i.e. the Driver implementation functions)

When I compile with go1.22.5 I get this instead:

Which breaks down to:

35.85% -> blas_thread_server (i.e. numpy. seriously, wtf??)
7.74% -> runtime.goexit.abi0 (grpc cleanup)
the cgo callbacks for the driver collectively add up to ~42%
21.56% -> __isoc99_sscanf

This is where I had a huge WTF moment: how the hell did sscanf get involved here? And why the hell is it taking up so much time?? After a few hours, I was finally able to figure it out.

I played around with the options in perf and googled a LOT and ultimately found that I could try changing the way it attempts to determine the call-graph and frames. I tried using --call-graph dwarf but that failed horribly, as I expected, because we're not using DWARF executables. But when I changed it to lbr I was able to get the following output:

In that screenshot you can see the full chain:

Before executing the Go code from the CGO callback, it calls runtime.callbackUpdateSystemStack
Which internally calls asmcgocall to x_cgo_getstackbound
Which then calls pthread_getattr_np
Which apparently, calls __isoc99_sscanf (which is just what GCC calls whenever you use sscanf in C code).

And the calls to sscanf end up completely and utterly dominating the entire cost of the release callbacks, which is ludicrous and hilarious.

Ultimately, I think I'm gonna have to file an issue on golang/go repo to address this because I can't see any other way around it.

lidavidm commented 1 month ago

From a quick glance at glibc source it's because pthread_getattr_np is implemented by parsing /proc/self/maps

apache / arrow-adbc