Open joellubi opened 1 month ago
Wow, that's really odd. My first thought is to ask whether this reproduces on a non-Mac platform, or if it is specific to macOS?
I don't get this (macOS aarch64) when building from main:

```python
>>> import adbc_driver_flightsql.dbapi
>>> import adbc_driver_snowflake.dbapi
>>>
>>> adbc_driver_flightsql.dbapi.connect("grpc+tcp://localhost:8080")
/Users/lidavidm/Code/arrow-adbc/python/adbc_driver_manager/adbc_driver_manager/dbapi.py:307: Warning: Cannot disable autocommit; conn will not be DB-API 2.0 compliant
  warnings.warn(
<adbc_driver_manager.dbapi.Connection object at 0x13cb4cb50>
>>> adbc_driver_snowflake.dbapi.connect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lidavidm/Code/arrow-adbc/python/adbc_driver_snowflake/adbc_driver_snowflake/dbapi.py", line 120, in connect
    conn = adbc_driver_manager.AdbcConnection(db, **(conn_kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "adbc_driver_manager/_lib.pyx", line 681, in adbc_driver_manager._lib.AdbcConnection.__init__
  File "adbc_driver_manager/_lib.pyx", line 237, in adbc_driver_manager._lib.check_error
adbc_driver_manager.OperationalError: IO: 260000: account is empty
```
It also doesn't reproduce with the 0.11.0 wheels so this appears to be a macOS x64 issue, which is going to be difficult to debug as I only have an aarch64 machine.
Are you able to reproduce it on any other platforms?
OK, it looks like it does happen under Rosetta.
However, debugging under Rosetta is not working great, as lldb appears to get stuck and execution appears to be extremely slow.
Actually, under lldb the process just hangs forever, even if we only import/use a single driver. So that won't help.
@zeroshade is there a problem perhaps with having two copies of the go runtime in one process?
I do also see that there are public symbols in the driver from cgo et al
I think we're between a rock and a hard place here.
I think the interim solution would be to build a "combined" driver that has both drivers in one library.
I'm hesitant to hold up the release though since this has been a latent issue for a while.
CC @zeroshade @kou @paleolimbot for opinions on the last part too.
@cocoa-xu this will also affect BigQuery...it seems both C++ and Go have packaging issues here...
FWIW, the issue above also seems to indicate it can only be reproduced on amd64 macOS.
This has been a latent issue for quite some time, and at least up to this point our consumers don't tend to have multiple drivers in one process (or if they do, it's not on macOS and isn't exhibiting the issue), so I'm okay with this not being a release blocker. We're going to have to figure something out, but I think it's fine to release without a fix for this. Maybe we just add something to the documentation to point out the problem for now and note that we're working on it?
I agree that it is not a release blocker...the improvements to the individual drivers are (in my opinion) important to release whilst we improve the situation for using multiple drivers in one process. I do have an x86 Mac locally that I might be able to use to iterate on some of this...in R we're using the c-static build mode instead of the c-shared one (but I assume we would have the same issue there).
@paleolimbot depending on how R loads the module and its dependencies, it might not be an issue with the static libs. If you load a bunch of static libraries, depending on how they are built, you might be able to pull in only one copy of the Go runtime. Potentially.
I agree with moving forward with the release. Would it make sense to delay the 1.0.0 release until this is resolved, and instead release as 0.12.0?
@joellubi I don't think it's necessary to delay the 1.0.0 release of the couple of drivers we're doing so for. The issue is currently effectively limited to amd64 macOS, and in the most common use cases users are more likely to switch between drivers than to use multiple drivers at the same time. With a doc update saying it's a known issue, I'm fine with moving forward as is.
I can reproduce this in R as well (but only on my very old x86 laptop):

```r
library(adbcdrivermanager)

flightsql <- adbc_database_init(
  adbcflightsql::adbcflightsql(),
  uri = "grpc+tcp://localhost:8080"
)
snowflake <- adbc_database_init(adbcsnowflake::adbcsnowflake())
```
I tried adding linker flags to suppress the CGO symbols:

```
-Wl,-exported_symbol,_adbcflightsql_c_flightsql -Wl,-exported_symbol,_R_init_adbcflightsql
-Wl,-exported_symbol,_adbcsnowflake_c_adbcsnowflake -Wl,-exported_symbol,_R_init_adbcsnowflake
```

...and checked `nm --extern-only src/*.so` to make sure it worked, but I still get a crash.
I wonder, are you able to get a backtrace from lldb? Under rosetta it appears the Go runtime fails to initialize entirely (it appears to get stuck holding a lock) so I wasn't able to do so from an M1 machine.
Sure!
The non-lldb backtrace from the R snippet above is:

The visual version of what pops up in VSCode when I attach a debugger and trigger the crash (the file is `local/go/src/runtime/chan.go`):

The result of `bt all`:
> @cocoa-xu this will also affect BigQuery...it seems both C++ and Go have packaging issues here...

It appears to be so...
I have an x86_64 macOS machine, so please let me know if there's anything I can help with @lidavidm
> @cocoa-xu this will also affect BigQuery...it seems both C++ and Go have packaging issues here...
So I tried the following code on my x86_64 macOS and it had the same crash:

```python
import adbc_driver_snowflake.dbapi
import adbc_driver_bigquery.dbapi

adbc_driver_snowflake.dbapi.connect()
adbc_driver_bigquery.dbapi.connect()
```
```
runtime.throw({0x18223ac23?, 0x1c0000061a0?})
	/Users/runner/hostedtoolcache/go/1.21.8/x64/src/runtime/panic.go:1077 +0x5c fp=0x1c00069be88 sp=0x1c00069be58 pc=0x18080387c
runtime.sigpanic()
	/Users/runner/hostedtoolcache/go/1.21.8/x64/src/runtime/signal_unix.go:845 +0x3e9 fp=0x1c00069bee8 sp=0x1c00069be88 pc=0x180819c09
runtime.(*waitq).dequeue(...)
	/Users/runner/hostedtoolcache/go/1.21.8/x64/src/runtime/chan.go:781
runtime.closechan(0x1c00010e000)
	/Users/runner/hostedtoolcache/go/1.21.8/x64/src/runtime/chan.go:380 +0x15d fp=0x1c00069bf40 sp=0x1c00069bee8 pc=0x1807d413d
runtime.main()
	/Users/runner/hostedtoolcache/go/1.21.8/x64/src/runtime/proc.go:256 +0x28f fp=0x1c00069bfe0 sp=0x1c00069bf40 pc=0x18080626f
runtime.goexit()
	/Users/runner/ho
```
The frame `runtime.closechan(0x1c00010e000)` is `close(main_init_done)` in the source, and `main_init_done` is a global. I wonder if it is somehow getting shared between the runtimes on this platform?
Would .NET have the same issues as Go above, with multiple shared libraries and hence multiple runtimes in one process?
No, but see https://github.com/dotnet/runtime/discussions/102048
EDIT: if you follow the links from that reply, you get a pretty sobering picture about the viability of loading multiple mid-to-heavy weight runtimes into the same process. I guess it's back to Rust, then ;).
> I guess it's back to Rust, then ;).
Not quite sure if this is related, but it seems that we don't have an official port of gRPC for Rust yet: https://github.com/grpc/grpc/issues/9316
There are some open-source gRPC implementations used in production, but I'm not sure whether they can be loaded from different shared libraries without issues like this one, or the similar issue with C++.
> we don't have an official port of gRPC for Rust yet
Wow, I definitely didn't expect that. It looks like Tonic is fairly widely used, though. Apparently, there's no shortage of reasons for why it's so hard to move away from C as an implementation language.
As a thought experiment, let's say that we wanted to implement a bunch of drivers in some (uniform) non-C language and then consume them via the C API from Python.
For Go, the implication is that all the Go drivers would need to be bundled into a single cgo-based dynamic library in order to ensure no conflicts between them. It also implies that there could be problems trying to load any non-ADBC extension which also happened to be implemented in Go.
For C# with AOT compilation, there wouldn't be direct conflicts between the individual drivers, but there would be an increasingly heavy burden in terms of thread and address space consumption. The larger problem would be that a lot of existing code doesn't work with AOT compilation. This seems to cover a good chunk of what's needed to connect to BigQuery. The resource consumption issue could also be resolved by bundling all the drivers together into a single dynamic library.
For Java and for C# without AOT compilation, it would require the user to have the corresponding runtime installed on their machine. All the drivers implemented in those languages could then be dynamically loaded into that runtime and would share a single set of runtime features like the garbage collector. They could then also share dependencies for an overall smaller footprint -- but of course this has the corresponding drawback that isolation between dependencies is more difficult. There would also need to be a way to dynamically register with a shared ADBC driver manager in-memory, because in this scenario there wouldn't be native entry points available for the drivers.
Feel free to add your favorite language to the list...
One other nuclear option is to spawn a sacrificial process for each driver, which loads the driver and communicates with the main process over some sort of IPC/shared memory. At least for the data path this should be relatively OK as Arrow's shared memory support shapes up.
> There would also need to be a way to dynamically register with a shared ADBC driver manager in-memory, because in this scenario there wouldn't be native entry points available for the drivers.
FWIW, C# without AOT compilation could use something like DNNE to create the native entrypoints. DNNE allows building native libraries with proper exports. Behind the scenes, the native library uses the .NET runtime hosting APIs to activate the runtime and call into managed code.
> EDIT: if you follow the links from that reply, you get a pretty sobering picture about the viability of loading multiple mid-to-heavy weight runtimes into the same process. I guess it's back to Rust, then ;).
As a sidenote, these problems are not unique to mid-to-heavyweight runtimes. Any time one loads a dynamic library with a different memory allocator, these problems can show up. They stay hidden only as long as everyone agrees on using malloc/free from the same C runtime library. As soon as libraries start being fancy and each brings its own tcmalloc/jemalloc/whatever (and they often do), you have multiple heaps with their own heuristics for expanding and giving memory back to the OS. (E.g. I don't know how Rust manages memory, but it's very likely each Rust DLL would have its own heap as well.)
I've been doing some research into this and came across the following library: gopy. It's meant to generate and compile CGO and CPython extensions from a Go package. What's specifically interesting about it is the unique-handle approach it uses to avoid passing Go pointers directly to C. There's a brief description in the README and the actual handle implementation is in `handler.go`.
I'm still wrapping my head around what a specific implementation would look like, but I wonder if we could adopt a similar approach to isolate pointers within their respective Go runtimes. The actual memory we do want to share (arrays) should be allocated in C anyway via the CGO Mallocator, and with an approach like this all Go pointers are only accessed within the Go runtime via stub methods that delegate the actions to be done.
It's not super obvious how it does all this just by looking at the source code, but if you look at the code generated by running the "command line example" in the readme it makes the mechanism clearer.
I'll definitely try to take a look at that sometime this week. It's an interesting idea, I'm just not sure whether or not it'll solve our problem here. If I look closely at our existing implementation, we do utilize `cgo.Handle` to avoid passing Go pointers directly to C. So unless I'm missing something, I'm pretty sure we're already doing this isolation (which may be why we're not seeing issues on other platforms, only on macOS?).
> why we're not seeing issues on other platforms, only on macOS?

It's worth checking `clang` + `libc++` on x86 Linux, too (e.g., `docker run --rm ghcr.io/r-hub/containers/clang18`). It is a pretty exotic platform to find in the wild, but it would rule out whether this is a macOS problem or a clang/libc++ problem.
> (E.g. I don't know how Rust manages memory, but it's very likely each Rust DLL would have its own heap as well).
When you're building a DLL, the Rust allocator defaults to malloc on Unix platforms and HeapAlloc on Windows.
FYI: I'd like to use the DuckDB and Postgres ADBC drivers from Go. Not sure it's connected, but in testing I also get a SIGSEGV when reading from a query result. So +1 from me if this can be supported.
If you're trying to use the drivers from Go, then it's unlikely to be related to this issue. Can you file a new issue for that SIGSEGV?
Just to keep everyone apprised here, I'm gonna try posting in the #cgo and #darkarts channels on the Gopher slack to see if anyone might know of a workaround or other way to avoid this crash. Even though it is still not optimal to end up with multiple copies of the Go runtime in a single process, it would still be good to at least get it to not crash.
@zeroshade was there any feedback from those channels?
Nothing useful unfortunately, just one reply that reiterated the fact that it's technically undefined behavior etc.
What happened?
When using both the snowflake driver and the flightsql driver in the same python program, the process crashes with a segmentation violation. Using one of these drivers at the same time as the postgres driver does not cause a crash, which makes it seem like there might be an issue in the common Go memory management of the snowflake and flightsql drivers.
How can we reproduce the bug?
Minimal Reproduction:
Result:
Environment/Setup
Reproduces with:
and:
Does NOT reproduce with:
However, as I've tried pushing against various edge cases, I have been able to get some segmentation violations with `0.10.0` too, but I haven't yet isolated the specific behavior that causes them in that version.