dask-contrib / dask-sql

Distributed SQL Engine in Python using Dask
https://dask-sql.readthedocs.io/
MIT License

bug: segfault when performing dask sql queries #540

Open · brightsparc opened this issue 2 years ago

brightsparc commented 2 years ago

What happened:

Having upgraded to dask-sql 2022.4.1, I have been seeing a number of segfaults that appear to be related to the JPype Java interop.

What you expected to happen:

I was expecting dask-sql to be stable, but I need this upgraded version to support case-insensitive queries with Dask.

Minimal Complete Verifiable Example:

I have yet to reproduce this with a concise example, but I have included the segfault log below in case it helps.
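For context, the failing workloads look roughly like the sketch below. The table, columns, and query are placeholders rather than the real schema, and this exact snippet has not been confirmed to crash; it only shows the kind of dask-sql usage (including the case-insensitive column references that motivated the upgrade) involved.

import pandas as pd
import dask.dataframe as dd
from dask_sql import Context

# Placeholder data; the real tables and columns differ.
df = dd.from_pandas(pd.DataFrame({"Name": ["a", "b"], "Value": [1, 2]}), npartitions=1)

c = Context()
c.create_table("my_table", df)

# Lowercase identifiers against mixed-case columns illustrate the
# case-insensitive lookup the 2022.4.1 upgrade was needed for.
result = c.sql("SELECT name, value FROM my_table").compute()
print(result)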

---------------  S U M M A R Y ------------

Command Line: -ea --illegal-access=deny 

Host: MacBookPro18,3 arm64 1 MHz, 10 cores, 32G, Darwin 21.4.0
Time: Fri Apr 29 20:24:41 2022 AEST elapsed time: 0.097347 seconds (0d 0h 0m 0s)

---------------  T H R E A D  ---------------

Current thread (0x0000000122c6c800):  JavaThread "Python Reference Queue" daemon [_thread_in_native, id=42243, stack(0x00000002aee70000,0x00000002af073000)]

Stack: [0x00000002aee70000,0x00000002af073000],  sp=0x00000002af072730,  free space=2057k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [python3.8+0x1bbacc]  collect+0x200
C  [python3.8+0x1bcf4c]  PyGC_Collect+0x9c
C  [_jpype.cpython-38-darwin.so+0x22368]  _ZN19JPGarbageCollection9triggeredEv+0x34
C  [_jpype.cpython-38-darwin.so+0x310c8]  Java_org_jpype_ref_JPypeReferenceNative_wake+0x18
j  org.jpype.ref.JPypeReferenceNative.wake()V+0
j  org.jpype.ref.JPypeReferenceQueue$Worker.run()V+42
j  java.lang.Thread.run()V+11 java.base@11.0.9.1
v  ~StubRoutines::call_stub
V  [libjvm.dylib+0x329d14]  _ZN9JavaCalls11call_helperEP9JavaValueRK12methodHandleP17JavaCallArgumentsP6Thread+0x2e0
V  [libjvm.dylib+0x329074]  _ZN9JavaCalls12call_virtualEP9JavaValueP5KlassP6SymbolS5_P17JavaCallArgumentsP6Thread+0xec
V  [libjvm.dylib+0x32913c]  _ZN9JavaCalls12call_virtualEP9JavaValue6HandleP5KlassP6SymbolS6_P6Thread+0x64
V  [libjvm.dylib+0x3c0094]  _ZL12thread_entryP10JavaThreadP6Thread+0x78
V  [libjvm.dylib+0x6d55b8]  _ZN10JavaThread17thread_main_innerEv+0x80
V  [libjvm.dylib+0x6d53e4]  _ZN10JavaThread3runEv+0x17c
V  [libjvm.dylib+0x6d3104]  _ZN6Thread8call_runEv+0x78
V  [libjvm.dylib+0x57f96c]  _ZL19thread_native_entryP6Thread+0x13c
C  [libsystem_pthread.dylib+0x726c]  _pthread_start+0x94

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  org.jpype.ref.JPypeReferenceNative.wake()V+0
j  org.jpype.ref.JPypeReferenceQueue$Worker.run()V+42
j  java.lang.Thread.run()V+11 java.base@11.0.9.1
v  ~StubRoutines::call_stub

siginfo: si_signo: 11 (SIGSEGV), si_code: 2 (SEGV_ACCERR), si_addr: 0x000000016b47b4b8

Register to memory mapping:

 x0=0x000000016b704030 is pointing into the stack for thread: 0x0000000112b83800
 x1=0x00000002af072880 is pointing into the stack for thread: 0x0000000122c6c800
 x2=0x00000002af072878 is pointing into the stack for thread: 0x0000000122c6c800
 x3=0x0 is NULL
 x4=0x0000000000006403 is an unknown value
 x5=0x0 is NULL
 x6=0x0000000000000001 is an unknown value
 x7=0x00000000004c4758 is an unknown value
 x8=

Anything else we need to know?:

This happens both locally on mac arm, and also in ubuntu linux build runners.

Environment:

Thrameos commented 2 years ago

The backtrace shows a referencing issue on the Python side. I'm not sure which objects trigger it. It is clearly multithreaded, so perhaps it is a reference-handling issue in which a temporary ref count was dropped, resulting in incorrect collection and reuse of a live object. It doesn't have to be JPype code itself, though JPype is certainly suspect. If there is a clear way to simplify and replicate this, it could be posted to the JPype issue tracker, though these problems are often very difficult to trace unless one can isolate what is happening at the time, so that we can narrow it down to a specific path.
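Since the trace shows the crash inside collect / PyGC_Collect, invoked from JPype's GC hook, one rough diagnostic (a sketch only, not a fix; run_queries is a placeholder for the actual dask-sql workload) would be to disable automatic Python garbage collection and trigger it at controlled points, to see whether the crash follows the collection timing:

import gc

# Diagnostic sketch: if the segfault disappears with automatic GC disabled
# and reappears when a collection is forced, that points at the GC/JPype
# interaction shown in the backtrace.
gc.disable()      # stop automatic generational collections
run_queries()     # placeholder for the dask-sql workload that crashes
gc.collect()      # force a full collection at a known point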

ayushdg commented 2 years ago

Thanks for raising the issue! As @Thrameos mentioned, it can be difficult to track down the source of these kinds of errors without a reproducer. @brightsparc, would it be possible to also include the jpype, dask, and distributed versions from an earlier environment where these segfaults weren't seen? Also, are you using a LocalCluster to parallelize the dask computations, or is this just with a dask-sql Context?
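For reference, the requested version information could be gathered with something like the following (the strings are assumed to be the PyPI distribution names, e.g. JPype1 for the jpype module):

from importlib.metadata import version

# Print installed versions of the packages relevant to this issue.
for pkg in ("dask-sql", "dask", "distributed", "JPype1"):
    print(pkg, version(pkg))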