Syncleus / aparapi

The New Official Aparapi: a framework for executing native Java and Scala code on the GPU.
http://aparapi.com
Apache License 2.0

Kernel.dispose always fails in openjdk12 #153

Closed AlexanderFedyukov closed 4 years ago

AlexanderFedyukov commented 4 years ago

Every call to Kernel.dispose fails in openjdk-12 and in openjdk-11 with the following errors:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f9c2c4481ac, pid=21273, tid=21274
#
# JRE version: OpenJDK Runtime Environment (12.0.2+9) (build 12.0.2+9)
# Java VM: OpenJDK 64-Bit Server VM (12.0.2+9, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0xc201ac]  OopStorage::Block::release_entries(unsigned long, OopStorage*)+0x3c

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f8aebbacf4c, pid=23028, tid=23032
#
# JRE version: OpenJDK Runtime Environment (11.0.4+11) (build 11.0.4+11)
# Java VM: OpenJDK 64-Bit Server VM (11.0.4+11, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0xc25f4c]  OopStorage::Block::release_entries(unsigned long, OopStorage::Block* volatile*)+0x3c

In openjdk-8 it works well.

CoreRasurae commented 4 years ago

@AlexanderFedyukov Please detail your full system configuration: Linux distribution, kernel version, libdrm version, mesa version, GPU. I've been using Ubuntu 18.04 LTS, kernel 4.15.0, libdrm 2.4.97 and mesa 19.0.8 with OpenJDK 11.0.4, and kernel.dispose() causes no issue.

AlexanderFedyukov commented 4 years ago

I can gather detailed platform info later, but I suspect the cause of the issue is in the code. I'll prepare and publish a sample.

AlexanderFedyukov commented 4 years ago

@CoreRasurae, can you check this sample from aparapi-examples? It fails with the same error too.

AlexanderFedyukov commented 4 years ago

My system is Fedora 30, kernel 5.2.8-200.fc30.x86_64.
Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon RX 460/560D / Pro 450/455/460/555/555X/560/560X] (rev e5).
JVM: OpenJDK 64-Bit Server VM 19.3 (build 12.0.2+9, mixed mode, sharing).

CoreRasurae commented 4 years ago

@AlexanderFedyukov I tried the sample you provided and it does replicate the problem. It happens with all aparapi-jni versions that I've tried: 1.2.0, 1.3.1, 1.4.0, 1.4.1. I'll give it a look...

freemo commented 4 years ago

We should probably work this into a unit test as a first step. If you aren't already doing that, I can find some time this week. Then we can try to fix it.

On Tue, Aug 20, 2019 at 6:59 PM CoreRasurae notifications@github.com wrote:

@AlexanderFedyukov I tried the sample you provided and it does replicate the problem. It happens with all aparapi-jni that I've tried, 1.2.0, 1.3.1, 1.4.0, 1.4.1. I'll give it a look...

CoreRasurae commented 4 years ago

The root cause of this bug is a leftover from the original, unfinished Aparapi sketch. The segmentation fault occurs when a multidimensional array is present (i.e. dimension > 1) and JNIContext::dispose() from aparapi-native is called. In that case JNIContext::dispose() tries to do: jenv->DeleteWeakGlobalRef((jweak) arg->aparapiBuffer->javaObject); but there is no corresponding call to NewWeakGlobalRef(...). For 2D and 3D arrays there is no need to access data across JNI calls, so no WeakGlobalRef is allocated in the first place, and the delete ends up being a free without a matching allocation.
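
To make the unbalanced delete concrete, here is a minimal sketch in plain Java. It is only a toy model: `WeakRefTableModel`, `create`, `delete` and `liveCount` are hypothetical names invented for illustration, not JNI or Aparapi APIs, and the real table lives inside the JVM. The model throws an exception where the real JVM crashed with the SIGSEGV shown in the logs above.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy stand-in for the JVM's weak global reference storage (hypothetical, not a JNI API). */
final class WeakRefTableModel {
    private final Set<Long> live = new HashSet<>();
    private long next = 1;

    /** Models NewWeakGlobalRef: hands out a handle and records it as live. */
    long create() {
        long handle = next++;
        live.add(handle);
        return handle;
    }

    /** Models DeleteWeakGlobalRef: only valid for a handle that create() returned. */
    void delete(long handle) {
        if (!live.remove(handle)) {
            throw new IllegalStateException("delete without matching create: " + handle);
        }
    }

    /** Number of handles that were created and not yet deleted. */
    int liveCount() {
        return live.size();
    }

    public static void main(String[] args) {
        WeakRefTableModel table = new WeakRefTableModel();

        // 1D path: a reference is created at setup and deleted at dispose. Balanced.
        long oneD = table.create();
        table.delete(oneD);

        // 2D/3D path as described in the thread: nothing was created at setup,
        // but dispose still asks for a delete. The value 0 is an arbitrary placeholder.
        long neverCreated = 0L;
        try {
            table.delete(neverCreated);
        } catch (IllegalStateException e) {
            System.out.println("unbalanced delete detected: " + e.getMessage());
        }
    }
}
```

The point of the model is only the invariant: every delete must correspond to exactly one earlier create for the same handle.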

The call mimics what is done for 1D arrays with arg->arrayBuffer->javaArray. However, 2D and 3D arrays are handled differently: Java does not allocate contiguous memory for multidimensional arrays, so Aparapi has to treat them in a different manner.
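
The point about non-contiguous memory can be seen from plain Java. The sketch below shows that a `float[][]` is an array of references to independent row arrays, and then copies it into one contiguous row-major buffer. `JaggedArrayDemo` and `flatten` are hypothetical names, and the copy is only one possible way to obtain contiguous data; it is not claimed to be what aparapi-native does.

```java
/** Illustrates that Java 2D arrays are arrays of separate row objects. */
final class JaggedArrayDemo {

    /** Copies a rectangular 2D array into one contiguous row-major 1D array. */
    static float[] flatten(float[][] src) {
        int rows = src.length;
        int cols = rows == 0 ? 0 : src[0].length;
        float[] flat = new float[rows * cols];
        for (int r = 0; r < rows; r++) {
            if (src[r].length != cols) {
                throw new IllegalArgumentException("row " + r + " has a different length");
            }
            // Each row is copied separately because each row is its own heap object.
            System.arraycopy(src[r], 0, flat, r * cols, cols);
        }
        return flat;
    }

    public static void main(String[] args) {
        float[][] grid = new float[2][3];

        // The outer array only stores references; the rows are distinct objects.
        System.out.println("rows are distinct objects: " + (grid[0] != grid[1]));

        // A row can be swapped for an array of another length, so the layout
        // cannot be assumed to be one rectangular block.
        grid[1] = new float[5];
        System.out.println("row lengths: " + grid[0].length + ", " + grid[1].length);

        float[] flat = flatten(new float[][] {{1f, 2f}, {3f, 4f}});
        System.out.println("flattened length: " + flat.length);
    }
}
```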

The fix for this issue involves only aparapi-native.

CoreRasurae commented 4 years ago

Correction: there are two different ways to fix the bug:
a) Remove all references to the Java Object, by making aparapi-native retrieve the current address of the buffer, which shouldn't change between execution() and result retrieval(), despite crossing more than one JNI call.
b) The one I ended up implementing: make sure NewWeakGlobalRef(...) is called when a multidimensional array is provided as a Kernel argument, so that the original Java array object can be retrieved at a later time, during a subsequent JNI call.
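
As a rough illustration of option b), the sketch below uses `java.lang.ref.WeakReference` as a Java-side analogue of a JNI weak global reference. The reference is taken at setup for every array argument regardless of its rank, so a later call can find the original object again and dispose releases exactly what setup acquired. `KernelArgModel`, `setUp`, `resolve` and `dispose` are hypothetical names, not Aparapi or aparapi-native code, and the null check in `dispose` is an extra defensive choice of this sketch rather than something described in the thread.

```java
import java.lang.ref.WeakReference;

/** Hypothetical Java-side analogue of a kernel argument record; not an Aparapi class. */
final class KernelArgModel {
    // Analogue of the weak global reference held for the argument's Java object.
    private WeakReference<Object> javaObject;

    /** Setup: take the weak reference for every array argument, whatever its rank. */
    void setUp(Object array) {
        if (!array.getClass().isArray()) {
            throw new IllegalArgumentException("not an array");
        }
        javaObject = new WeakReference<>(array);
    }

    /** Later call: recover the original array, or null if it was collected or disposed. */
    Object resolve() {
        return javaObject == null ? null : javaObject.get();
    }

    /** Dispose: release what setUp acquired; a no-op when nothing was acquired. */
    void dispose() {
        if (javaObject != null) {
            javaObject.clear();
            javaObject = null;
        }
    }

    public static void main(String[] args) {
        float[][] matrix = new float[4][4]; // multidimensional kernel argument
        KernelArgModel arg = new KernelArgModel();

        arg.setUp(matrix);
        System.out.println("same object after setup: " + (arg.resolve() == matrix));

        arg.dispose();
        System.out.println("released after dispose: " + (arg.resolve() == null));
    }
}
```

A weak reference fits here because it lets a later call locate the array without keeping the array alive on its own.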

AlexanderFedyukov commented 4 years ago

Sorry, your solution is not clear to me. Can you clarify: does a workaround exist?

CoreRasurae commented 4 years ago

@AlexanderFedyukov A new version of Aparapi JNI is on its way, which will solve the issue.

freemo commented 4 years ago

fixed

AlexanderFedyukov commented 4 years ago

Good news! Thanks a lot!