Closed Berstanio closed 10 months ago
Whoooops, I didn't realised that all the code comments are on the top of the class... I'm sorry!
- On setCapabilities:
* If necessary, a jniNativeInterface copy is created and injected to the current thread (JavaThread::_jni_environment points to the copy).
* A pointer to the capabilities function pointer array is set to jniNativeInterface::reserved3.
So yeah, it works as I expected it to do.
Now I just have a follow up question, do you know, whether there is somewhere any documentation to the "reserved3" field in the struct? I couldn't find any.
Hey @Berstanio,
There's no documentation, it's just some reserved space before the function pointers. It was never used for anything and reserved0-3
are still unused up to the latest OpenJDK commit at the time I'm writing this.
However, there's no denying that it's a nasty hack and I plan to get rid of it in LWJGL 4.
The problem with OpenGL, OpenGL ES and OpenAL Soft (with ALC_EXT_thread_local_context
) remains, i.e. you cannot expose a static interface without some kind of thread-local lookup per method call. So my plan for LWJGL 4 is to propose moving these bindings to a non-static interface. If the community agrees, calls like glGetIntegerv
will become gl.GetIntegerv
, where gl
will be an instance that carries information about the current OpenGL context (GLCapabilities
will be baked into it too). Users will be responsible for passing it around their code (they can still put it in a thread-local for convenience, if they choose to do so). This is similar to how other OpenGL bindings work (e.g. JOGL).
Hey @Spasi , thank you very much on the insights! What I'm wondering is, does OpenAL soft also use reserved3 for the thread local context? If yes, how do the OpenGL/ES and the OpenAL threadlocal pointer not interfere, if they are writing to the same reserved3 field?
Also, as I understand it, the thread dependent lookup is only needed, if we render on multiple different threads, right? So if I don't get proper "reserved3" support into substratevm native-image, could I just fork of the natives, make the capabilities function pointer array global and use that, if I impose the limitation, that multi thread rendering is not allowed? Or do I misunderstand something?
Hey @Berstanio,
Your understanding is correct, it would have to use reserved2
for example to avoid the conflict. However, the OpenAL bindings do not use the JNIEnv
trick. It is not necessary, because there's a single implementation (OpenAL Soft), providing a fixed set of capabilities. The available function pointers will be the same in any thread/context. In the extremely unlikely scenario of multiple implementations, there's a fallback to a thread-local lookup. See how this is detected & handled here: https://github.com/LWJGL/lwjgl3/blob/master/modules/lwjgl/openal/src/main/java/org/lwjgl/openal/AL.java#L234.
In OpenGL's case, having different contexts is not so uncommon. You can have a single GPU/driver but contexts with different capabilities (core vs compatibility, forward compatibility enabled or not). You can have multiple GPUs from different vendors or different generations from the same vendor. Multithreading is not even a requirement for a conflict to arise, you can have a single rendering thread binding multiple OpenGL contexts.
Anyway, I'm not sure why Native Image would have trouble with the JNIEnv
trick and I would be interested to find out if you manage to figure it out. Even though what LWJGL does is horrible, it's technically sane and it would imply that Native Image breaks the JNI specification if it doesn't work there. (I could be wrong though)
As for a workaround, if you're going down the fork route: Just drop the natives (i.e. lwjgl_opengl.dll/so/dylib
) and change the generated code for the OpenGL bindings to use a Java-side lookup. This is very simple, just change JNI_CAPABILITIES
to JAVA_CAPABILITIES
on this line: https://github.com/LWJGL/lwjgl3/blob/master/modules/lwjgl/opengl/src/templates/kotlin/opengl/GLBinding.kt#L29 and regenerate the bindings. This will revert the OpenGL bindings to work exactly like OpenAL, with the same "incompatible context" detection. If you only create a single context, or if every context you create is compatible with each other (same capabilities & same function pointers), then the lookup will be very cheap (no TLS involved).
Hey @Spasi , thank you for the clarifications on OpenGL an OpenAL!
Anyway, I'm not sure why Native Image would have trouble with the JNIEnv trick and I would be interested to find out if you manage to figure it out.
I'm not quite sure yet too, my first guess would be, that the struct is static in the executable data section or something like that. But this is just a guess. I opened a graal issue about it, feel free to follow/participate on the ticket if you like! https://github.com/oracle/graal/issues/6391
Until the issue is properly sorted out on the graal end, your workaround will be very helpful to further test around with native image and lwjgl3, thank you a lot for pointing me out to this!
Hey @Spasi ,
I was able to create substitutions for the OpenGL classes, and now native-images runs fine on linux and mac without a fork, thanks a lot again 🥳 https://github.com/Berstanio/gdx-graalhelper/blob/dev/backends/lwjgl3/src/main/java/org/lwjgl/opengl/Lwjgl3NativeSubstitutions.java https://github.com/Berstanio/gdx-graalhelper/blob/dev/backends/lwjgl3/src/main/java/com/anyicomplex/gdx/lwjgl3/svm/Lwjgl3Substitutions.java
However, windows is still a problem, since it also uses reserved2. https://github.com/oracle/graal/issues/6428#issuecomment-1510387011 Since I couldn't think of a easy workaround, I wanted to ask you, whether one comes to your mind?
Also, it seems lucky that linux/mac currently works, since it seems there are also code pathes that want to access the EnvData in reserved2.
Hey @Berstanio,
That one won't be as easy to fix and indeed, it affects Linux/macOS (errno
) as well as Windows (LastError
).
You'll have to revert saveErrno
, saveLastError
and the corresponding getter functions to the old implementation, which used native TLS for storage. See this commit for details.
Btw, got any insight to what may be going wrong with Native Image? Is it using the reserved
fields for something internal? All four of them?
Hey @Spasi ,
That one won't be as easy to fix
That was what I was afraid of. I don't really want to build my own natives for every platform etc.
Buut, I have some rather good news:
Btw, got any insight to what may be going wrong with Native Image?
Yes, after some digging I think I have now some understanding about it! native-image also doesn't do anything with these fields. However, I inspected the location of the JNINativeInterface_ struct with vmmap and got:
---> mapped file 101a00000-101d00000 [ 3072K 704K 0K 0K] r--/rwx SM=COW Object_id=3198563b
mapped file 101d00000-102024000 [ 3216K 1008K 624K 0K] rw-/rwx SM=COW Object_id=2b8bfa3b
This is the result on a "default" execution of native-image. By default native-image has Isolates activated, which seems to mmap the static heap per isolate. That seems to be, what we see here. It gets a bit clearer when deactivating Isolates:
---> __DATA 102e64000-1032b4000 [ 4416K 912K 0K 0K] r--/rw- SM=COW /Users/USER/*/ReservedSegfault-1.0-SNAPSHOT_NO_ISOLATES
__DATA 1032b4000-103588000 [ 2896K 464K 368K 16K] rw-/rw- SM=COW /Users/USER/*/ReservedSegfault-1.0-SNAPSHOT_NO_ISOLATES
So, the JNINativeInterface_ struct lies in a section of the binary, that was made read only on purpose by native-image.
The easiest way I found to work around this is to disable Isolates and disable the "Read-Only" property of the __DATA section, by passing:
-H:+ForceNoROSectionRelocations -H:-SpawnIsolates
as a native-image parameter.
This way, a write to the JNINativeInterface struct doesn't segfault. However, because "reserved0-3" seem to be initialized with garbage at the start (not NULL), ThreadLocalUtils#setupEnvData needs to be called manually.
This is all not ideal, but it works good enough that I can use it, until graal has a proper fix for being able to write to reserved0-3.
What I'm wondering a bit now is though, whether the native-image JNINativeInterface struct is even thread local, if it seems to be hard coded in the __DATA section? I guess with Isolates it is, but not sure whether without.
Anyway, thanks for all your insights and help on this!
One last thing I figured I might should mention, it may be that Espresso Truffle (a graal Java interpreter) uses the reserved0-2 fields internally, just so that you are aware. But I guess, no one runs java games through a interpreter, so shouldn't be any problem.
Hey @Berstanio,
Thanks for the interesting information! Some feedback that might help:
JNINativeInterface_
struct allocated by the JVM, which is shared by all threads. It never changes, unless -Xcheck:jni
is enabled, or new JNI functions are injected with JVMTI.JNIEnv _jni_environment
field in the JavaThread
class. This is a pointer to the JNINativeInterface_
struct. It is what enables this whole thing to work.reserved0-3
of JNINativeInterface_
structs allocated by the JVM. Instead, it stores the JVM struct stored in _jni_environment
(so that it can restore it later), allocates a new JNINativeInterface_
struct, copies the function pointers from the JVM struct, then sets _jni_environment
so that it points to the new struct. The reserved0-3
fields are only changed on the LWJGL-allocated struct, which should always be in accessible memory. So, if something fails on Native Image, I think it's the write to *JNIEnv
.I cloned Graal and examined the code a bit, but couldn't find a better explanation. I'm not at all familiar with that codebase, so you might have more luck than me.
What I did find though is why reserved0-3
have non-NULL
, seemingly garbage data. Graal initializes the JNINativeInterfaces_
struct by setting all fields (included the reserved ones) to a function called UnimplementedWithJNIEnvArgument
, which simply kills the JVM when called. It then sets the actual JNI function pointers, depending on the supported JNI version. I think we might be able to work around this by comparing reserved3
to reserved0
instead of NULL
?
Oh no, I forgot about: https://github.com/LWJGL/lwjgl3/blob/4f9767cd5eb87c0e682ec5107b3601bc409d93a9/modules/lwjgl/core/src/main/java/org/lwjgl/system/ThreadLocalUtil.java#L177
@Berstanio could you please confirm that this is where it crashes on Native Image (with Isolates enabled)?
It would be good news if that is indeed the case. We don't have to use FUNCTION_MISSING_ABORT
at all, it is just a "friendlier" way to crash the JVM in programs that are already broken.
Hey @Spasi ,
First of all, thanks for all the clarifications and insights! Well, I have weird findings to report and I honestly don't know, whether I'm missing something completly obvious. On mac, it now just works. Without any modification, without any extra command line parameters, it just runs flawless. I have not upgraded my graal version, I have made sure, that all the correct code is pulled in. And I know this was broken on mac! mac was the first platform I started testing on and debugging this error. It is very strange and to this point I'm still not convinced, that I don't miss something obvious.
Windows there seems to be more "reasonable" working. If I patch out the "ThreadLocalUtil#setFunctionMissingAddresses" function and run ThreadLocalUtil#setupEnvData on startup manually (cause the reserved2 == null check is broken), it runs fine there. This is believable imo.
Graal initializes the JNINativeInterfaces_ struct by setting all fields (included the reserved ones) to a function called UnimplementedWithJNIEnvArgument, which simply kills the JVM when called. It then sets the actual JNI function pointers, depending on the supported JNI version. I think we might be able to work around this by comparing reserved3 to reserved0 instead of NULL?
This is a nice finding that I totally missed! The workaround is probably not ideal, but it should work!
Also on another note, should I pr my lwjgl3 reflection/jni/substituions rules for native-image? I was not sure, whether lwjgl3 will even get a next release or whether the next release will be lwjgl4?
Hi @Spasi ,
just a little follow up question about this:
I think we might be able to work around this by comparing reserved3 to reserved0 instead of NULL?
It would be great to have this inside lwjgl3, so I was wondering whether I should try and build a pr, or whether you want to take care of this?
Hey @Berstanio,
I've had GraalVM downloaded and waiting on my mac for the past month... I meant to do some testing on my own before pushing fixes, but other stuff kept me super busy, then I completely forgot about it!
I'll make sure the next 3.3.3 snapshot has the reserved3
fix. You're welcome to open PRs for other stuff. However, if we also deal with setFunctionMissingAddresses
, I think everything else should work as is.
Hey @Spasi ,
I'll make sure the next 3.3.3 snapshot has the reserved3 fix.
Thank you very much, that sounds great!
However, if we also deal with setFunctionMissingAddresses, I think everything else should work as is.
Dealing with that should be easy, I would solve it like this:
https://github.com/Berstanio/gdx-graalhelper/blob/what/backends/lwjgl3/src/main/java/com/anyicomplex/gdx/lwjgl3/svm/Lwjgl3Substitutions.java
This way, the function gets only patched out for native-image
I plan to pr all my lwjgl3 native-image configuration, if I have the feeling they are good enough. If you are doing your own testing, you might want to take a look at the lwjgl3 specific configuration done here, it might save you some time: https://github.com/Berstanio/gdx-graalhelper/tree/what/backends/lwjgl3/src/main
Please note, that the "master" branch contains my old seemingly unneccessary substitutions, the "what" branch (named after the confusion I had, when stuff was just working out of nowhere), is the up-to date one.
Hey @Berstanio,
I just pushed the native-image
branch. I tested it with GraalVM CE 17-22.3.2 + Native Image on Linux and Windows and the LWJGL demos seem to be working fine.
build_native_image.bat/sh
) for convenient testing.-agentlib:native-image-agent
to produce the native-image configuration (for JNI, reflection and resources). The output had weird formatting and plenty of unnecessary stuff (I think). I cleaned it up and the config files are now under config/native-image
. This way I didn't have to use --features
with Feature
implementations, so the codebase has no dependency to GraalVM.Unsafe.allocateInstance
. I'll have to do more testing on this (has performance implications for escape analysis), but I don't think we can avoid it. Otherwise we'd have to declare the unsafeAllocated
property for all struct classes (there are literally thousands of them).setFunctionMissingAddresses
(now a no-op on Native Image).reserved
fields are initialized.Please let me know how it works for you.
Hey @Spasi ,
thank you very much for your work, I will try it out!
The biggest change (first commit) was to refactor struct & custom buffer classes to be allocated normally, instead of via Unsafe.allocateInstance. I'll have to do more testing on this (has performance implications for escape analysis), but I don't think we can avoid it. Otherwise we'd have to declare the unsafeAllocated property for all struct classes (there are literally thousands of them).
As I understand it, you don't like the idea of having Feature
in the lwjgl3 codebase? Because features can register these classes conditionally, based on reachability, which avoids declaring unsafeAllocated
thousands of times.
Also, I think this is not neccessary, since it is only called by JNI? https://github.com/LWJGL/lwjgl3/blob/ef498be4fcddd397bc457f7fce02e905f240848d/config/native-image/META-INF/native-image/reflect-config.json#L12-L23
As I understand it, you don't like the idea of having Feature in the lwjgl3 codebase?
Yeah, I mean, if we can just put these json files under META-INF
in lwjgl.jar, then it should just work for downstream users, right? We don't need a dependency to the GraalVM SDK and having to compile extra code.
Because features can register these classes conditionally, based on reachability, which avoids declaring unsafeAllocated thousands of times.
Not sure I understand this, could you post some example code?
Also, I think this is not neccessary, since it is only called by JNI?
I think it's necessary because we use reflection here:
and pass the method to JNI to set it up (with FromReflectedMethod
).
Not sure I understand this, could you post some example code?
Sure, I have done this in my code:
access.registerSubtypeReachabilityHandler((duringAnalysisAccess, aClass) ->
// Would only need to be one constructor I think, but w/e
RuntimeReflection.register(aClass.getDeclaredConstructors()),
org.lwjgl.system.Struct.class);
access.registerSubtypeReachabilityHandler((duringAnalysisAccess, aClass) ->
// Would only need to be one constructor I think, but w/e
RuntimeReflection.register(aClass.getDeclaredConstructors()),
org.lwjgl.system.StructBuffer.class);
This way, all structs are able to get allocated by unsafe, but only if they are reachable. This kind of conditional registration can't be currently done using the jsons, sadly.
I think it's necessary because we use reflection here: https://github.com/LWJGL/lwjgl3/blob/ef498be4fcddd397bc457f7fce02e905f240848d/modules/lwjgl/core/src/main/java/org/lwjgl/system/Callback.java#L97 and pass the method to JNI to set it up (with FromReflectedMethod).
Ahh, I see! Since it is a direct and clear reflection access, native-image is probably able to register that automatically based on analysis, so I never had a problem with not defining it.
registerSubtypeReachabilityHandler
Right, this does indeed apply the handler to all subclasses. However, we won't be needing it, I've now performance tested the refactored structs and there hasn't been a negative impact.
native-image is probably able to register that automatically based on analysis
You're correct, works fine without it.
I've updated the branch with various cleanups.
Right, this does indeed apply the handler to all subclasses. However, we won't be needing it, I've now performance tested the refactored structs and there hasn't been a negative impact.
Ahh, thats great to hear!
I've updated the branch with various cleanups.
Thank you very much, I tested them on mac and it works fine! Once the fixes are integrated into a snapshot, I can also do more complex tests on other platforms.
@Berstanio thank you for your patience and great feedback! The above have been integrated and will be available in the next 3.3.3 snapshot.
Thank you very much for implementing this, I highly appreciate it!
Disclaimer: I work on the GraalVM team.
Good to see that lwjgl3 is now compatible with GraalVM Native Image! If this is tested somewhere (e.g., using GitHub Actions), maybe the library could be added to the list at https://www.graalvm.org/native-image/libraries-and-frameworks/? Feel free to get in touch if you have any questions!
Hey @fniephaus,
There is now a GitHub Actions workflow that builds and tests LWJGL, then runs a simple demo with OpenJDK JIT, GraalVM JIT and GraalVM Native Image:
Afaik GraalVM cannot cross-compile, so only the Linux x64, macOS x64 and Windows x64 jobs build and run the demo.
Cool, thanks for the info, @Spasi. I think the library may qualify to listed as "fully tested" at https://www.graalvm.org/native-image/libraries-and-frameworks/. Click here for instructions what you'd need to do. :)
Question
Hey,
first of all I hope this is the right place to ask such question. I'm currently experementing with lwjgl3 and native-image. Currently I get a segault in ThreadLocalUtil#setCapabilities and I would like to understand the method first, before I proceed with e.g. opening a graal issue.
So, how I understand it
capabilities
a array of function pointer. This array will be later used to dispatch different GL functions.Then, in the line
memPutAddress(env_p + CAPABILITIES_OFFSET, capabilities);
, the function array will be put insidereserved3
field in the JNIEnv struct. From there it will be used to dispatch the functions.Is that correct?