LWJGL / lwjgl3

LWJGL is a Java library that enables cross-platform access to popular native APIs useful in the development of graphics (OpenGL, Vulkan, bgfx), audio (OpenAL, Opus), parallel computing (OpenCL, CUDA) and XR (OpenVR, LibOVR, OpenXR) applications.
https://www.lwjgl.org
BSD 3-Clause "New" or "Revised" License
4.67k stars 628 forks source link

How does ThreadLocalUtil#setCapabilities works? #875

Closed Berstanio closed 10 months ago

Berstanio commented 1 year ago

Question

Hey,

first of all I hope this is the right place to ask such question. I'm currently experementing with lwjgl3 and native-image. Currently I get a segault in ThreadLocalUtil#setCapabilities and I would like to understand the method first, before I proceed with e.g. opening a graal issue.

    public static void setCapabilities(long capabilities) {
        // Get thread's JNIEnv
        long env_pp = getThreadJNIEnv();
        long env_p  = memGetAddress(env_pp);

        if (capabilities == NULL) {
            if (env_p != JNI_NATIVE_INTERFACE) {
                memPutAddress(env_p + CAPABILITIES_OFFSET,
                    // FUNCTION_MISSING_ABORT table
                    memGetAddress(JNI_NATIVE_INTERFACE + CAPABILITIES_OFFSET));
            }
        } else {
            if (env_p == JNI_NATIVE_INTERFACE) {
                setupEnvData();
                env_p = memGetAddress(env_pp);
            }
            memPutAddress(env_p + CAPABILITIES_OFFSET, capabilities);
        }
    }

So, how I understand it capabilities a array of function pointer. This array will be later used to dispatch different GL functions.

struct JNINativeInterface {
    void*       reserved0;
    void*       reserved1;
    void*       reserved2;
    void*       reserved3;
[...]

Then, in the line memPutAddress(env_p + CAPABILITIES_OFFSET, capabilities);, the function array will be put inside reserved3 field in the JNIEnv struct. From there it will be used to dispatch the functions.

Is that correct?

Berstanio commented 1 year ago

Whoooops, I didn't realised that all the code comments are on the top of the class... I'm sorry!

    - On setCapabilities:
        * If necessary, a jniNativeInterface copy is created and injected to the current thread (JavaThread::_jni_environment points to the copy).
        * A pointer to the capabilities function pointer array is set to jniNativeInterface::reserved3.

So yeah, it works as I expected it to do.

Now I just have a follow up question, do you know, whether there is somewhere any documentation to the "reserved3" field in the struct? I couldn't find any.

Spasi commented 1 year ago

Hey @Berstanio,

There's no documentation, it's just some reserved space before the function pointers. It was never used for anything and reserved0-3 are still unused up to the latest OpenJDK commit at the time I'm writing this.

However, there's no denying that it's a nasty hack and I plan to get rid of it in LWJGL 4.

The problem with OpenGL, OpenGL ES and OpenAL Soft (with ALC_EXT_thread_local_context) remains, i.e. you cannot expose a static interface without some kind of thread-local lookup per method call. So my plan for LWJGL 4 is to propose moving these bindings to a non-static interface. If the community agrees, calls like glGetIntegerv will become gl.GetIntegerv, where gl will be an instance that carries information about the current OpenGL context (GLCapabilities will be baked into it too). Users will be responsible for passing it around their code (they can still put it in a thread-local for convenience, if they choose to do so). This is similar to how other OpenGL bindings work (e.g. JOGL).

Berstanio commented 1 year ago

Hey @Spasi , thank you very much on the insights! What I'm wondering is, does OpenAL soft also use reserved3 for the thread local context? If yes, how do the OpenGL/ES and the OpenAL threadlocal pointer not interfere, if they are writing to the same reserved3 field?

Also, as I understand it, the thread dependent lookup is only needed, if we render on multiple different threads, right? So if I don't get proper "reserved3" support into substratevm native-image, could I just fork of the natives, make the capabilities function pointer array global and use that, if I impose the limitation, that multi thread rendering is not allowed? Or do I misunderstand something?

Spasi commented 1 year ago

Hey @Berstanio,

Your understanding is correct, it would have to use reserved2 for example to avoid the conflict. However, the OpenAL bindings do not use the JNIEnv trick. It is not necessary, because there's a single implementation (OpenAL Soft), providing a fixed set of capabilities. The available function pointers will be the same in any thread/context. In the extremely unlikely scenario of multiple implementations, there's a fallback to a thread-local lookup. See how this is detected & handled here: https://github.com/LWJGL/lwjgl3/blob/master/modules/lwjgl/openal/src/main/java/org/lwjgl/openal/AL.java#L234.

In OpenGL's case, having different contexts is not so uncommon. You can have a single GPU/driver but contexts with different capabilities (core vs compatibility, forward compatibility enabled or not). You can have multiple GPUs from different vendors or different generations from the same vendor. Multithreading is not even a requirement for a conflict to arise, you can have a single rendering thread binding multiple OpenGL contexts.

Anyway, I'm not sure why Native Image would have trouble with the JNIEnv trick and I would be interested to find out if you manage to figure it out. Even though what LWJGL does is horrible, it's technically sane and it would imply that Native Image breaks the JNI specification if it doesn't work there. (I could be wrong though)

As for a workaround, if you're going down the fork route: Just drop the natives (i.e. lwjgl_opengl.dll/so/dylib) and change the generated code for the OpenGL bindings to use a Java-side lookup. This is very simple, just change JNI_CAPABILITIES to JAVA_CAPABILITIES on this line: https://github.com/LWJGL/lwjgl3/blob/master/modules/lwjgl/opengl/src/templates/kotlin/opengl/GLBinding.kt#L29 and regenerate the bindings. This will revert the OpenGL bindings to work exactly like OpenAL, with the same "incompatible context" detection. If you only create a single context, or if every context you create is compatible with each other (same capabilities & same function pointers), then the lookup will be very cheap (no TLS involved).

Berstanio commented 1 year ago

Hey @Spasi , thank you for the clarifications on OpenGL an OpenAL!

Anyway, I'm not sure why Native Image would have trouble with the JNIEnv trick and I would be interested to find out if you manage to figure it out.

I'm not quite sure yet too, my first guess would be, that the struct is static in the executable data section or something like that. But this is just a guess. I opened a graal issue about it, feel free to follow/participate on the ticket if you like! https://github.com/oracle/graal/issues/6391

Until the issue is properly sorted out on the graal end, your workaround will be very helpful to further test around with native image and lwjgl3, thank you a lot for pointing me out to this!

Berstanio commented 1 year ago

Hey @Spasi ,

I was able to create substitutions for the OpenGL classes, and now native-images runs fine on linux and mac without a fork, thanks a lot again 🥳 https://github.com/Berstanio/gdx-graalhelper/blob/dev/backends/lwjgl3/src/main/java/org/lwjgl/opengl/Lwjgl3NativeSubstitutions.java https://github.com/Berstanio/gdx-graalhelper/blob/dev/backends/lwjgl3/src/main/java/com/anyicomplex/gdx/lwjgl3/svm/Lwjgl3Substitutions.java

However, windows is still a problem, since it also uses reserved2. https://github.com/oracle/graal/issues/6428#issuecomment-1510387011 Since I couldn't think of a easy workaround, I wanted to ask you, whether one comes to your mind?

Also, it seems lucky that linux/mac currently works, since it seems there are also code pathes that want to access the EnvData in reserved2.

Spasi commented 1 year ago

Hey @Berstanio,

That one won't be as easy to fix and indeed, it affects Linux/macOS (errno) as well as Windows (LastError).

You'll have to revert saveErrno, saveLastError and the corresponding getter functions to the old implementation, which used native TLS for storage. See this commit for details.

Btw, got any insight to what may be going wrong with Native Image? Is it using the reserved fields for something internal? All four of them?

Berstanio commented 1 year ago

Hey @Spasi ,

That one won't be as easy to fix

That was what I was afraid of. I don't really want to build my own natives for every platform etc.

Buut, I have some rather good news:

Btw, got any insight to what may be going wrong with Native Image?

Yes, after some digging I think I have now some understanding about it! native-image also doesn't do anything with these fields. However, I inspected the location of the JNINativeInterface_ struct with vmmap and got:

--->  mapped file                 101a00000-101d00000    [ 3072K   704K     0K     0K] r--/rwx SM=COW          Object_id=3198563b
      mapped file                 101d00000-102024000    [ 3216K  1008K   624K     0K] rw-/rwx SM=COW          Object_id=2b8bfa3b

This is the result on a "default" execution of native-image. By default native-image has Isolates activated, which seems to mmap the static heap per isolate. That seems to be, what we see here. It gets a bit clearer when deactivating Isolates:

--->  __DATA                      102e64000-1032b4000    [ 4416K   912K     0K     0K] r--/rw- SM=COW          /Users/USER/*/ReservedSegfault-1.0-SNAPSHOT_NO_ISOLATES
      __DATA                      1032b4000-103588000    [ 2896K   464K   368K    16K] rw-/rw- SM=COW          /Users/USER/*/ReservedSegfault-1.0-SNAPSHOT_NO_ISOLATES

So, the JNINativeInterface_ struct lies in a section of the binary, that was made read only on purpose by native-image.

The easiest way I found to work around this is to disable Isolates and disable the "Read-Only" property of the __DATA section, by passing: -H:+ForceNoROSectionRelocations -H:-SpawnIsolates as a native-image parameter. This way, a write to the JNINativeInterface struct doesn't segfault. However, because "reserved0-3" seem to be initialized with garbage at the start (not NULL), ThreadLocalUtils#setupEnvData needs to be called manually. This is all not ideal, but it works good enough that I can use it, until graal has a proper fix for being able to write to reserved0-3. What I'm wondering a bit now is though, whether the native-image JNINativeInterface struct is even thread local, if it seems to be hard coded in the __DATA section? I guess with Isolates it is, but not sure whether without. Anyway, thanks for all your insights and help on this!

One last thing I figured I might should mention, it may be that Espresso Truffle (a graal Java interpreter) uses the reserved0-2 fields internally, just so that you are aware. But I guess, no one runs java games through a interpreter, so shouldn't be any problem.

Spasi commented 1 year ago

Hey @Berstanio,

Thanks for the interesting information! Some feedback that might help:

I cloned Graal and examined the code a bit, but couldn't find a better explanation. I'm not at all familiar with that codebase, so you might have more luck than me.

What I did find though is why reserved0-3 have non-NULL, seemingly garbage data. Graal initializes the JNINativeInterfaces_ struct by setting all fields (included the reserved ones) to a function called UnimplementedWithJNIEnvArgument, which simply kills the JVM when called. It then sets the actual JNI function pointers, depending on the supported JNI version. I think we might be able to work around this by comparing reserved3 to reserved0 instead of NULL?

Spasi commented 1 year ago

Oh no, I forgot about: https://github.com/LWJGL/lwjgl3/blob/4f9767cd5eb87c0e682ec5107b3601bc409d93a9/modules/lwjgl/core/src/main/java/org/lwjgl/system/ThreadLocalUtil.java#L177

@Berstanio could you please confirm that this is where it crashes on Native Image (with Isolates enabled)?

It would be good news if that is indeed the case. We don't have to use FUNCTION_MISSING_ABORT at all, it is just a "friendlier" way to crash the JVM in programs that are already broken.

Berstanio commented 1 year ago

Hey @Spasi ,

First of all, thanks for all the clarifications and insights! Well, I have weird findings to report and I honestly don't know, whether I'm missing something completly obvious. On mac, it now just works. Without any modification, without any extra command line parameters, it just runs flawless. I have not upgraded my graal version, I have made sure, that all the correct code is pulled in. And I know this was broken on mac! mac was the first platform I started testing on and debugging this error. It is very strange and to this point I'm still not convinced, that I don't miss something obvious.

Windows there seems to be more "reasonable" working. If I patch out the "ThreadLocalUtil#setFunctionMissingAddresses" function and run ThreadLocalUtil#setupEnvData on startup manually (cause the reserved2 == null check is broken), it runs fine there. This is believable imo.

Graal initializes the JNINativeInterfaces_ struct by setting all fields (included the reserved ones) to a function called UnimplementedWithJNIEnvArgument, which simply kills the JVM when called. It then sets the actual JNI function pointers, depending on the supported JNI version. I think we might be able to work around this by comparing reserved3 to reserved0 instead of NULL?

This is a nice finding that I totally missed! The workaround is probably not ideal, but it should work!

Also on another note, should I pr my lwjgl3 reflection/jni/substituions rules for native-image? I was not sure, whether lwjgl3 will even get a next release or whether the next release will be lwjgl4?

Berstanio commented 1 year ago

Hi @Spasi ,

just a little follow up question about this:

I think we might be able to work around this by comparing reserved3 to reserved0 instead of NULL?

It would be great to have this inside lwjgl3, so I was wondering whether I should try and build a pr, or whether you want to take care of this?

Spasi commented 1 year ago

Hey @Berstanio,

I've had GraalVM downloaded and waiting on my mac for the past month... I meant to do some testing on my own before pushing fixes, but other stuff kept me super busy, then I completely forgot about it!

I'll make sure the next 3.3.3 snapshot has the reserved3 fix. You're welcome to open PRs for other stuff. However, if we also deal with setFunctionMissingAddresses, I think everything else should work as is.

Berstanio commented 1 year ago

Hey @Spasi ,

I'll make sure the next 3.3.3 snapshot has the reserved3 fix.

Thank you very much, that sounds great!

However, if we also deal with setFunctionMissingAddresses, I think everything else should work as is.

Dealing with that should be easy, I would solve it like this:
https://github.com/Berstanio/gdx-graalhelper/blob/what/backends/lwjgl3/src/main/java/com/anyicomplex/gdx/lwjgl3/svm/Lwjgl3Substitutions.java This way, the function gets only patched out for native-image

I plan to pr all my lwjgl3 native-image configuration, if I have the feeling they are good enough. If you are doing your own testing, you might want to take a look at the lwjgl3 specific configuration done here, it might save you some time: https://github.com/Berstanio/gdx-graalhelper/tree/what/backends/lwjgl3/src/main

Please note, that the "master" branch contains my old seemingly unneccessary substitutions, the "what" branch (named after the confusion I had, when stuff was just working out of nowhere), is the up-to date one.

Spasi commented 1 year ago

Hey @Berstanio,

I just pushed the native-image branch. I tested it with GraalVM CE 17-22.3.2 + Native Image on Linux and Windows and the LWJGL demos seem to be working fine.

Please let me know how it works for you.

Berstanio commented 1 year ago

Hey @Spasi ,

thank you very much for your work, I will try it out!

The biggest change (first commit) was to refactor struct & custom buffer classes to be allocated normally, instead of via Unsafe.allocateInstance. I'll have to do more testing on this (has performance implications for escape analysis), but I don't think we can avoid it. Otherwise we'd have to declare the unsafeAllocated property for all struct classes (there are literally thousands of them).

As I understand it, you don't like the idea of having Feature in the lwjgl3 codebase? Because features can register these classes conditionally, based on reachability, which avoids declaring unsafeAllocated thousands of times.

Also, I think this is not neccessary, since it is only called by JNI? https://github.com/LWJGL/lwjgl3/blob/ef498be4fcddd397bc457f7fce02e905f240848d/config/native-image/META-INF/native-image/reflect-config.json#L12-L23

Spasi commented 1 year ago

As I understand it, you don't like the idea of having Feature in the lwjgl3 codebase?

Yeah, I mean, if we can just put these json files under META-INF in lwjgl.jar, then it should just work for downstream users, right? We don't need a dependency to the GraalVM SDK and having to compile extra code.

Because features can register these classes conditionally, based on reachability, which avoids declaring unsafeAllocated thousands of times.

Not sure I understand this, could you post some example code?

Also, I think this is not neccessary, since it is only called by JNI?

I think it's necessary because we use reflection here:

https://github.com/LWJGL/lwjgl3/blob/ef498be4fcddd397bc457f7fce02e905f240848d/modules/lwjgl/core/src/main/java/org/lwjgl/system/Callback.java#L97

and pass the method to JNI to set it up (with FromReflectedMethod).

Berstanio commented 1 year ago

Not sure I understand this, could you post some example code?

Sure, I have done this in my code:

            access.registerSubtypeReachabilityHandler((duringAnalysisAccess, aClass) ->
                            // Would only need to be one constructor I think, but w/e
                            RuntimeReflection.register(aClass.getDeclaredConstructors()),
                    org.lwjgl.system.Struct.class);
            access.registerSubtypeReachabilityHandler((duringAnalysisAccess, aClass) ->
                            // Would only need to be one constructor I think, but w/e
                            RuntimeReflection.register(aClass.getDeclaredConstructors()),
                    org.lwjgl.system.StructBuffer.class);

This way, all structs are able to get allocated by unsafe, but only if they are reachable. This kind of conditional registration can't be currently done using the jsons, sadly.

I think it's necessary because we use reflection here: https://github.com/LWJGL/lwjgl3/blob/ef498be4fcddd397bc457f7fce02e905f240848d/modules/lwjgl/core/src/main/java/org/lwjgl/system/Callback.java#L97 and pass the method to JNI to set it up (with FromReflectedMethod).

Ahh, I see! Since it is a direct and clear reflection access, native-image is probably able to register that automatically based on analysis, so I never had a problem with not defining it.

Spasi commented 1 year ago

registerSubtypeReachabilityHandler

Right, this does indeed apply the handler to all subclasses. However, we won't be needing it, I've now performance tested the refactored structs and there hasn't been a negative impact.

native-image is probably able to register that automatically based on analysis

You're correct, works fine without it.

I've updated the branch with various cleanups.

Berstanio commented 1 year ago

Right, this does indeed apply the handler to all subclasses. However, we won't be needing it, I've now performance tested the refactored structs and there hasn't been a negative impact.

Ahh, thats great to hear!

I've updated the branch with various cleanups.

Thank you very much, I tested them on mac and it works fine! Once the fixes are integrated into a snapshot, I can also do more complex tests on other platforms.

Spasi commented 10 months ago

@Berstanio thank you for your patience and great feedback! The above have been integrated and will be available in the next 3.3.3 snapshot.

Berstanio commented 10 months ago

Thank you very much for implementing this, I highly appreciate it!

fniephaus commented 9 months ago

Disclaimer: I work on the GraalVM team.

Good to see that lwjgl3 is now compatible with GraalVM Native Image! If this is tested somewhere (e.g., using GitHub Actions), maybe the library could be added to the list at https://www.graalvm.org/native-image/libraries-and-frameworks/? Feel free to get in touch if you have any questions!

Spasi commented 9 months ago

Hey @fniephaus,

There is now a GitHub Actions workflow that builds and tests LWJGL, then runs a simple demo with OpenJDK JIT, GraalVM JIT and GraalVM Native Image:

build.yml

Afaik GraalVM cannot cross-compile, so only the Linux x64, macOS x64 and Windows x64 jobs build and run the demo.

fniephaus commented 8 months ago

Cool, thanks for the info, @Spasi. I think the library may qualify to listed as "fully tested" at https://www.graalvm.org/native-image/libraries-and-frameworks/. Click here for instructions what you'd need to do. :)