GoogleCloudPlatform / native-image-support-java

Enables GraalVM Native Image support for Google Cloud Java Client Libraries.
Apache License 2.0

Static linking of native resources #128

Closed JonathanGiles closed 1 year ago

JonathanGiles commented 3 years ago

Hi there - very nice project you have here! :-)

I'm curious whether the resource-config.json file you have means that the native dependencies referenced there are dynamically linked at runtime on the user's machine (if they exist) rather than statically compiled into the final native image executable. Statically linking the native dependencies seems to be a separate step, so I'm curious what your thinking is here.

Thanks!

dzou commented 3 years ago

Hey @JonathanGiles, thanks for asking.

I'm not too familiar with the lower-level details of how this works (i.e., whether these are linked statically or dynamically), but I can share how Netty uses the native libraries, and maybe that offers some insight.

I'm not sure whether this sheds more light on the static/dynamic linking question, but our strategy here was to just include the libraries as resources and let Netty load the native libraries via its own method.

JonathanGiles commented 3 years ago

Thanks for taking the time to reply. When I investigated this, I found that no matter what I did, I couldn't get my native image executable to report that the epoll or OpenSSL functionality was available, and I couldn't get it to report any SSLContext type other than the default JDK built-in JdkSslServerContext variant, even though it worked as expected when run as a Java application. This was the code I ran in both the Java and native-image variants:

System.out.println("Epoll.isAvailable(): " + Epoll.isAvailable());
System.out.println("OpenSsl.isAvailable(): " + OpenSsl.isAvailable());

final SelfSignedCertificate ssc = new SelfSignedCertificate("blah.net");
final SslContext sslCtx = SslContextBuilder.forServer(ssc.certificate(), ssc.privateKey()).build();

System.out.println("SSL Context is " + sslCtx + ", class is " + sslCtx.getClass());

My interpretation of this is that the libraries are being included as resources in the native image build, but they are not being found at runtime and therefore are not being loaded. This means that Netty is falling back to using the Java code instead, which is of course slower.

I don't know if this is because I am on macOS, or if the issue is the same on other platforms. I would love to hear whether you get different output from the code above.

dzou commented 3 years ago

Hmm I see. Yeah I get the same output, looks like it is not getting loaded correctly:

Epoll.isAvailable(): false
OpenSsl.isAvailable(): false

Thanks for catching this; I will take a closer look to see how to fix this. Maybe we do need to add the extra steps you mentioned in https://github.com/oracle/graal/issues/3359

Edit: Taking a closer read at the bug you linked, I think it addresses the same issue we have:

native-image can currently produce static or mostly static executables, except for object files loaded with System/loadLibrary when using JNI.

And indeed we've been relying on Netty to do the loading with System.loadLibrary(..).

Looking at the Netty code there's a lot of fallback conditions and errors that get suppressed; I'm going to first try to see if I can get those .debug(..) log statements to print and make sure we didn't miss an error. Then will try the Feature approach discussed in the bug.

JonathanGiles commented 3 years ago

I tried a lot of different variations of the Feature approach - but I couldn't find the right variation, because in all cases the library was not being found and this resulted in the less optimal JDK SSL path being taken.

The code I ended up with (with a number of different experiments clearly showing) is below. I've not cleaned it up, as it gives some indication of the paths I went down. Note the first method is something I want to remove once the resources are statically linked. Also note that I'm developing on macOS, so the code will need to be extended for Linux or Windows.

import com.oracle.svm.core.configure.ResourcesRegistry;
import com.oracle.svm.core.jdk.NativeLibrarySupport;
import com.oracle.svm.core.jdk.PlatformNativeLibrarySupport;
import com.oracle.svm.hosted.FeatureImpl;
import com.oracle.svm.hosted.c.NativeLibraries;
import org.graalvm.nativeimage.ImageSingletons;
import org.graalvm.nativeimage.Platform;
import org.graalvm.nativeimage.hosted.Feature;

/**
 * This class registers native libraries that Netty supports. If the feature is enabled, these libraries will be
 * statically linked into the resulting native image. If this feature is not enabled, the standard JDK implementations
 * of these features will be used instead (which is slower, but results in a small native image).
 */
public class NativeNettyLibsFeature implements Feature {
    @Override
    public void duringSetup(DuringSetupAccess access) {
        // NOTE - this code is only here to show my approach *BEFORE* I realised I wasn't including the libraries statically. 
        // Ideally it would all be deleted in favour of the static linking below.
        ResourcesRegistry resourceRegistry = ImageSingletons.lookup(ResourcesRegistry.class);

        if (Platform.includedIn(Platform.WINDOWS_AMD64.class)) {
            resourceRegistry.addResources("\\QMETA-INF/native/netty_tcnative_windows_x86_64.dll\\E");
        }

        if (Platform.includedIn(Platform.LINUX_AMD64.class)) {
            resourceRegistry.addResources("\\QMETA-INF/native/libnetty_transport_native_epoll_x86_64.so\\E");
            resourceRegistry.addResources("\\QMETA-INF/native/libnetty_tcnative_linux_x86_64.so\\E");
        }
        if (Platform.includedIn(Platform.LINUX_AARCH64.class)) {
            resourceRegistry.addResources("\\QMETA-INF/native/libnetty_tcnative_linux_aarch_64.so\\E");
        }

        if (Platform.includedIn(Platform.DARWIN_AMD64.class)) {
            resourceRegistry.addResources("\\QMETA-INF/native/libnetty_resolver_dns_native_macos_x86_64.jnilib\\E");
            resourceRegistry.addResources("\\QMETA-INF/native/libnetty_tcnative_osx_x86_64.jnilib\\E");
            resourceRegistry.addResources("\\QMETA-INF/native/libnetty_transport_native_kqueue_x86_64.jnilib\\E");
        }
        if (Platform.includedIn(Platform.DARWIN_AARCH64.class)) {
            // nothing yet...
        }
    }

    @Override
    public void beforeAnalysis(BeforeAnalysisAccess access) {
        if (Platform.includedIn(Platform.DARWIN.class)) {
//            NativeLibrarySupport.singleton().preregisterUninitializedBuiltinLibrary("netty_resolver_dns_native_macos");
//            PlatformNativeLibrarySupport.singleton().addBuiltinPkgNativePrefix(
//                "io_netty_resolver_dns_macos_MacOSDnsServerAddressStreamProvider");

            NativeLibrarySupport.singleton().preregisterUninitializedBuiltinLibrary("netty_tcnative");
            NativeLibrarySupport.singleton().preregisterUninitializedBuiltinLibrary("io_netty_internal_tcnative_Library_netty_tcnative");
//            PlatformNativeLibrarySupport.singleton().addBuiltinPkgNativePrefix("io.netty.internal.tcnative");
            PlatformNativeLibrarySupport.singleton().addBuiltinPkgNativePrefix(
                "io_netty_internal_tcnative_SSL," +
                    "io_netty_internal_tcnative_SSLContext");

//            NativeLibrarySupport.singleton().preregisterUninitializedBuiltinLibrary("netty_transport_native_kqueue");
//            PlatformNativeLibrarySupport.singleton().addBuiltinPkgNativePrefix(
//                "io_netty_channel_kqueue_BsdSocket," +
//                    "io_netty_channel_kqueue_Native," +
//                    "io_netty_channel_kqueue_KQueueEventArray," +
//                    "io_netty_channel_kqueue_KQueueStaticallyReferencedJniMethods");

            NativeLibraries nativeLibraries = ((FeatureImpl.BeforeAnalysisAccessImpl) access).getNativeLibraries();
//            nativeLibraries.addStaticJniLibrary("netty_resolver_dns_native_macos");
            nativeLibraries.addStaticJniLibrary("netty_tcnative");
            nativeLibraries.addStaticJniLibrary("io_netty_internal_tcnative_Library_netty_tcnative");
//            nativeLibraries.addStaticJniLibrary("netty_transport_native_kqueue");

//            nativeLibraries.addDynamicNonJniLibrary("libnetty_tcnative_osx_x86_64.jnilib");
//            nativeLibraries.addDynamicNonJniLibrary("libnetty_transport_native_kqueue_x86_64.jnilib");
        }
    }
}

As you can see in https://github.com/netty/netty-tcnative/blob/main/openssl-dynamic/src/main/java/io/netty/internal/tcnative/Library.java#L137, there is interesting prefixing that is occurring too, so that is why I tried, e.g. netty_tcnative and the longer form.

chanseokoh commented 3 years ago

Here's an interesting thing:

I also believe that System.loadLibrary() in a native image will still need to dynamically load a shared library from the runtime environment (unless you do some special Feature step to statically link a native library), whether the image is statically linked (native-image --static) or not.

However, fortunately Netty has its own special logic to load a native library, and this applies to the epoll library (not 100% sure about tcnative though). As said earlier, the Netty JAR already embeds the shared libraries, and there is a reason for that. When it can't load a library from the system, it opens the embedded .so file (assumed to be in META-INF/native) as a resource stream (after determining the right binary based on the current OS and arch), creates a temp file, writes the contents of the embedded .so file to it, and loads that temp file. This logic still runs and works in the AOT-compiled native image.
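
The extract-then-load sequence described above can be sketched in plain Java as follows. This is only an illustration of the idea, not Netty's actual loader: the class name EmbeddedLibraryLoader and both method names are hypothetical, and the real implementation has many more fallbacks.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper for illustration; this is NOT Netty's loader. */
final class EmbeddedLibraryLoader {

    /** Copies an embedded library stream into a fresh temp file and returns its path. */
    static Path extractToTempFile(InputStream embedded, String prefix, String suffix)
            throws IOException {
        Path tmp = Files.createTempFile(prefix, suffix);
        Files.copy(embedded, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    /** Finds META-INF/native/<fileName> on the classpath, extracts it, and loads it. */
    static void loadEmbedded(String fileName, boolean deleteAfterLoading) throws IOException {
        String resource = "META-INF/native/" + fileName;
        ClassLoader cl = EmbeddedLibraryLoader.class.getClassLoader();
        try (InputStream in = cl.getResourceAsStream(resource)) {
            if (in == null) {
                throw new UnsatisfiedLinkError("Resource not found: " + resource);
            }
            int dot = fileName.lastIndexOf('.');
            String prefix = dot > 0 ? fileName.substring(0, dot) : fileName;
            String suffix = dot > 0 ? fileName.substring(dot) : "";
            Path tmp = extractToTempFile(in, prefix, suffix);
            try {
                // System.load takes an absolute path, unlike System.loadLibrary.
                System.load(tmp.toAbsolutePath().toString());
            } finally {
                if (deleteAfterLoading) {
                    // Best effort only; some platforms refuse to delete a loaded library.
                    tmp.toFile().delete();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Demonstrates only the extraction step, using dummy bytes instead of a real library.
        Path tmp = extractToTempFile(new ByteArrayInputStream(new byte[] {1, 2, 3}), "libdemo", ".so");
        System.out.println("Extracted to " + tmp);
        Files.deleteIfExists(tmp);
    }
}
```

In a native image, the lookup in loadEmbedded only finds the library if the file was registered as a resource at build time, which is the role resource-config.json plays in this discussion.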

For example, if you pass -Dio.netty.native.deleteLibAfterLoading=false when running a native image, you will be able to verify that these temp files have been created. The operations can also be observed from the strace output.

$ strace -e open,openat target/com.example.driver -Dio.netty.native.deleteLibAfterLoading=false
...
openat(AT_FDCWD, "/home/chanseok/tmp/gitrepos/native-lib-test/libio_grpc_netty_shaded_netty_transport_native_epoll_x86_64.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
...
openat(AT_FDCWD, "/tmp/libio_grpc_netty_shaded_netty_transport_native_epoll_x86_641025032029681471214.so", O_RDWR|O_CREAT|O_EXCL, 0666) = 9
openat(AT_FDCWD, "/tmp/libio_grpc_netty_shaded_netty_transport_native_epoll_x86_641025032029681471214.so", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 9
openat(AT_FDCWD, "/tmp/libio_grpc_netty_shaded_netty_transport_native_epoll_x86_641025032029681471214.so", O_RDONLY|O_CLOEXEC) = 9
...
openat(AT_FDCWD, "/home/chanseok/tmp/gitrepos/native-lib-test/libio_grpc_netty_shaded_netty_transport_native_epoll.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
...
(... the same sequence repeats for libio_grpc_netty_shaded_netty_tcnative ...)
...
$ ls /tmp/*epoll*.so
/tmp/libio_grpc_netty_shaded_netty_transport_native_epoll_x86_641025032029681471214.so

But the native image fails to load the library, unfortunately. I saw the following error at one point while I was debugging, and the JNI version returned as -1 makes me think that it's failing inside JNI_OnLoad in the epoll native code for some reason. The log message also confirms that the native code was executed from the temp .so file.

Exception in thread "main" java.lang.UnsatisfiedLinkError: Unsupported JNI version 0xffffffff, required by /tmp/libio_grpc_netty_shaded_netty_transport_native_epoll_x86_6410688026722735409597.so
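
As a small worked example of why 0xffffffff points at a returned -1: in jni.h, JNI_ERR is defined as -1, and a 32-bit -1 printed in hexadecimal is 0xffffffff, whereas a valid version constant such as JNI_VERSION_1_8 is 0x00010008. The sketch below only demonstrates that arithmetic; whether JNI_OnLoad in the epoll library actually returned JNI_ERR here remains a hypothesis, and the class name is made up for illustration.

```java
/** Illustrative only: shows how a JNI_OnLoad return value of -1 renders in hex. */
final class JniVersionHex {
    static final int JNI_ERR = -1;                 // value defined in jni.h
    static final int JNI_VERSION_1_8 = 0x00010008; // value defined in jni.h

    /** Formats a 32-bit value the way the error message does: 0x plus eight hex digits. */
    static String asHex(int value) {
        return String.format("0x%08x", value);
    }

    public static void main(String[] args) {
        System.out.println(asHex(JNI_ERR));         // prints 0xffffffff
        System.out.println(asHex(JNI_VERSION_1_8)); // prints 0x00010008
    }
}
```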

dzou commented 3 years ago

Nice discoveries @chanseokoh.

Exception in thread "main" java.lang.UnsatisfiedLinkError: Unsupported JNI version 0xffffffff, required by /tmp/libio_grpc_netty_shaded_netty_transport_native_epoll_x86_6410688026722735409597.so

Yeah I encountered this error too. I tested this code out today and made a mini-reproducer of the code snippet @JonathanGiles provided. I also dug up the old jni-config.json that we used to have in this repo.

Not sure if the discovery is useful, but I found out that if we add the jni-config.json to the reproducer, it gets the native library to load properly.

One can try with mvn package -Pnative and then ./com.example.driver in the project to compile with GraalVM and then run to verify. Probably for macOS you will need to modify the resource-config.json so it adds the shared libs you need.

May 12, 2021 6:42:20 PM com.example.Driver main
INFO: Epoll.isAvailable(): true
May 12, 2021 6:42:20 PM com.example.Driver main
INFO: OpenSsl.isAvailable(): true
May 12, 2021 6:42:21 PM com.example.Driver main
INFO: SSL Context is io.grpc.netty.shaded.io.netty.handler.ssl.OpenSslServerContext@70cdad0d, class is class io.grpc.netty.shaded.io.netty.handler.ssl.OpenSslServerContext

But yeah, now I am kind of curious what is the relationship between the jni-config.json and this stuff being done in the Features.

I'll give the Features a try next. Yeah I see, it kind of seems silly to add the shared lib as a resource in the image for it to be later dynamically linked rather than try to just statically link at build time.

I also believe that System.loadLibrary() in a native image will still need to dynamically load a shared library from the runtime environment (unless you do the special Feature step), whether the image is statically linked (native-image --static) or not.

I was curious what this means @chanseokoh; is dynamic linking unavoidable if you use System.loadLibrary? Or if you did the prework in the feature to statically link then does it work transparently? Hm, maybe we must try to find out...

The code I ended up with (with a number of different experiments clearly showing) is below. I've not cleaned it up, as it gives some indication of the paths I went down. Note the first method is something I want to remove once the resources are statically linked.

@JonathanGiles - I see. Does your solution already work without the duringSetup step (or was that just for illustration purposes)? Or were there still blockers that you had?

Also general tip: You will be able to see at a low-level what is going on in Netty if you turn your log levels to debug. This can be done by pasting this code at the top of your main method:

    Logger root = Logger.getLogger("");
    root.setLevel(Level.FINEST);
    root.addHandler(new StreamHandler(System.out, new SimpleFormatter()));

    for (Handler handler : root.getHandlers()) {
      handler.setLevel(Level.FINEST);
    }

JonathanGiles commented 3 years ago

With that logging hint, I can see that on macOS I am getting the following error reported:

Epoll.isAvailable(): false
May 13, 2021 4:04:22 PM io.netty.util.internal.NativeLibraryLoader loadLibrary
FINE: Successfully loaded the library netty_tcnative_osx_x86_64
May 13, 2021 4:04:22 PM io.netty.handler.ssl.OpenSsl <clinit>
FINE: Initialize netty-tcnative using engine: 'default'
May 13, 2021 4:04:22 PM io.netty.handler.ssl.OpenSsl <clinit>
FINE: Failed to initialize netty-tcnative; OpenSslEngine will be unavailable. See https://netty.io/wiki/forked-tomcat-native.html for more information.
java.lang.UnsatisfiedLinkError: io.netty.internal.tcnative.Library.aprMajorVersion()I [symbol: Java_io_netty_internal_tcnative_Library_aprMajorVersion or Java_io_netty_internal_tcnative_Library_aprMajorVersion__]
        at com.oracle.svm.jni.access.JNINativeLinkage.getOrFindEntryPoint(JNINativeLinkage.java:153)
        at com.oracle.svm.jni.JNIGeneratedMethodSupport.nativeCallAddress(JNIGeneratedMethodSupport.java:57)
        at io.netty.internal.tcnative.Library.aprMajorVersion(Library.java)
        at io.netty.internal.tcnative.Library.initialize(Library.java:149)
        at io.netty.handler.ssl.OpenSsl.initializeTcNative(OpenSsl.java:597)
        at io.netty.handler.ssl.OpenSsl.<clinit>(OpenSsl.java:153)
        at com.oracle.svm.core.classinitialization.ClassInitializationInfo.invokeClassInitializer(ClassInitializationInfo.java:375)
        at com.oracle.svm.core.classinitialization.ClassInitializationInfo.initialize(ClassInitializationInfo.java:295)

kkriske commented 3 years ago

I was curious what this means; is dynamic linking unavoidable if you use System.loadLibrary? Or if you did the prework in the feature to statically link then does it work transparently? Hm, maybe we must try to find out...

With the right Feature setup, the System.loadLibrary call will know that the library you want is statically linked, and no dynamic libs will be required.

chanseokoh commented 3 years ago

Yeah I see, it kind of seems silly to add the shared lib as a resource in the image for it to be later dynamically linked rather than try to just statically link at build time.

Statically linking libraries at compile-time: I believe theoretically this is doable in our case, but there are also hurdles. In the end, it all boils down to using the linker ld, so the arguments below apply not only to JNI in Java but to any language.

Java is portable, but these native libraries are not. This is why grpc-netty-shaded embeds many platform-specific libraries, e.g., for Windows, Mac, Linux, x86_64, aarch64, etc.

$ ls -1 grpc-netty-shaded/META-INF/native
io_grpc_netty_shaded_netty_tcnative_windows_x86_64.dll
libio_grpc_netty_shaded_netty_tcnative_linux_aarch64.so
libio_grpc_netty_shaded_netty_tcnative_linux_x86_64.so
libio_grpc_netty_shaded_netty_tcnative_osx_x86_64.jnilib
libio_grpc_netty_shaded_netty_transport_native_epoll_x86_64.so

Therefore, to make the Java Netty look portable as if by magic, they added special logic to determine which library file to load based on the OS and arch (e.g., here and here).
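
A rough sketch of that kind of platform-selection logic is shown below. It is not Netty's code: the class NativeLibNames and its methods are hypothetical, and the naming scheme (base name, then OS, then arch, with a platform-specific prefix and extension) is inferred only from the tcnative file names in the listing above. Other libraries, such as the epoll one, follow a slightly different pattern.

```java
import java.util.Locale;

/** Hypothetical helper that maps os.name / os.arch to an embedded library file name. */
final class NativeLibNames {

    static String normalizeOs(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.startsWith("linux")) return "linux";
        if (os.startsWith("mac") || os.startsWith("darwin")) return "osx";
        if (os.startsWith("windows")) return "windows";
        throw new IllegalArgumentException("Unsupported OS: " + osName);
    }

    static String normalizeArch(String osArch) {
        String arch = osArch.toLowerCase(Locale.ROOT);
        if (arch.equals("amd64") || arch.equals("x86_64")) return "x86_64";
        if (arch.equals("aarch64") || arch.equals("arm64")) return "aarch64";
        throw new IllegalArgumentException("Unsupported arch: " + osArch);
    }

    /** Builds names in the style of the tcnative entries listed above (an assumption). */
    static String fileName(String baseName, String osName, String osArch) {
        String os = normalizeOs(osName);
        String stem = baseName + "_" + os + "_" + normalizeArch(osArch);
        switch (os) {
            case "windows": return stem + ".dll";
            case "osx":     return "lib" + stem + ".jnilib";
            default:        return "lib" + stem + ".so";
        }
    }

    public static void main(String[] args) {
        // Resolves the name for whatever platform this JVM is running on.
        System.out.println(fileName("io_grpc_netty_shaded_netty_tcnative",
                System.getProperty("os.name"), System.getProperty("os.arch")));
    }
}
```

The same selection would have to happen at build time instead of at runtime if the libraries were linked statically.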

Anyways, this mechanism is dynamic loading of shared libraries.

If you want to statically link libraries, then you need to link them at build time, of course. That means you will also need special logic to choose the right binary based on the platform where you build. (It's not difficult to do so though. The code just needs to be slightly more sophisticated.)

However, to do static linking, you need static libraries (i.e., .a files, not .so). Unless someone periodically pre-compiles and releases static libraries, you'll have to build them from source (e.g., from these epoll .c files) yourself on some popular platforms at least.

With the right Feature setup, the System.loadLibrary call will know that the library you want is statically linked, and no dynamic libs will be required.

I confirm this too. This matches my observation.

dzou commented 3 years ago

With that logging hint, I can see that on macOS I am getting the following error reported:

Epoll.isAvailable(): false
May 13, 2021 4:04:22 PM io.netty.util.internal.NativeLibraryLoader loadLibrary
FINE: Successfully loaded the library netty_tcnative_osx_x86_64
May 13, 2021 4:04:22 PM io.netty.handler.ssl.OpenSsl <clinit>
FINE: Initialize netty-tcnative using engine: 'default'
May 13, 2021 4:04:22 PM io.netty.handler.ssl.OpenSsl <clinit>
FINE: Failed to initialize netty-tcnative; OpenSslEngine will be unavailable. See https://netty.io/wiki/forked-tomcat-native.html for more information.
java.lang.UnsatisfiedLinkError: io.netty.internal.tcnative.Library.aprMajorVersion()I [symbol: Java_io_netty_internal_tcnative_Library_aprMajorVersion or Java_io_netty_internal_tcnative_Library_aprMajorVersion__]
        at com.oracle.svm.jni.access.JNINativeLinkage.getOrFindEntryPoint(JNINativeLinkage.java:153)
        at com.oracle.svm.jni.JNIGeneratedMethodSupport.nativeCallAddress(JNIGeneratedMethodSupport.java:57)
        at io.netty.internal.tcnative.Library.aprMajorVersion(Library.java)
        at io.netty.internal.tcnative.Library.initialize(Library.java:149)
        at io.netty.handler.ssl.OpenSsl.initializeTcNative(OpenSsl.java:597)
        at io.netty.handler.ssl.OpenSsl.<clinit>(OpenSsl.java:153)
        at com.oracle.svm.core.classinitialization.ClassInitializationInfo.invokeClassInitializer(ClassInitializationInfo.java:375)
        at com.oracle.svm.core.classinitialization.ClassInitializationInfo.initialize(ClassInitializationInfo.java:295)

To fix these you'll have to manually iterate and add the classes found in the UnsatisfiedLinkError (like the io.netty.internal.tcnative.Library.aprMajorVersion) to the jni-config.json. And just keep trying rebuild/update until it works.
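
For reading such errors, it can help to know that the two symbols in the UnsatisfiedLinkError follow the JNI naming convention: Java_, then the mangled class name, an underscore, and the mangled method name, with the long form appending two underscores and the mangled argument signature. The sketch below reproduces that mangling so the class and method behind a missing symbol can be checked by hand; the class JniSymbolNames is a hypothetical helper and not part of GraalVM or Netty.

```java
/** Hypothetical helper that derives JNI symbol names as described in the JNI specification. */
final class JniSymbolNames {

    /** Applies the JNI escape rules to a class name, method name, or argument signature. */
    static String mangle(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '.' || c == '/') {
                sb.append('_');            // package separators become underscores
            } else if (c == '_') {
                sb.append("_1");
            } else if (c == ';') {
                sb.append("_2");
            } else if (c == '[') {
                sb.append("_3");
            } else if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')) {
                sb.append(c);
            } else {
                sb.append(String.format("_0%04x", (int) c)); // e.g. '$' becomes _00024
            }
        }
        return sb.toString();
    }

    /** Short form: used when the native method is not overloaded. */
    static String shortName(String className, String methodName) {
        return "Java_" + mangle(className) + "_" + mangle(methodName);
    }

    /** Long form: short form plus "__" and the mangled argument signature. */
    static String longName(String className, String methodName, String argSignature) {
        return shortName(className, methodName) + "__" + mangle(argSignature);
    }

    public static void main(String[] args) {
        String cls = "io.netty.internal.tcnative.Library";
        // prints Java_io_netty_internal_tcnative_Library_aprMajorVersion
        System.out.println(shortName(cls, "aprMajorVersion"));
        // prints Java_io_netty_internal_tcnative_Library_aprMajorVersion__
        System.out.println(longName(cls, "aprMajorVersion", ""));
    }
}
```

Both printed names match the symbols reported in the log above for aprMajorVersion()I, which takes no arguments.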

Based on what @chanseokoh said, I'm not sure if static linking is possible then. In the past I went down the rabbit hole of trying to get the native libraries to load at image build time (this can be done by reverting the native-image.properties and setting the OpenSsl and ssl-related classes to be initialized at build-time). But I was never able to get this to work; I'd always get some sort of seg-fault error at runtime. So maybe we're stuck with just loading libraries dynamically at runtime.

chanseokoh commented 3 years ago

Probably for macOS you will need to modify the resource-config.json so it adds the shared libs you need.

In my testing, the sample works without this particular file. Seems unnecessary.

In the past I went down the rabbit hole of trying to get the native libraries to load at image build time

I'm just speculating, but this might be relevant. To me, it doesn't make much sense that a shared library could be dynamically loaded at build time in a way that makes the actual calls work at runtime. (Does "loading" mean somehow embedding platform-specific function code into the native image? But then that's no longer dynamic loading.)

chanseokoh commented 3 years ago

Oh, one more thing. I've never used Mac, so didn't realize this before: epoll is not available on Mac. (The Stack Overflow answer is a bit old, but I think this is still the case.) @JonathanGiles so it's likely that you'll never get true from Epoll.isAvailable() on your Mac? Anyways, one of the answers also seems to suggest that Netty has kqueue support instead of epoll for Mac to some degree? (But I wonder why grpc-netty-shaded doesn't have a shared library for that.)

JonathanGiles commented 3 years ago

There is a tcnative-kqueue that I do use, but you are right - I never expected to see epoll as true on my Mac. I'm running out of things to try to get this to work - I suspect that for now we will just end up falling back to the JDK implementations (which are drastically slower) until a better solution is found in the GraalVM / Netty projects.

dzou commented 3 years ago

Hey there, I will look into how to enable this correctly at least for Linux.

Regarding time, I'll try to get to this, but because this does not block the functionality of the project it is a little bit lower priority for us right now as we try to expand our coverage fully for all GCP libraries.

JonathanGiles commented 3 years ago

Unrelatedly, I'm curious how you validate that your support for GraalVM is complete. In other words, what testing strategies do you use to ensure that a native app will behave exactly the same as a Java version of the app?

dzou commented 3 years ago

Good question. We currently maintain a set of sample applications which exercise different GCP client libraries and will build and run these in our CI. This is our form of validation in the early stages (which you can see is limited by the code paths you can cover).

In the long run, we would like to build the unit tests of all the client libraries as a native image and validate the tests by running the test images. This work is a GA blocker for us and we're looking into how we could do this.

mpeddada1 commented 1 year ago

Closing as part of clean up