deepjavalibrary / djl

An Engine-Agnostic Deep Learning Framework in Java
https://djl.ai
Apache License 2.0

Allowing unloading the native libraries #1421

Closed carlosuc3m closed 1 year ago

carlosuc3m commented 2 years ago

Description

I am working on an application that allows switching between different PyTorch versions dynamically. To do so, I dynamically load the JARs needed for each particular version in a child classloader of the main classloader. However, I am not able to switch between versions because the child classloader is never garbage collected, so the native library is never unloaded, and loading two native libraries, even of different versions, causes errors. Can you think of any workaround for this issue?

frankfliu commented 2 years ago

You have to use a custom ClassLoader to load DJL (more specifically, the Engine class), and you have to make sure the DJL classes are not loaded by the system classloader.

Once the custom ClassLoader object is garbage collected, you can load a different version of the engine.
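
One way to observe whether a custom ClassLoader has actually been collected is to hold it only through a WeakReference and poll after requesting a GC. The following is a minimal sketch, not DJL code: the class name UnloadCheck and the method awaitCollection are hypothetical, and System.gc() is only a hint to the JVM, so a false result does not prove that a leak exists.

```java
import java.lang.ref.WeakReference;

// Hypothetical helper, not part of DJL.
final class UnloadCheck {

    private UnloadCheck() {
    }

    /**
     * Requests a GC up to {@code attempts} times and reports whether the
     * referent of {@code ref} has been cleared. System.gc() is only a hint,
     * so "false" means "not collected yet", nothing stronger.
     */
    static boolean awaitCollection(WeakReference<?> ref, int attempts, long pauseMillis)
            throws InterruptedException {
        for (int i = 0; i < attempts && ref.get() != null; i++) {
            System.gc();
            Thread.sleep(pauseMillis);
        }
        return ref.get() == null;
    }
}
```

In use, the caller would create `new WeakReference<>(engineClassLoader)`, drop every strong reference to the loader, its classes and their instances, and only then call awaitCollection before trying to load another engine version.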

carlosuc3m commented 2 years ago

How do I avoid loading by the system ClassLoader? As far as I understand, DJL loads the engines through the thread context ClassLoader: https://github.com/deepjavalibrary/djl/blob/3fce3aa58dc252b1b33efae04c6b6e37df1ba1a9/api/src/main/java/ai/djl/engine/Engine.java#L61

carlosuc3m commented 2 years ago

The following piece of code is an example of dynamically loading the framework, and it shows how I am not able to garbage collect the new classloader:

// To check which native libraries have been loaded
Field LIBRARIES = ClassLoader.class.getDeclaredField("loadedLibraryNames");
LIBRARIES.setAccessible(true);
final Vector<String> libraries = (Vector<String>) LIBRARIES.get(Thread.currentThread().getContextClassLoader());
// Original ClassLoader
ClassLoader ogCl = Thread.currentThread().getContextClassLoader();
// Load JARs to new classloader
URL[] urls = new URL[new File(jarsDirectory).listFiles().length];
int c = 0;
for (File ff : new File(jarsDirectory).listFiles()) {
    urls[c ++] = ff.toURI().toURL();
}
URLClassLoader engineClassloader = new URLClassLoader(urls, null);
// Set the new ClassLoader as Thread ClassLoader
Thread.currentThread().setContextClassLoader(engineClassloader);
// Execute a simple command
Class<?> clM = engineClassloader.loadClass("ai.djl.ndarray.NDManager");
Method mm = clM.getMethod("newBaseManager");
Object manager = mm.invoke(null);
// Delete references to every object in the ClassLoader
clM = null;
mm = null;
manager = null;
// Set ClassLoader back
Thread.currentThread().setContextClassLoader(ogCl);
engineClassloader = null;
// Call Garbage collector
System.gc();
// Check loaded Native libraries, which are not the same
// as the original ones, Pytorch is still loaded
final Vector<String> libraries2 = (Vector<String>) LIBRARIES.get(Thread.currentThread().getContextClassLoader());

What do you think? My only idea currently is to work around the DJL code and load the native library from a classloader that is not the thread context classloader, doing something like the following:

URLClassLoader engineClassloader = new URLClassLoader(urls, ogCl);

Class<?> enginePt = engineClassloader.loadClass("ai.djl.pytorch.engine.PtEngine");
//Object enginePt = engineCl..newInstance();

Class<?> engineCl = engineClassloader.loadClass("ai.djl.engine.EngineProvider");
ServiceLoader<?> loaders = ServiceLoader.load(engineCl, engineClassloader);
Method mm = engineCl.getMethod("getEngine");
Object engine = null;
for (Object ll : loaders) {
    try {
        engine = mm.invoke(ll);
    } catch (Exception ex) {
    }
}

Keep in mind that getEngine also looks at the resources loaded in the thread context classloader. I really don't know if I am missing something, so thank you very much for your time.

frankfliu commented 2 years ago

@carlosuc3m Your solution won't work:

  1. By default, URLClassLoader delegates to the system classloader first; if the jar file is in the classpath, the system ClassLoader will always kick in. To prevent this from happening, you have to implement your own ClassLoader and use it to load all DJL classes, not just Engine (an NDManager.class loaded by the system classloader may not work with an Engine.class loaded by your ClassLoader).
  2. System.gc() may not kick in immediately; there is no guarantee that the ClassLoader will be garbage collected after this call.

You might want to consider using OSGi (might be overkill) for your use case.
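
Point 1 above can be sketched as a "child-first" loader that looks in its own JARs before delegating to the parent. This is only an illustration under stated assumptions: the class name ChildFirstClassLoader is hypothetical, it is not something DJL ships, and it does nothing about native libraries that refuse to unload.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical child-first loader; not part of DJL.
final class ChildFirstClassLoader extends URLClassLoader {

    ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                if (name.startsWith("java.")) {
                    // Core classes must always come from the parent chain.
                    c = super.loadClass(name, false);
                } else {
                    try {
                        // Look in this loader's own JARs first...
                        c = findClass(name);
                    } catch (ClassNotFoundException e) {
                        // ...and fall back to normal parent-first delegation.
                        c = super.loadClass(name, false);
                    }
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }
}
```

With such a loader, a class like ai.djl.ndarray.NDManager would be defined by the custom loader even if the same JAR were also visible to the system classloader, so all DJL classes resolve against each other consistently.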

frankfliu commented 2 years ago

@carlosuc3m Just out of curiosity, what's the use case for which you need to run multiple PyTorch versions?

carlosuc3m commented 2 years ago

Thank you for your answer @frankfliu. The JAR files corresponding to DJL are all in a directory that is not in the classpath, so in theory they should not be loaded by the system ClassLoader, should they? Do I have to call customClassLoader.loadClass("ai.djl.ndarray.NDManager") for every class in all the JAR files? And still, how do you work around the calls to Thread.currentThread().getContextClassLoader() that happen when loading the engine in: https://github.com/deepjavalibrary/djl/blob/3fce3aa58dc252b1b33efae04c6b6e37df1ba1a9/api/src/main/java/ai/djl/engine/Engine.java#L62 and https://github.com/deepjavalibrary/djl/blob/3fce3aa58dc252b1b33efae04c6b6e37df1ba1a9/api/src/main/java/ai/djl/util/Platform.java#L62

I am developing an application that is able to load pretrained Deep Learning models. For that, it should be able to switch between Deep Learning engines dynamically depending on the model selected. The plugin is oriented towards users who are not familiar at all with Deep Learning or even programming, which is why all of this should happen on the backend without the user knowing. Regards, Carlos

frankfliu commented 2 years ago

@carlosuc3m If the whole application (including the DJL jars) is not in the classpath, it should work, but you need to explicitly set the contextClassLoader. I created a test application, which loads jars from the DJL examples module:

  1. Modify examples/build.gradle to set tasks.distZip.enabled = true
  2. Build the example jars:
    cd examples
    ./gradlew dZ
    unzip build/distributions/examples-0.15.0-SNAPSHOT.zip
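
import java.io.IOException;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Objects;

public final class ClassLoaderTest {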

    private ClassLoaderTest() {
    }

    public static void main(String[] args) throws Exception {
        Path path = Paths.get("examples/examples-0.15.0-SNAPSHOT/lib");
        URL[] urls = Files.list(path).map(p -> {
                    try {
                        if (p.toString().endsWith(".jar")) {
                            return p.toUri().toURL();
                        }
                    } catch (IOException e) {
                        return null;
                    }
                    return null;
                }
        ).filter(Objects::nonNull).toArray(URL[]::new);

        test(urls);

        for (int i = 0;i < 10; ++i) {
            System.gc();
            Thread.sleep(1000);
        }

        test(urls);
    }

    public static void test(URL[] urls) throws ReflectiveOperationException {
        URLClassLoader cl = new URLClassLoader(urls);

        Thread.currentThread().setContextClassLoader(cl);
        Class<?> clazz = cl.loadClass("ai.djl.examples.inference.ObjectDetection");
        Method method = clazz.getDeclaredMethod("predict");
        method.invoke(null);
        Thread.currentThread().setContextClassLoader(null);
    }
}
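
The context-classloader handling in test() above can also be written with try/finally, so that the previous loader is restored even when the invoked code throws. This is a minimal sketch with hypothetical names (ContextLoaderScope, callWith); unlike the test program, it restores the previous context loader instead of setting it to null.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of DJL.
final class ContextLoaderScope {

    private ContextLoaderScope() {
    }

    /**
     * Runs {@code task} with {@code loader} installed as the current thread's
     * context ClassLoader, then restores whatever loader was set before.
     */
    static <T> T callWith(ClassLoader loader, Callable<T> task) throws Exception {
        Thread thread = Thread.currentThread();
        ClassLoader previous = thread.getContextClassLoader();
        thread.setContextClassLoader(loader);
        try {
            return task.call();
        } finally {
            // Restore even on failure, so the thread does not keep the engine loader alive.
            thread.setContextClassLoader(previous);
        }
    }
}
```

Restoring the previous loader matters for unloading: a thread that still has the engine loader as its context ClassLoader holds a strong reference to it, which prevents garbage collection.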

carlosuc3m commented 2 years ago

Yes, in that case it works because it loads the same classloader twice, so the native libraries of both classloaders coincide. However, if the classloaders need to load two different native libraries, an error will appear.

frankfliu commented 2 years ago

@carlosuc3m

Based on my test, the native library is unloaded and reloaded successfully. However, the inference failed when I tried PyTorch 1.10.0 and 1.9.1:

libc++abi: terminating with uncaught exception of type c10::Error: Tried to register multiple backend fallbacks for the same dispatch key AutogradCUDA; previous registration registered at ../aten/src/ATen/BatchingRegistrations.cpp:1016, new registration registered at ../aten/src/ATen/ConjugateFallback.cpp:18

It looks like PyTorch cannot unload the shared library cleanly. MXNet seems to work fine. I don't think this can be resolved by the ClassLoader.

carlosuc3m commented 2 years ago

OK, thank you for your time @frankfliu! The problematic native file seemed to be libtorch_cpu.so, and it seems that loading two of them is not possible, as per https://github.com/pytorch/pytorch/issues/70191. Regards, and thank you for your time.

saudet commented 2 years ago

@carlosuc3m JavaCPP implements a hack to allow this to work, so your use case works with the JavaCPP Presets for PyTorch: https://github.com/bytedeco/javacpp-presets/tree/master/pytorch

@frankfliu Please consider doing something like JavaCPP to accommodate users of containers like Tomcat, OSGi, etc.

carlosuc3m commented 2 years ago

Hello again @frankfliu, I am still working on this issue. Do you know the order in which the .so files are loaded on Linux? For Windows it is specified in the ai.djl.pytorch.jni.LibUtils class code, but for Linux it seems that it only loads libdjl_torch.so. Does this native library load the rest of the code? Regards, Carlos

frankfliu commented 2 years ago

@carlosuc3m For PyTorch 1.9.1 and earlier, you just need to load the libdjl_torch.so file (you must put this file in the same folder as libtorch.so).

In PyTorch 1.10.0, we manually load the .so files in the following order:

  1. All files whose names do not contain "torch", "caffe2" or "cudnn"
  2. The PyTorch-specific .so files, in the following order:
    • libfbgemm
    • libcaffe2_nvrtc
    • libtorch_cpu
    • libc10_cuda
    • libtorch_cuda_cpp
    • libtorch_cuda_cu
    • libtorch_cuda
    • libtorch
    • libdjl_torch

see: https://github.com/deepjavalibrary/djl/blob/master/engines/pytorch/pytorch-engine/src/main/java/ai/djl/pytorch/jni/LibUtils.java#L94
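
The ordering described above can be sketched as a small sorting routine. This is an illustration of the rule as stated in this comment, not the actual LibUtils code: the class LoadOrderSketch and its methods are hypothetical, the alphabetical order inside group 1 is an assumption, and files that match "torch", "caffe2" or "cudnn" but are not named in the list are simply skipped, because the comment does not say where they belong.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the load order; not the DJL implementation.
final class LoadOrderSketch {

    // PyTorch-specific libraries, in the order listed in the comment above.
    static final List<String> PYTORCH_ORDER = Arrays.asList(
            "libfbgemm", "libcaffe2_nvrtc", "libtorch_cpu", "libc10_cuda",
            "libtorch_cuda_cpp", "libtorch_cuda_cu", "libtorch_cuda",
            "libtorch", "libdjl_torch");

    private LoadOrderSketch() {
    }

    // "libtorch_cpu.so" -> "libtorch_cpu"
    static String baseName(String fileName) {
        int dot = fileName.indexOf('.');
        return dot < 0 ? fileName : fileName.substring(0, dot);
    }

    static List<String> order(Collection<String> fileNames) {
        List<String> generic = new ArrayList<>();
        Map<String, String> specific = new HashMap<>();
        for (String file : fileNames) {
            boolean isSpecific = file.contains("torch")
                    || file.contains("caffe2")
                    || file.contains("cudnn");
            if (isSpecific) {
                specific.put(baseName(file), file);
            } else {
                generic.add(file);
            }
        }
        Collections.sort(generic); // assumption: order within group 1 is not specified
        List<String> result = new ArrayList<>(generic);
        for (String name : PYTORCH_ORDER) {
            String file = specific.get(name);
            if (file != null) {
                result.add(file);
            }
        }
        return result;
    }
}
```

Each entry of the returned list would then be passed to System.load with its absolute path, so that every library's dependencies are already in the process when it is loaded.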

carlosuc3m commented 2 years ago

Hello again, I am still working on this. I have observed that on Linux some native libraries are impossible to unload. However, on Windows the library that causes the problem is the JNI DLL. Do you think it can be solved with any workaround? I've tried breaking reflection to force the unloading of native libraries, but I would like to avoid it. I also posted a question on Stack Overflow: https://stackoverflow.com/questions/70682562/jni-native-library-avoids-garbage-collection-and-unloading Thank you for your time, Carlos

frankfliu commented 2 years ago

@carlosuc3m I don't really know why it's not unloaded. Maybe you can use C++ code to load and unload the shared library and see what happens.

carlosuc3m commented 2 years ago

Yes, I have already tried dlopen and dlclose, and it did not work.

frankfliu commented 1 year ago

Closing this issue for now since there isn't much we can do. Feel free to reopen this issue if you have a new idea.