deepjavalibrary / djl

An Engine-Agnostic Deep Learning Framework in Java
https://djl.ai
Apache License 2.0

pytorch-engine:0.18.0 causes memory leak when using NDManager.newBaseManager() #1886

Closed 925781609 closed 2 years ago

925781609 commented 2 years ago

Description

  1. With pytorch-engine:0.18.0, NDManager.newBaseManager() creates a PtNDManager by calling ai.djl.pytorch.engine.PtNDManager#newSubManager, which executes:

    PtNDManager manager = new PtNDManager(this, device);
    attachUncappedInternal(manager.uid, manager);
    return manager;
  2. Method attachUncappedInternal is implemented by BaseNDManager and attaches the created PtNDManager to its field resources.

    resources.put(resourceId, resource);
  3. The created PtNDManager is never released, even after it is closed:

    
    public void close() {
        if (!closed.getAndSet(true)) {
            // some code omitted
            parent.detachInternal(uid);
            resources.clear();
            tempResources.clear();
        }
    }
The parent is PtNDManager$SystemManager, and the parent's detachInternal does nothing:

    @Override
    public void detachInternal(String resourceId) {}

As a result, the parent's resources map keeps a reference to every PtNDManager it has created, so the JVM garbage collector can never reclaim them.

  4. Downgrading pytorch-engine to version 0.17.0 avoids the problem, because there newSubManager calls attachInternal, which PtNDManager$SystemManager overrides to do nothing:

    PtNDManager manager = new PtNDManager(this, device);
    attachInternal(manager.uid, manager);
    return manager;

    @Override
    public void attachInternal(String resourceId, AutoCloseable resource) {}
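
The retention path in steps 1 to 3 can be modelled without DJL on the classpath. The sketch below is only an illustration: ToySystemManager, ToySubManager and LeakDemo are hypothetical stand-ins, not DJL classes. It assumes nothing beyond what is described above, namely that the parent stores each child in a map and that its detachInternal is a no-op.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the system-level parent manager; not DJL code.
class ToySystemManager {
    final Map<String, AutoCloseable> resources = new ConcurrentHashMap<>();
    private int counter;

    // Mirrors the 0.18.0 path: the new child is always stored in resources.
    ToySubManager newSubManager() {
        ToySubManager manager = new ToySubManager(this, "mgr-" + counter++);
        attachUncappedInternal(manager.uid, manager);
        return manager;
    }

    void attachUncappedInternal(String resourceId, AutoCloseable resource) {
        resources.put(resourceId, resource);
    }

    // Mirrors the no-op override: nothing is ever removed from resources.
    void detachInternal(String resourceId) {}
}

// Hypothetical stand-in for the per-call sub-manager.
class ToySubManager implements AutoCloseable {
    final String uid;
    private final ToySystemManager parent;

    ToySubManager(ToySystemManager parent, String uid) {
        this.parent = parent;
        this.uid = uid;
    }

    @Override
    public void close() {
        // The child asks to be detached, but the parent ignores the request.
        parent.detachInternal(uid);
    }
}

class LeakDemo {
    // Opens and closes n sub-managers, then reports how many the parent still references.
    static int retainedAfter(int n) {
        ToySystemManager system = new ToySystemManager();
        for (int i = 0; i < n; i++) {
            try (ToySubManager manager = system.newSubManager()) {
                // do something here
            }
        }
        return system.resources.size();
    }

    public static void main(String[] args) {
        System.out.println(retainedAfter(1000)); // prints 1000
    }
}
```

Every closed child stays reachable through the parent's map, so the count grows linearly with the number of calls, which matches the eventual OOM reported here.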

Expected Behavior

The SystemManager should either not attach the created PtNDManager to its resources field, or release the PtNDManager when it is closed.
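
Both options can be compared in a small self-contained model. This is a sketch under stated assumptions, not the actual DJL fix: ToyParent, NonTrackingParent, ReleasingParent, ToyChild and FixDemo are hypothetical names introduced only for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical parent abstraction; illustrative only, not DJL's API.
abstract class ToyParent {
    final Map<String, AutoCloseable> resources = new ConcurrentHashMap<>();

    abstract void attach(String resourceId, AutoCloseable resource);

    abstract void detach(String resourceId);
}

// Option A: the parent never tracks its children (attach and detach are both no-ops).
class NonTrackingParent extends ToyParent {
    @Override
    void attach(String resourceId, AutoCloseable resource) {}

    @Override
    void detach(String resourceId) {}
}

// Option B: the parent tracks children and drops the reference when they close.
class ReleasingParent extends ToyParent {
    @Override
    void attach(String resourceId, AutoCloseable resource) {
        resources.put(resourceId, resource);
    }

    @Override
    void detach(String resourceId) {
        resources.remove(resourceId);
    }
}

// Hypothetical child that registers on creation and deregisters on close.
class ToyChild implements AutoCloseable {
    private final ToyParent parent;
    private final String uid;

    ToyChild(ToyParent parent, String uid) {
        this.parent = parent;
        this.uid = uid;
        parent.attach(uid, this);
    }

    @Override
    public void close() {
        parent.detach(uid);
    }
}

class FixDemo {
    // Opens and closes n children, then reports how many the parent still references.
    static int retainedAfter(ToyParent parent, int n) {
        for (int i = 0; i < n; i++) {
            try (ToyChild child = new ToyChild(parent, "mgr-" + i)) {
                // do something here
            }
        }
        return parent.resources.size();
    }

    public static void main(String[] args) {
        System.out.println(retainedAfter(new NonTrackingParent(), 1000)); // prints 0
        System.out.println(retainedAfter(new ReleasingParent(), 1000));   // prints 0
    }
}
```

In either variant the parent holds no reference to a closed child, so closed managers become eligible for garbage collection. Option A matches the 0.17.0 behaviour described above; Option B keeps the tracking but pairs every attach with a working detach.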

Error Message

image

How to Reproduce?

  1. Use pytorch-engine version 0.18.0.
  2. Execute the code below repeatedly; it eventually causes an OOM.

    // requires ai.djl.Device and ai.djl.ndarray.NDManager
    try (NDManager manager = NDManager.newBaseManager(Device.cpu())) {
        // do something here
    }
  3. Maven dependencies:

        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-engine</artifactId>
            <version>0.18.0</version>
            <exclusions>
                <exclusion>
                    <artifactId>jna</artifactId>
                    <groupId>net.java.dev.jna</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>net.java.dev.jna</groupId>
            <artifactId>jna</artifactId>
            <version>5.9.0</version>
        </dependency>
    
        <!--For Pre-CXX11 build -->
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-native-cpu-precxx11</artifactId>
            <classifier>linux-x86_64</classifier>
            <version>1.11.0</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-jni</artifactId>
            <version>1.11.0-0.18.0</version>
            <scope>runtime</scope>
        </dependency>
        <!-- windows -->
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-native-cpu</artifactId>
            <classifier>win-x86_64</classifier>
            <scope>runtime</scope>
            <version>1.11.0</version>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-jni</artifactId>
            <version>1.11.0-0.18.0</version>
            <scope>runtime</scope>
        </dependency>

lanking520 commented 2 years ago

Thanks for your contribution and the proposed fix. We will track this.