deepjavalibrary / djl

An Engine-Agnostic Deep Learning Framework in Java
https://djl.ai
Apache License 2.0
4.05k stars 648 forks

JVM memory is not released #2849

Open 201723201401012 opened 9 months ago

201723201401012 commented 9 months ago

Description

When using the OnnxRuntime engine, JVM memory is not reclaimed.

The code is below. Many users have already reported that memory is not reclaimed; I don't know why you are so confident that there has never been a problem. We have already switched to Python in production. Please take a look when you have time.

import ai.djl.Device;
import ai.djl.MalformedModelException;
import ai.djl.ModelException;
import ai.djl.inference.Predictor;
import ai.djl.modality.cv.Image;
import ai.djl.modality.cv.ImageFactory;
import ai.djl.modality.cv.output.DetectedObjects;
import ai.djl.modality.cv.transform.Resize;
import ai.djl.modality.cv.translator.YoloV5Translator;
import ai.djl.modality.cv.translator.YoloV5TranslatorFactory;
import ai.djl.repository.zoo.Criteria;
import ai.djl.repository.zoo.ModelNotFoundException;
import ai.djl.repository.zoo.ZooModel;
import ai.djl.translate.TranslateException;
import com.fpi.bmp.algorithm.management.translator.YoloV5RelativeTranslator;
import com.fpi.bmp.algorithm.management.util.ModelUrlUtil;
//import org.opencv.core.Mat;

import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * @Author: cc
 * @Date: 2023/10/26 4:49 PM
 * @Description:
 */
public class Main {
    private Predictor<Image, DetectedObjects> predictor;
    private ZooModel<Image, DetectedObjects> model;
    private Criteria<Image, DetectedObjects> criteria;

    public static void main(String[] args) throws ModelException, IOException, TranslateException {
        Main p = new Main();
        ExecutorService executorService = Executors.newFixedThreadPool(100);
        for (int i = 0; i < 300; i++) {
            executorService.execute(() -> {
                try {
                    p.detect("/file-base-server/api/v1/sys/download/10e7d36d8c03499ba904e53df39e1eb0");
                } catch (IOException e) {
                    e.printStackTrace();
                } catch (TranslateException e) {
                    e.printStackTrace();
                }
                System.out.println("iiii===" + Thread.currentThread().getName());
            });
        }
        executorService.shutdown(); // let the pool threads exit once all tasks finish
        p.detect("/file-base-server/api/v1/sys/download/10e7d36d8c03499ba904e53df39e1eb0");

        System.out.println("main=" + Thread.currentThread().getName());
    }

    public Main() throws ModelNotFoundException, MalformedModelException, IOException {
        criteria = Criteria.builder() // assign the field instead of shadowing it with a local
                .setTypes(Image.class, DetectedObjects.class)
                .optModelUrls(ModelUrlUtil.getRealUrl("/model/smoke/onnx.zip"))
                .optArgument("width", "640")
                .optArgument("height", "640")
                .optArgument("resize", "true")
                .optArgument("rescale", "true")
                .optArgument("optApplyRatio", "true")
                .optArgument("threshold", "0.4")
                .optTranslatorFactory(new YoloV5TranslatorFactory())
                .optModelName("smoke.onnx")
                .optEngine("OnnxRuntime")
                .build();
        model = criteria.loadModel();
        predictor = model.newPredictor();
    }

    public void detect(String imgPath) throws IOException, TranslateException {
        Image img = ImageFactory.getInstance().fromUrl(imgPath);
        long starTime = System.currentTimeMillis();
        try {
            DetectedObjects predict = predictor.predict(img);
            long endTime = System.currentTimeMillis();
            System.out.println(Thread.currentThread().getName() + " model inference time=" + (endTime - starTime));
            System.out.println(predict);
        } catch (Exception e) {
            e.printStackTrace(); // don't swallow inference errors silently
        } finally {
//            ((Mat) img.getWrappedImage()).release();
        }

    }
}
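For reference, DJL's `Predictor` implements `AutoCloseable` and is documented as not thread-safe, so the usual pattern is one predictor per worker task, closed deterministically when the task finishes. Below is a minimal, dependency-free sketch of that lifecycle; `FakePredictor` is a stand-in for the real `ai.djl.inference.Predictor` (which needs a loaded model), not DJL code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PredictorLifecycleSketch {
    // Stand-in for ai.djl.inference.Predictor, which also implements AutoCloseable.
    static class FakePredictor implements AutoCloseable {
        static final AtomicInteger OPEN = new AtomicInteger();
        FakePredictor() { OPEN.incrementAndGet(); }
        String predict(String input) { return "detected:" + input; }
        @Override public void close() { OPEN.decrementAndGet(); }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4); // bounded pool, not 100 threads
        for (int i = 0; i < 20; i++) {
            final int id = i;
            pool.execute(() -> {
                // One predictor per task, released deterministically via try-with-resources.
                try (FakePredictor p = new FakePredictor()) {
                    p.predict("img-" + id);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        // Every predictor was closed, so no handles are left behind.
        System.out.println("open predictors: " + FakePredictor.OPEN.get()); // prints 0
    }
}
```

With a real model, closing the predictor (and the `ZooModel` at shutdown) is what returns native resources to the engine; sharing one predictor instance across 100 threads, as in the repro code above, is outside its documented contract.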

onnx.zip

201723201401012 commented 9 months ago

The PyTorch engine does reclaim memory.

frankfliu commented 9 months ago

@201723201401012 I ran your code and I didn't see any error, no OOM, the max memory usage is about 6G on my machine.

201723201401012 commented 9 months ago

> @201723201401012 I ran your code and I didn't see any error, no OOM, the max memory usage is about 6G on my machine.

In my tests it also stays around 6G, but that 6G is never reclaimed. Under sustained stress testing memory keeps rising until OOM. Run 100 concurrent stress rounds back to back and you will see the problem.

hfwanggh commented 9 months ago

> @201723201401012 I ran your code and I didn't see any error, no OOM, the max memory usage is about 6G on my machine.

Could you please also reply to my question (it is right below this author's)? The code runs on a CentOS 8 server. I tested with an API request tool: without concurrent requests, memory fluctuation is fairly stable, but as soon as requests are concurrent, say 10 of them, memory grows by tens of GB, and it keeps growing as the concurrent rounds accumulate. A momentary spike would be fine, since the server has plenty of memory, but the problem is that memory is not released after the concurrent requests finish; after two or three rounds, usage can exceed 100 GB. Am I using PaddleOCR incorrectly somewhere? What environment do you use for development and testing, e.g. is it Spring-based, which JDK version, how is it packaged and deployed? Or is my code the problem? Thanks!

201723201401012 commented 9 months ago

Your problem is basically the same as mine: memory is not reclaimed. In my case the PyTorch engine does reclaim memory, but PyTorch performance on aarch64 is very poor, about a 10x gap.

hfwanggh commented 9 months ago

Yes, that is exactly the problem. When I send one request every few seconds, 100 in total, memory fluctuation feels fairly stable; but with concurrent requests the memory problem is very obvious. Even a handful, or a dozen or so, concurrent requests are enough to see quite clearly that memory is not released.

frankfliu commented 9 months ago
  1. I tested your code with 300,00 iterations (takes 20 minutes); no memory leak.
  2. Memory consumption is related to the number of threads; in your test you are using 100 threads.
  3. We have no control over native memory; that is up to PyTorch or OnnxRuntime, which usually use a memory pool and will not release it back to the OS. But this does not mean there is a memory leak.
  4. You should not run your test code 100 times concurrently, which actually uses 100 * 100 threads.
  5. In our tests aarch64 (c7g) is actually a bit faster than x86 (c5); most of our AWS customers have switched to c7g.
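Point 4 above can be enforced with a permit cap: even if callers fire requests freely, a `Semaphore` bounds how many predictions run at once, which in turn bounds the native memory pools the engine allocates for concurrent inference. A plain-JDK sketch of the idea (the `Thread.sleep` is a stand-in for a native inference call, and the counters exist only to make the cap observable):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedInference {
    static final int MAX_CONCURRENT = 4;                 // cap on in-flight predictions
    static final Semaphore PERMITS = new Semaphore(MAX_CONCURRENT);
    static final AtomicInteger inFlight = new AtomicInteger();
    static final AtomicInteger peak = new AtomicInteger();

    static void predict() throws InterruptedException {
        PERMITS.acquire();                               // block until a slot frees up
        try {
            int now = inFlight.incrementAndGet();
            peak.accumulateAndGet(now, Math::max);       // record observed concurrency
            Thread.sleep(5);                             // stand-in for native inference
        } finally {
            inFlight.decrementAndGet();
            PERMITS.release();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(100); // many caller threads...
        for (int i = 0; i < 300; i++) {
            pool.execute(() -> {
                try { predict(); } catch (InterruptedException ignored) { }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        // ...but at most MAX_CONCURRENT predictions ever ran at the same time.
        System.out.println("peak concurrency: " + peak.get());
    }
}
```

The caller thread count then only costs Java stacks; the expensive native working set scales with the permit count instead of the pool size.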
201723201401012 commented 9 months ago

Then I don't know. In our production deployment memory keeps increasing until the container OOMs and restarts. We tested many environments and all show this problem; maybe our systems are missing some component. But I want to point out that I am not the only one hitting this memory leak; people around me hit it too. You should take this seriously. We have already switched to Python.

frankfliu commented 9 months ago

What error do you see? When you say OOM, is it a Java heap memory error or a native memory error?

201723201401012 commented 9 months ago

> What error do you see? When you say OOM, is it a Java heap memory error or a native memory error?

java

frankfliu commented 9 months ago

If you see a Java OutOfMemoryError, it is most likely not related to DJL. DJL uses very little heap memory; the majority of memory is native memory used by the PyTorch and OnnxRuntime engines. In our DJLServing solution we set -Xmx2g.
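To tell the two cases apart, the heap can be capped and native allocations tracked with standard HotSpot flags. A config sketch (`app.jar` and `<pid>` are placeholders for your application jar and its process id):

```shell
# Cap the Java heap (DJL itself needs little) and enable native memory tracking.
java -Xmx2g -XX:NativeMemoryTracking=summary -jar app.jar

# While the process runs, inspect where native memory actually goes:
jcmd <pid> VM.native_memory summary
```

If the heap stays within -Xmx while the process RSS keeps climbing, the growth is in native memory (engine tensors, memory pools), not in objects the Java GC can reclaim.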

kkangert commented 5 months ago

I have the same problem.

frankfliu commented 5 months ago

@kkangert

Do you have a project that can reproduce this issue?

kkangert commented 5 months ago

> @kkangert
>
> Do you have a project that can reproduce this issue?

I tested with the same code as in this issue.