deepjavalibrary / djl

An Engine-Agnostic Deep Learning Framework in Java
https://djl.ai
Apache License 2.0

OOME for https://github.com/kingyuluk/RL-FlappyBird/ on Windows & Ubuntu #537

Open nimishjain15882 opened 3 years ago

nimishjain15882 commented 3 years ago

Description

I tried running https://github.com/kingyuluk/RL-FlappyBird/ on Windows and Ubuntu. On both, training throws an Out Of Memory Error after some time. Reducing the batch size doesn't help either. It appears there is a memory leak.

Expected Behavior

The training should continue.

Error Message

[pool-1-thread-2] ERROR ai.djl.ndarray.BaseNDManager - Resource close failed.
[main] ERROR com.kingyu.rlbird.ai.TrainBird - java.util.concurrent.ExecutionException: ai.djl.engine.EngineException: MXNet engine call failed: cuDNN: Check failed: e == CUDNN_STATUS_SUCCESS (8 vs. 0) : CUDNN_STATUS_EXECUTION_FAILED
Stack trace:
  File "src/operator/nn/./cudnn/cudnn_convolution-inl.h", line 155

    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at com.kingyu.rlbird.ai.TrainBird.train(TrainBird.java:113)
    at com.kingyu.rlbird.ai.TrainBird.main(TrainBird.java:65)
Caused by: ai.djl.engine.EngineException: MXNet engine call failed: cuDNN: Check failed: e == CUDNN_STATUS_SUCCESS (8 vs. 0) : CUDNN_STATUS_EXECUTION_FAILED
Stack trace:
  File "src/operator/nn/./cudnn/cudnn_convolution-inl.h", line 155

    at ai.djl.mxnet.jna.JnaUtils.checkCall(JnaUtils.java:1808)
    at ai.djl.mxnet.jna.JnaUtils.waitToRead(JnaUtils.java:459)
    at ai.djl.mxnet.engine.MxNDArray.close(MxNDArray.java:1549)
    at ai.djl.mxnet.engine.MxNDArrayIndexer.get(MxNDArrayIndexer.java:52)
    at ai.djl.ndarray.index.NDArrayIndexer.get(NDArrayIndexer.java:73)
    at ai.djl.ndarray.NDArray.get(NDArray.java:498)
    at ai.djl.ndarray.NDArray.get(NDArray.java:522)
    at com.kingyu.rlbird.rl.agent.QAgent.chooseAction(QAgent.java:58)
    at com.kingyu.rlbird.rl.agent.EpsilonGreedy.chooseAction(EpsilonGreedy.java:48)
    at com.kingyu.rlbird.game.FlappyBird.runEnvironment(FlappyBird.java:104)
    at com.kingyu.rlbird.ai.TrainBird$GeneratorCallable.call(TrainBird.java:174)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

ai.djl.engine.EngineException: MXNet engine call failed: MXNetError: cudaMalloc retry failed: out of memory
Stack trace:
  File "src/storage/./pooled_storage_manager.h", line 161

    at ai.djl.mxnet.jna.JnaUtils.checkCall(JnaUtils.java:1808)
    at ai.djl.mxnet.jna.JnaUtils.waitToRead(JnaUtils.java:459)
    at ai.djl.mxnet.engine.MxNDArray.close(MxNDArray.java:1549)
    at ai.djl.ndarray.BaseNDManager.close(BaseNDManager.java:108)
    at ai.djl.mxnet.engine.MxParameterServer$OptimizerCallback.apply(MxParameterServer.java:105)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.sun.jna.CallbackReference$DefaultCallbackProxy.invokeCallback(CallbackReference.java:520)
    at com.sun.jna.CallbackReference$DefaultCallbackProxy.callback(CallbackReference.java:551)
    at com.sun.jna.Native.invokeInt(Native Method)
    at com.sun.jna.Function.invoke(Function.java:426)
    at com.sun.jna.Function.invoke(Function.java:361)
    at com.sun.jna.Library$Handler.invoke(Library.java:265)
    at com.sun.proxy.$Proxy1.MXKVStorePushPullEx(Unknown Source)
    at ai.djl.mxnet.jna.JnaUtils.parameterStorePushPull(JnaUtils.java:734)
    at ai.djl.mxnet.engine.MxParameterServer.update(MxParameterServer.java:63)
    at ai.djl.training.ParameterServer.update(ParameterServer.java:38)
    at ai.djl.training.ParameterStore.updateAllParameters(ParameterStore.java:74)
    at ai.djl.training.Trainer.step(Trainer.java:190)
    at com.kingyu.rlbird.rl.agent.QAgent.trainBatch(QAgent.java:118)
    at com.kingyu.rlbird.rl.agent.EpsilonGreedy.trainBatch(EpsilonGreedy.java:56)
    at com.kingyu.rlbird.ai.TrainBird$TrainerCallable.call(TrainBird.java:149)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

[pool-1-thread-2] ERROR ai.djl.ndarray.BaseNDManager - Resource close failed.
ai.djl.engine.EngineException: MXNet engine call failed: MXNetError: cudaMalloc retry failed: out of memory
Stack trace:
  File "src/storage/./pooled_storage_manager.h", line 161

    at ai.djl.mxnet.jna.JnaUtils.checkCall(JnaUtils.java:1808)
    at ai.djl.mxnet.jna.JnaUtils.waitToRead(JnaUtils.java:459)
    at ai.djl.mxnet.engine.MxNDArray.close(MxNDArray.java:1549)
    at ai.djl.ndarray.BaseNDManager.close(BaseNDManager.java:108)
    at ai.djl.mxnet.engine.MxParameterServer$OptimizerCallback.apply(MxParameterServer.java:105)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.sun.jna.CallbackReference$DefaultCallbackProxy.invokeCallback(CallbackReference.java:520)
    at com.sun.jna.CallbackReference$DefaultCallbackProxy.callback(CallbackReference.java:551)
    at com.sun.jna.Native.invokeInt(Native Method)
    at com.sun.jna.Function.invoke(Function.java:426)
    at com.sun.jna.Function.invoke(Function.java:361)
    at com.sun.jna.Library$Handler.invoke(Library.java:265)
    at com.sun.proxy.$Proxy1.MXKVStorePushPullEx(Unknown Source)
    at ai.djl.mxnet.jna.JnaUtils.parameterStorePushPull(JnaUtils.java:734)
    at ai.djl.mxnet.engine.MxParameterServer.update(MxParameterServer.java:63)
    at ai.djl.training.ParameterServer.update(ParameterServer.java:38)
    at ai.djl.training.ParameterStore.updateAllParameters(ParameterStore.java:74)
    at ai.djl.training.Trainer.step(Trainer.java:190)
    at com.kingyu.rlbird.rl.agent.QAgent.trainBatch(QAgent.java:118)
    at com.kingyu.rlbird.rl.agent.EpsilonGreedy.trainBatch(EpsilonGreedy.java:56)
    at com.kingyu.rlbird.ai.TrainBird$TrainerCallable.call(TrainBird.java:149)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

[pool-1-thread-2] ERROR ai.djl.ndarray.BaseNDManager - Resource close failed.
ai.djl.engine.EngineException: MXNet engine call failed: Name: Check failed: err == cudaSuccess (2 vs. 0) : mxnet_generic_kernel ErrStr:out of memory
Stack trace:
  File "src/operator/././../common/../operator/mxnet_op.h", line 1121

    at ai.djl.mxnet.jna.JnaUtils.checkCall(JnaUtils.java:1808)
    at ai.djl.mxnet.jna.JnaUtils.waitToRead(JnaUtils.java:459)
    at ai.djl.mxnet.engine.MxNDArray.close(MxNDArray.java:1549)
    at ai.djl.ndarray.BaseNDManager.close(BaseNDManager.java:108)
    at ai.djl.mxnet.engine.MxParameterServer$OptimizerCallback.apply(MxParameterServer.java:105)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.sun.jna.CallbackReference$DefaultCallbackProxy.invokeCallback(CallbackReference.java:520)
    at com.sun.jna.CallbackReference$DefaultCallbackProxy.callback(CallbackReference.java:551)
    at com.sun.jna.Native.invokeInt(Native Method)
    at com.sun.jna.Function.invoke(Function.java:426)
    at com.sun.jna.Function.invoke(Function.java:361)
    at com.sun.jna.Library$Handler.invoke(Library.java:265)
    at com.sun.proxy.$Proxy1.MXKVStorePushPullEx(Unknown Source)
    at ai.djl.mxnet.jna.JnaUtils.parameterStorePushPull(JnaUtils.java:734)
    at ai.djl.mxnet.engine.MxParameterServer.update(MxParameterServer.java:63)
    at ai.djl.training.ParameterServer.update(ParameterServer.java:38)
    at ai.djl.training.ParameterStore.updateAllParameters(ParameterStore.java:74)
    at ai.djl.training.Trainer.step(Trainer.java:190)
    at com.kingyu.rlbird.rl.agent.QAgent.trainBatch(QAgent.java:118)
    at com.kingyu.rlbird.rl.agent.EpsilonGreedy.trainBatch(EpsilonGreedy.java:56)
    at com.kingyu.rlbird.ai.TrainBird$TrainerCallable.call(TrainBird.java:149)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

How to Reproduce?

(If you developed your own code, please provide a short script that reproduces the error. For existing examples, please provide link.)

Steps to reproduce

(Paste the commands you ran that produced the error.)

  1. Downloaded and ran the project without any changes.

What have you tried to solve it?

Environment Info

Please run the command ./gradlew debugEnv from the root directory of DJL (if necessary, clone DJL first). It will output information about your system, environment, and installation that can help us debug your issue. Paste the output of the command below:

> Task :integration:debugEnv
[DEBUG] - Using cache dir: /home/richi/.djl.ai/mxnet
[DEBUG] - Loading mxnet library from: /home/richi/.djl.ai/mxnet/1.7.0-backport-cu101mkl-linux-x86_64/libmxnet.so
[DEBUG] - Engine loaded from provider: MXNet
[DEBUG] - Found default engine: MXNet
----------- System Properties -----------
sun.cpu.isalist: 
sun.desktop: gnome
sun.io.unicode.encoding: UnicodeLittle
sun.cpu.endian: little
java.vendor.url.bug: http://bugreport.sun.com/bugreport/
file.separator: /
java.vendor: Private Build
sun.boot.class.path: /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/resources.jar:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/rt.jar:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/jsse.jar:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/jce.jar:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/charsets.jar:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/jfr.jar:/usr/lib/jvm/java-8-openjdk-amd64/jre/classes
java.ext.dirs: /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext:/usr/java/packages/lib/ext
java.version: 1.8.0_275
java.vm.info: mixed mode
awt.toolkit: sun.awt.X11.XToolkit
user.language: en
java.specification.vendor: Oracle Corporation
sun.java.command: ai.djl.integration.util.DebugEnvironment
java.home: /usr/lib/jvm/java-8-openjdk-amd64/jre
sun.arch.data.model: 64
java.vm.specification.version: 1.8
java.class.path: /home/richi/Desktop/djl/integration/build/classes/java/main:/home/richi/Desktop/djl/integration/build/resources/main:/home/richi/.gradle/caches/modules-2/files-2.1/commons-cli/commons-cli/1.4/c51c00206bb913cd8612b24abd9fa98ae89719b1/commons-cli-1.4.jar:/home/richi/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-slf4j-impl/2.13.3/7cca27a921a18645139cf651c04b83b1a19cfd76/log4j-slf4j-impl-2.13.3.jar:/home/richi/Desktop/djl/basicdataset/build/libs/basicdataset-0.10.0-SNAPSHOT.jar:/home/richi/Desktop/djl/model-zoo/build/libs/model-zoo-0.10.0-SNAPSHOT.jar:/home/richi/Desktop/djl/testing/build/libs/testing-0.10.0-SNAPSHOT.jar:/home/richi/.gradle/caches/modules-2/files-2.1/org.testng/testng/7.1.0/b0bcea778fb2899aeb4014c558babea8833d180a/testng-7.1.0.jar:/home/richi/Desktop/djl/mxnet/mxnet-model-zoo/build/libs/mxnet-model-zoo-0.10.0-SNAPSHOT.jar:/home/richi/Desktop/djl/mxnet/mxnet-engine/build/libs/mxnet-engine-0.10.0-SNAPSHOT.jar:/home/richi/.gradle/caches/modules-2/files-2.1/ai.djl.mxnet/mxnet-native-auto/1.7.0-backport/ee5b368ef94c1fcec4ade4a6edacffb420fefce7/mxnet-native-auto-1.7.0-backport.jar:/home/richi/Desktop/djl/api/build/libs/api-0.10.0-SNAPSHOT.jar:/home/richi/.gradle/caches/modules-2/files-2.1/org.slf4j/slf4j-api/1.7.30/b5a4b6d16ab13e34a88fae84c35cd5d68cac922c/slf4j-api-1.7.30.jar:/home/richi/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-core/2.13.3/4e857439fc4fe974d212adaaaa3b118b8b50e3ec/log4j-core-2.13.3.jar:/home/richi/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-api/2.13.3/ec1508160b93d274b1add34419b897bae84c6ca9/log4j-api-2.13.3.jar:/home/richi/.gradle/caches/modules-2/files-2.1/org.apache.commons/commons-csv/1.8/37ca9a9aa2d4be2599e55506a6d3170dd7a3df4/commons-csv-1.8.jar:/home/richi/.gradle/caches/modules-2/files-2.1/com.beust/jcommander/1.72/6375e521c1e11d6563d4f25a07ce124ccf8cd171/jcommander-1.72.jar:/home/richi/.gradle/caches/modules-2/files-2.1/com.google.inject/guice/4.1.0/fa
f9ee8ac09eafd1128091426dd367a8c0085d55/guice-4.1.0-no_aop.jar:/home/richi/.gradle/caches/modules-2/files-2.1/org.yaml/snakeyaml/1.21/18775fdda48574784f40b47bf478ab0593f92e4d/snakeyaml-1.21.jar:/home/richi/.gradle/caches/modules-2/files-2.1/com.google.code.gson/gson/2.8.6/9180733b7df8542621dc12e21e87557e8c99b8cb/gson-2.8.6.jar:/home/richi/.gradle/caches/modules-2/files-2.1/net.java.dev.jna/jna/5.3.0/4654d1da02e4173ba7b64f7166378847db55448a/jna-5.3.0.jar:/home/richi/.gradle/caches/modules-2/files-2.1/org.apache.commons/commons-compress/1.20/b8df472b31e1f17c232d2ad78ceb1c84e00c641b/commons-compress-1.20.jar:/home/richi/.gradle/caches/modules-2/files-2.1/javax.inject/javax.inject/1/6975da39a7040257bd51d21a231b76c915872d38/javax.inject-1.jar:/home/richi/.gradle/caches/modules-2/files-2.1/aopalliance/aopalliance/1.0/235ba8b489512805ac13a8f9ea77a1ca5ebe3e8/aopalliance-1.0.jar:/home/richi/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/19.0/6ce200f6b23222af3d8abb6b6459e6c44f4bb0e9/guava-19.0.jar
user.name: richi
ai.djl.logging.level: debug
file.encoding: UTF-8
java.specification.version: 1.8
java.awt.printerjob: sun.print.PSPrinterJob
user.timezone: Asia/Kolkata
user.home: /home/richi
library.jansi.path: /home/richi/.gradle/native/jansi/1.18/linux64
os.version: 5.8.0-38-generic
sun.management.compiler: HotSpot 64-Bit Tiered Compilers
java.specification.name: Java Platform API Specification
java.class.version: 52.0
org.gradle.internal.http.connectionTimeout: 60000
java.library.path: /usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
jnidispatch.path: /home/richi/.cache/JNA/temp/jna2886673811441307038.tmp
org.gradle.internal.publish.checksums.insecure: true
sun.jnu.encoding: UTF-8
os.name: Linux
user.variant: 
java.vm.specification.vendor: Oracle Corporation
org.gradle.appname: gradlew
java.io.tmpdir: /tmp
line.separator: 

java.endorsed.dirs: /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/endorsed
os.arch: amd64
java.awt.graphicsenv: sun.awt.X11GraphicsEnvironment
java.runtime.version: 1.8.0_275-8u275-b01-0ubuntu1~20.04-b01
java.vm.specification.name: Java Virtual Machine Specification
user.dir: /home/richi/Desktop/djl/integration
org.gradle.internal.http.socketTimeout: 120000
user.country: IN
sun.java.launcher: SUN_STANDARD
sun.os.patch.level: unknown
jna.loaded: true
java.vm.name: OpenJDK 64-Bit Server VM
file.encoding.pkg: sun.io
path.separator: :
java.vm.vendor: Private Build
java.vendor.url: http://java.oracle.com/
sun.boot.library.path: /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64
java.vm.version: 25.275-b01
jna.platform.library.path: /usr/lib/x86_64-linux-gnu:/lib/x86_64-linux-gnu:/usr/lib64:/lib64:/usr/lib:/lib:/lib/i386-linux-gnu:/opt/amdgpu/lib/x86_64-linux-gnu:/opt/amdgpu/lib/i386-linux-gnu:/usr/lib/x86_64-linux-gnu/libfakeroot
java.runtime.name: OpenJDK Runtime Environment

--------- Environment Variables ---------
PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
XAUTHORITY: /run/user/1000/gdm/Xauthority
INVOCATION_ID: 61a8c96788754f71b5ec13ee145b2e2a
XMODIFIERS: @im=ibus
XDG_DATA_DIRS: /usr/share/ubuntu:/usr/local/share/:/usr/share/:/var/lib/snapd/desktop
GDMSESSION: ubuntu
GTK_IM_MODULE: ibus
DBUS_SESSION_BUS_ADDRESS: unix:path=/run/user/1000/bus
XDG_CURRENT_DESKTOP: ubuntu:GNOME
JOURNAL_STREAM: 8:31621
SSH_AGENT_PID: 1471
COLORTERM: truecolor
QT4_IM_MODULE: ibus
SESSION_MANAGER: local/richi-HP-Pavilion-Gaming-Laptop-15-ec0xxx:@/tmp/.ICE-unix/1622,unix/richi-HP-Pavilion-Gaming-Laptop-15-ec0xxx:/tmp/.ICE-unix/1622
USERNAME: richi
LOGNAME: richi
PWD: /home/richi/Desktop/djl
MANAGERPID: 1151
IM_CONFIG_PHASE: 1
LANGUAGE: en_IN:en
GJS_DEBUG_TOPICS: JS ERROR;JS LOG
SHELL: /bin/bash
LESSOPEN: | /usr/bin/lesspipe %s
OLDPWD: /home/richi/Desktop/djl
GNOME_DESKTOP_SESSION_ID: this-is-deprecated
GNOME_TERMINAL_SCREEN: /org/gnome/Terminal/screen/624b5b21_f44d_4922_9955_d4430f268a24
GTK_MODULES: gail:atk-bridge
CLUTTER_IM_MODULE: ibus
LS_COLORS: rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:
XDG_SESSION_DESKTOP: ubuntu
SHLVL: 1
LESSCLOSE: /usr/bin/lesspipe %s %s
QT_IM_MODULE: ibus
TERM: xterm-256color
XDG_CONFIG_DIRS: /etc/xdg/xdg-ubuntu:/etc/xdg
GNOME_TERMINAL_SERVICE: :1.143
LANG: en_IN
XDG_SESSION_TYPE: x11
DISPLAY: :0
XDG_SESSION_CLASS: user
_: ./gradlew
GPG_AGENT_INFO: /run/user/1000/gnupg/S.gpg-agent:0:1
DESKTOP_SESSION: ubuntu
USER: richi
XDG_MENU_PREFIX: gnome-
VTE_VERSION: 6003
QT_ACCESSIBILITY: 1
WINDOWPATH: 2
GJS_DEBUG_OUTPUT: stderr
SSH_AUTH_SOCK: /run/user/1000/keyring/ssh
GNOME_SHELL_SESSION_MODE: ubuntu
XDG_RUNTIME_DIR: /run/user/1000
HOME: /home/richi

-------------- Directories --------------
temp directory: /tmp
Engine cache directory: /home/richi/.djl.ai

------------------ CUDA -----------------
GPU Count: 1
Default Device: gpu(0)
CUDA: 101
ARCH: 75
GPU(0) memory used: 836501504 bytes

----------------- Engines ---------------
Default Engine: MXNet
[DEBUG] - Using cache dir: /home/richi/.djl.ai/mxnet
MXNet:1.7.0, capabilities: [
        CUDNN,
        SIGNAL_HANDLER,
        LAPACK,
        CPU_SSE2,
        CPU_SSE3,
        OPENCV,
        CUDA,
        CPU_SSE,
        CPU_AVX,
        F16C,
        BLAS_OPEN,
        NCCL,
        CPU_SSE4_2,
        DIST_KVSTORE,
        CUDA_RTC,
        CPU_SSE4_1,
        OPENMP,
        MKLDNN,
]
MXNet Library: /home/richi/.djl.ai/mxnet/1.7.0-backport-cu101mkl-linux-x86_64/libmxnet.so

--------------- Hardware --------------
Available processors (cores): 8
Byte Order: LITTLE_ENDIAN
Free memory (bytes): 82435008
Maximum memory (bytes): 1370488832
Total memory available to JVM (bytes): 92798976
Heap committed: 92798976
Heap nonCommitted: 21364736
GCC: 
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Deprecated Gradle features were used in this build, making it incompatible with Gradle 7.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/6.7.1/userguide/command_line_interface.html#sec:command_line_warnings

nimishjain15882 commented 3 years ago

I have pasted the logs for Ubuntu 20.04 above for now, but I face the same issue on Windows as well. If it is fixed for Ubuntu, Windows will probably work too.

stu1130 commented 3 years ago

@nimishjain15882 Could you set MXNET_GPU_MEM_POOL_RESERVE to 10 or 20? This variable tells MXNet what percentage of GPU memory to reserve for CUDA kernel launches and cuDNN handle space, which in this case might resolve CUDNN_STATUS_EXECUTION_FAILED.
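MXNET_GPU_MEM_POOL_RESERVE is an environment variable, so it has to be present in the environment of the process that launches training (for example, exported in the shell before starting the JVM). A minimal sketch of a startup check is below; the class name `MemPoolReserve` and the 0 to 100 validity range are assumptions made for illustration and are not part of DJL or RL-FlappyBird.

```java
import java.util.Map;

// Hypothetical helper (not part of DJL or MXNet): reports whether the
// MXNET_GPU_MEM_POOL_RESERVE variable is visible to this JVM.
public class MemPoolReserve {
    public static final String VAR = "MXNET_GPU_MEM_POOL_RESERVE";

    /**
     * Returns the reserve value from the given environment map,
     * or -1 if the variable is unset or not an integer in 0..100
     * (the range is an assumption, since the value is a percentage).
     */
    public static int reservePercent(Map<String, String> env) {
        String raw = env.get(VAR);
        if (raw == null) {
            return -1;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return (value >= 0 && value <= 100) ? value : -1;
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        int percent = reservePercent(System.getenv());
        if (percent < 0) {
            System.out.println(VAR + " is not set (or not a valid percentage)");
        } else {
            System.out.println(VAR + " = " + percent);
        }
    }
}
```

Running this check from the same shell that starts training is a quick way to confirm the setting actually reached the JVM before drawing conclusions from the next run.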

goswamig commented 3 years ago

assign it to me

iromu commented 3 years ago

I can confirm that:

The latest master version (8ee0b8b) works with no issues and no changes on Ubuntu with an RTX 2080 Ti (11GB), with memory usage of 5% and GPU utilization of around 60%. Tested with the FPS limitation removed.

On Windows there is a memory leak: training hits OOM before reaching 15,000 iterations. GPU usage is reported at 16%, and RAM consumption keeps increasing no matter what changes are made to the code.
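One way to narrow down where the growth comes from is to log the JVM heap alongside the iteration count. If the heap stays roughly flat while the process's overall RAM keeps climbing, the growth is outside the Java heap. The sketch below uses only the standard `java.lang.management` API; the class name `HeapSampler`, the sampling interval, and the placeholder loop are assumptions and are not taken from RL-FlappyBird.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.Locale;

// Hypothetical diagnostic helper: samples JVM heap usage during training.
public class HeapSampler {

    /** Returns the number of bytes currently used on the JVM heap. */
    public static long usedHeapBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    /** Formats one sample as a log line, with heap usage in MiB. */
    public static String formatSample(long iteration, long usedBytes) {
        double mib = usedBytes / (1024.0 * 1024.0);
        return String.format(Locale.ROOT, "iteration=%d heapUsedMiB=%.1f", iteration, mib);
    }

    public static void main(String[] args) {
        // Placeholder loop standing in for the real training loop.
        for (long i = 0; i <= 3000; i++) {
            // ... one training step would run here ...
            if (i % 1000 == 0) {
                System.out.println(formatSample(i, usedHeapBytes()));
            }
        }
    }
}
```

Comparing these samples with the RAM figure reported by the operating system for the same process gives a rough split between heap and off-heap growth.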

This points to an issue on the backend native implementations rather than:

Things to try: