deepjavalibrary / djl

An Engine-Agnostic Deep Learning Framework in Java
https://djl.ai
Apache License 2.0

memory leak on dataset iteration #2289

Closed enpasos closed 1 year ago

enpasos commented 1 year ago

Description

When running the FashionMnist example from the DJL docs, I see a GPU memory leak of about 503 bytes on each dataset iteration. The attached graph shows GPU memory growing with each epoch.

I see this increase even when the batch loop is reduced to the bare iteration, with no other work done inside it. The leak occurs both without and with the suggested fix https://github.com/deepjavalibrary/djl/pull/2273, which cleans up orphaned NDArrays.
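A minimal sketch of how a per-batch figure like the one above can be estimated from per-epoch memory readings. It uses plain Java and does not call DJL; the class name LeakEstimate, the method names, and the readings in main are all hypothetical and serve only as an illustration. It fits a least-squares line through the used-memory readings and divides the slope by the number of batches per epoch.

```java
import java.util.Locale;

/** Hypothetical helper: estimates a steady leak from per-epoch readings of used memory. */
final class LeakEstimate {

    private LeakEstimate() {}

    /** Least-squares slope of usedBytes over the epoch index, in bytes per epoch. */
    static double bytesPerEpoch(long[] usedBytes) {
        int n = usedBytes.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two readings");
        }
        double meanX = (n - 1) / 2.0; // mean of the epoch indices 0..n-1
        double meanY = 0;
        for (long v : usedBytes) {
            meanY += v;
        }
        meanY /= n;
        double num = 0;
        double den = 0;
        for (int i = 0; i < n; i++) {
            double dx = i - meanX;
            num += dx * (usedBytes[i] - meanY);
            den += dx * dx;
        }
        return num / den;
    }

    /** Leak per batch, assuming every epoch iterates the same number of batches. */
    static double bytesPerBatch(long[] usedBytes, int batchesPerEpoch) {
        if (batchesPerEpoch <= 0) {
            throw new IllegalArgumentException("batchesPerEpoch must be positive");
        }
        return bytesPerEpoch(usedBytes) / batchesPerEpoch;
    }

    public static void main(String[] args) {
        // Assumed values for illustration only, not measurements from this issue.
        int batchesPerEpoch = 100;
        long[] used = new long[5];
        for (int epoch = 0; epoch < used.length; epoch++) {
            used[epoch] = 1_000_000_000L + (long) epoch * 503 * batchesPerEpoch;
        }
        System.out.printf(Locale.ROOT, "%.1f bytes/batch%n", bytesPerBatch(used, batchesPerEpoch));
    }
}
```

Fitting a slope over several epochs, rather than subtracting two readings, makes the estimate less sensitive to a single noisy measurement.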

Expected Behavior

No memory leak.

How to Reproduce?

I set up a toy app based on the DJL FashionMnist example to reproduce the problem:

git clone https://github.com/enpasos/reproducebug4.git
cd reproducebug4
gradlew build
java -jar app/build/libs/app-0.0.1-SNAPSHOT.jar

To further localize the cause:

git clone https://github.com/enpasos/reproducebug4.git
cd reproducebug4
git checkout localizing_memory_leak
gradlew build
java -jar app/build/libs/app-0.0.1-SNAPSHOT.jar

What have you tried to solve it?

I have been looking for the cause but have not found it yet.

Environment Info

KexinFeng commented 1 year ago

Is it the same issue as https://github.com/deepjavalibrary/djl/issues/2210? Is it solved after applying the patch https://github.com/deepjavalibrary/djl/pull/2232?

enpasos commented 1 year ago

Is it solved after applying the patch #2232?

To check, I took your latest https://github.com/deepjavalibrary/djl/pull/2232

git clone https://github.com/KexinFeng/djl.git
cd djl
gradlew build -x test
gradlew publishToMavenLocal

and ran the reproduction code against it:

git clone https://github.com/enpasos/reproducebug4.git
cd reproducebug4
git checkout localizing_memory_leak
gradlew build
java -jar app/build/libs/app-0.0.1-SNAPSHOT.jar

but I still see the same buggy behaviour:

[main] INFO com.enpasos.bugs.Main - ###################################################
[main] INFO com.enpasos.bugs.Main - memory leak of about 503 Bytes/epoch/batch
[main] INFO com.enpasos.bugs.Main - ###################################################

enpasos commented 1 year ago

Is it the same issue as #2210?

It is in the same problem area. The impact of the bug from https://github.com/deepjavalibrary/djl/issues/2210, reproduced by

git clone https://github.com/enpasos/reproducebug2.git
cd reproducebug2
gradlew build
java -jar app/build/libs/app-0.0.1-SNAPSHOT.jar

is eliminated by the proposed solution https://github.com/deepjavalibrary/djl/pull/2273. (I am not using the word "solved" here because the suggested solution cleans up the garbage, whereas an ideal solution would not produce garbage in the first place.)

However, the behaviour reported here still appears even after applying the patch https://github.com/deepjavalibrary/djl/pull/2273.

KexinFeng commented 1 year ago

@enpasos I think I have found the possible root cause. FashionMnist extends ArrayDataset, and iteration over this dataset uses the new advanced indexing feature for efficiency, which was introduced in https://github.com/deepjavalibrary/djl/pull/1869.

The advanced indexing had a memory leak, which is now fixed in https://github.com/deepjavalibrary/djl/pull/2300. So this is the possible root cause. If you apply this patch, the memory leak should be fixed.

enpasos commented 1 year ago

Congrats on eliminating the root cause of this memory leak! Very nice :-) I ran the test case and there is no more memory leak here.