deepjavalibrary / djl

An Engine-Agnostic Deep Learning Framework in Java
https://djl.ai
Apache License 2.0

Huge committed memory consumption in TrainMnist.java example with 1.8.0-mkl-win-x86_64 #971

Closed: blaze79 closed this issue 2 years ago

blaze79 commented 3 years ago

The TrainMnist.java example (0.12.0-SNAPSHOT) on Windows 10 always fails with a "not enough memory for the Java Runtime" exception. Reproduced on Java 8 (build 1.8.0_291-b10) and Java 11 (build 11.0.10+8).

With 16 GB RAM and no swap: java.exe consumed ~1.3 GB, 8.5 GB free; failed with the exception at 21% of the first training epoch, committed memory ~8 GB.

With 16 GB RAM + 16 GB swap: java.exe consumed ~2 GB, 9 GB RAM free; failed with the exception at 72% of the first training epoch, committed memory ~16 GB.

Changing the library to 1.8.0-cu102mkl-win-x86_64 works fine: ~3 GB memory committed, no problems.

Error file attached: hs_err_pid4020.log
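
For context, the failing example is essentially DJL's standard MNIST training loop. The sketch below is a minimal, hypothetical reconstruction of what TrainMnist.java does (class and package names follow the 0.12.0-era API; exact paths and defaults may differ from the repository's example), intended only to show what workload triggers the memory growth:

```java
import ai.djl.Model;
import ai.djl.basicdataset.cv.classification.Mnist;
import ai.djl.basicmodelzoo.basic.Mlp;
import ai.djl.ndarray.types.Shape;
import ai.djl.training.DefaultTrainingConfig;
import ai.djl.training.EasyTrain;
import ai.djl.training.Trainer;
import ai.djl.training.evaluator.Accuracy;
import ai.djl.training.listener.TrainingListener;
import ai.djl.training.loss.Loss;

public class TrainMnistSketch {

    public static void main(String[] args) throws Exception {
        // MNIST dataset with a batch size of 32, shuffled each epoch
        Mnist mnist = Mnist.builder().setSampling(32, true).build();
        mnist.prepare();

        try (Model model = Model.newInstance("mlp")) {
            // Simple multilayer perceptron: 784 inputs, 10 classes, two hidden layers
            model.setBlock(new Mlp(28 * 28, 10, new int[] {128, 64}));

            DefaultTrainingConfig config =
                    new DefaultTrainingConfig(Loss.softmaxCrossEntropyLoss())
                            .addEvaluator(new Accuracy())
                            .addTrainingListeners(TrainingListener.Defaults.logging());

            try (Trainer trainer = model.newTrainer(config)) {
                trainer.initialize(new Shape(1, 28 * 28));
                // Train for two epochs; the reported failure happens partway
                // through the first epoch on the Windows MKL build of MXNet
                EasyTrain.fit(trainer, 2, mnist, null);
            }
        }
    }
}
```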

lanking520 commented 3 years ago

We have also observed something similar with the CPU build of MXNet; it is a known bug that memory grows during training. Please try the Windows Subsystem for Linux and use the DJL Linux package for MXNet; you will see much lower memory consumption.

The GPU build should be fine since it is compiled differently.
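
If you switch native packages (for example, to the Linux MKL build under WSL or to the CUDA build), a quick way to confirm which engine build DJL actually resolved at runtime is to query the Engine API. This is an illustrative check, not something from the original thread:

```java
import ai.djl.engine.Engine;

public class EngineCheck {

    public static void main(String[] args) {
        Engine engine = Engine.getInstance();
        // Prints the active engine (e.g. "MXNet"), the native library version
        // that was loaded, and how many GPUs DJL can see on this machine
        System.out.println("Engine:  " + engine.getEngineName());
        System.out.println("Version: " + engine.getVersion());
        System.out.println("GPUs:    " + engine.getGpuCount());
    }
}
```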

frankfliu commented 2 years ago

Feel free to re-open if you still have questions.