deephealthproject / eddl

European Distributed Deep Learning (EDDL) library. A general-purpose library initially developed to cover deep learning needs in healthcare use cases within the DeepHealth project.
https://deephealthproject.github.io/eddl/
MIT License

Memory occupation grows up #212

Closed · giobus75 closed this issue 3 years ago

giobus75 commented 3 years ago

Hi, I'm trying to train a customized version of VGG16 on a huge dataset with the Python version of my code, and the process was killed due to an OOM. To understand whether the problem was related to the Python bindings, I wrote a piece of C++ code, keeping the implementation as simple as possible, to replicate the issue. The code is here. In the loop over batches, if I comment out lines 137-139 and leave line 136 (train_batch) uncommented, everything works fine. On the contrary, if I comment out the train_batch line and keep lines 137-139 uncommented, as I do when evaluating a validation set, memory occupation keeps increasing (see the sketch below).
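
For reference, a rough sketch of the two loop variants being compared; the identifiers (net, x, y, out, indices) follow the snippets quoted later in this thread and are assumptions, not the exact code from the linked file:

// Variant kept in the training run (line 136 of the linked code): memory stays stable
train_batch(net, { x }, { y }, indices);

// Variant used for validation (lines 137-139 of the linked code): memory keeps growing
forward(net, { x });
output = getOutput(out);
cout << output->select({ to_string(0) }) << endl;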

MicheleCancilla commented 3 years ago

Hello @giobus75,

these two lines leak memory:

output = getOutput(out); 
cout << output->select({ to_string(0) }) << endl;

The Tensor::select and getOutput functions return a new Tensor*, which must be destroyed by the caller. Try changing the code to:

output = getOutput(out); 
Tensor* select_tensor = output->select({ to_string(0) });
cout << select_tensor << endl;
// free the tensors returned by getOutput and select once they are no longer needed
delete output;
delete select_tensor;
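
Applied to the per-batch evaluation loop from the original report, the pattern would look roughly like this (a sketch using the same assumed identifiers as above, not the exact code from the linked file):

for (int i = 0; i < num_batches; i++) {
    forward(net, { x });
    Tensor* output = getOutput(out);
    Tensor* select_tensor = output->select({ to_string(0) });
    cout << select_tensor << endl;
    // getOutput and Tensor::select both allocate new tensors,
    // so free them every batch to keep memory flat
    delete output;
    delete select_tensor;
}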
giobus75 commented 3 years ago

Hi @MicheleCancilla, I tried your hint (the code is here) but the problem remains. The memory also increases if I leave only the forward function call. I also tried replacing the forward API version with the low-level version (net->forward( { x } );), but the memory keeps increasing. I don't know if this is useful, but I'm also observing the following behavior: I'm using a dataset of 9984 images (256x256x3) and a batch size of 32. At the beginning of each epoch, memory starts to increase until about the 200th batch, then it stops increasing, and the growth restarts with the next epoch, and so on.
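
For clarity, the stripped-down loop that still shows the growth would be roughly as follows (a sketch; net->forward is the low-level call mentioned above, and x is the input batch tensor):

for (int i = 0; i < num_batches; i++) {
    // no train_batch, no getOutput/select -- memory still grows during the epoch
    net->forward({ x });
}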

RParedesPalacios commented 3 years ago

Hi,

if you run:

train_batch(net, { x }, { y }, indices);

instead of:

forward(net, { x });

is it then OK?

RParedesPalacios commented 3 years ago

I have tested this example, which uses the forward function:

https://github.com/deephealthproject/eddl/blob/master/examples/nn/1_mnist/9_mnist_mlp_func.cpp

and indeed the memory grows! I will check and fix it ASAP.

giobus75 commented 3 years ago

Hi,

if you run:

train_batch(net, { x }, { y }, indices);

instead of:

forward(net, { x });

is it then OK?

Yes, it is

giobus75 commented 3 years ago

I have tested this example, which uses the forward function:

https://github.com/deephealthproject/eddl/blob/master/examples/nn/1_mnist/9_mnist_mlp_func.cpp

and indeed the memory grows! I will check and fix it ASAP.

Thanks a lot, @RParedesPalacios !

RParedesPalacios commented 3 years ago

Hi, I found it. It is fixed in the develop branch, please check.

giobus75 commented 3 years ago

Hi, I found it. It is fixed in the develop branch, please check.

Hi @RParedesPalacios, no more memory growth. Thank you again.

diegobenedicto commented 3 years ago

Just to mention that I have experienced this memory leak in Multiple Sclerosis segmentation training with EDDL 0.7.1, pyEDDL 0.9.0, ECVL 0.2.3, pyECVL 0.5.1 (50 epochs). [memory usage plot]

Good to confirm it has been solved in EDDL 0.8a. [memory usage plot]