awslabs / keras-apache-mxnet

[DEPRECATED] Amazon Deep Learning's Keras with Apache MXNet support
https://github.com/awslabs/keras-apache-mxnet/wiki

Possible Memory Leak #195

Open Cpruce opened 6 years ago

Cpruce commented 6 years ago


Please see:

https://discuss.mxnet.io/t/possible-memory-leak/1973

roywei commented 6 years ago

@Cpruce Thanks for the issue. I am looking into this; it is possibly caused by the use of the foreach operator.

Cpruce commented 6 years ago

@roywei Thanks for looking into this

roywei commented 6 years ago

@Cpruce I was able to narrow the memory leak down to the validation stage that runs after each epoch. For now, removing validation from model.fit() resolves it, and running model.evaluate(test_data, test_label) once at the end of training works fine as a substitute. We use the bucketing module in keras-mxnet, so switching buckets between training and validation may be what triggers the leak in the foreach operator. I need to take another look at that.
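One way to check whether memory grows from epoch to epoch, with and without validation, is to sample the process's peak resident memory at the end of every epoch. The sketch below is only an illustration and is not part of keras-mxnet: `peak_rss_mb` and `EpochMemoryLog` are hypothetical names, it uses only the Python standard library, and it assumes a Unix-like system where the `resource` module is available. It measures host memory only, so it would not show growth in GPU memory.

```python
import resource
import sys


def peak_rss_mb():
    """Return this process's peak resident set size in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


class EpochMemoryLog:
    """Collects one peak-RSS sample per epoch and reports the growth."""

    def __init__(self):
        self.samples = []

    def record(self, epoch, logs=None):
        # The (epoch, logs) signature matches what Keras passes to an
        # on_epoch_end hook, so this method can be used as that hook.
        self.samples.append((epoch, peak_rss_mb()))

    def growth_mb(self):
        """Difference between the last and first sample, in megabytes."""
        if len(self.samples) < 2:
            return 0.0
        return self.samples[-1][1] - self.samples[0][1]
```

To use it, create `log = EpochMemoryLog()` and add `keras.callbacks.LambdaCallback(on_epoch_end=log.record)` to the `callbacks` list passed to model.fit(). Because the peak value never decreases, a leak shows up as `growth_mb()` rising steadily with the number of epochs rather than levelling off after the first few.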

Cpruce commented 6 years ago

@roywei awesome thanks I'll try it out soon 👍

roywei commented 6 years ago

For now, removing the validation dataset resolves the memory leak. Use the following call for training:

history = model1.fit(x_train, y_train,
                    epochs=epochs,
                    batch_size=batch_size,
                    callbacks=[reduce_lr],
                    verbose=2)

I still need to investigate how to re-enable the validation stage.
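The full workaround described in this thread (train without validation, then evaluate once at the end) could be wrapped roughly as follows. This is a sketch under assumptions: `fit_then_evaluate` is a hypothetical helper, `x_test` and `y_test` stand for whatever held-out data you have, and `model` is any object exposing the Keras `fit` and `evaluate` methods.

```python
def fit_then_evaluate(model, x_train, y_train, x_test, y_test, **fit_kwargs):
    """Train without per-epoch validation, then evaluate once at the end."""
    # Drop any validation arguments so fit() never enters the validation stage.
    for key in ("validation_data", "validation_split"):
        fit_kwargs.pop(key, None)
    history = model.fit(x_train, y_train, **fit_kwargs)
    # A single evaluation pass after training replaces per-epoch validation.
    scores = model.evaluate(x_test, y_test, verbose=0)
    return history, scores
```

For example, `fit_then_evaluate(model1, x_train, y_train, x_test, y_test, epochs=epochs, batch_size=batch_size, callbacks=[reduce_lr], verbose=2)` mirrors the call above. Note that a callback monitoring a validation metric such as val_loss would have nothing to monitor in this setup, so it may need to watch a training metric instead.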

julioasotodv commented 6 years ago

I can confirm that the memory leak happens with mxnet-mkl 1.13.1 under Linux when running imdb_bidirectional_lstm.py from the examples folder (which includes a validation set).

MandarGogate commented 6 years ago

There is no memory leak when mxnet-cu90mkl==1.2.1 is used. However, mxnet-cu90mkl==1.3.1 throws an error when validation data is used.