apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

cudaMalloc retry failed #19499

Closed · seekFire closed this issue 3 years ago

seekFire commented 3 years ago

Description

The error occurs at the logging step:

logger.info("Training: [epoch: %d, steps: %d, learning_rate: %.2e, batch_loss: %.4f, batch_time: %.2fs]"
                        % (i, step_num, trainer.learning_rate, batch_loss.mean().asscalar(), cost_batch))

Error Message

mxnet.base.MXNetError: Traceback (most recent call last):
  File "src/storage/./pooled_storage_manager.h", line 161
MXNetError: cudaMalloc retry failed: out of memory

Environment

Python 3.7, mxnet-cu102==1.7.0

Could you please help me with this issue?

szha commented 3 years ago

Hi @seekFire. Because MXNet execution is asynchronous, the OOM error likely happened earlier. I'd suggest reducing model size or batch size to make it fit in your current GPU. If you have reason to believe that the current setting should fit in your GPU memory, it would be helpful if you elaborate on that so that I can take a closer look.
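
To illustrate szha's point, here is a minimal sketch (not code from this thread; net, loss_fn, batch_data, batch_label, trainer, and batch_size are placeholders for the user's actual objects): forcing a synchronization point right after the training step makes the asynchronous engine raise the OOM at the operation that actually triggered it, rather than later in the logging call.

import mxnet as mx
from mxnet import autograd

with autograd.record():
    output = net(batch_data)             # hypothetical forward pass
    loss = loss_fn(output, batch_label)
loss.backward()
trainer.step(batch_size)

# Block until all queued GPU work completes; without this, the
# cudaMalloc failure may only surface later, e.g. at
# batch_loss.mean().asscalar() in the logging statement.
mx.nd.waitall()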

seekFire commented 3 years ago

@szha Thank you for your suggestion! When I reduce the batch size to 2 on one GPU it works OK. I'm just surprised that the batch size has to be so low when training HRNet-W18 for segmentation... BTW, when I trained the model with one GPU, the batch size could not even be set to 4, but when I trained with 4 GPUs, it worked fine with batch_size = 4. I just wonder: what's the difference between these two situations?

szha commented 3 years ago

when I trained with 4 GPUs, it worked fine with batch_size = 4

Is this a per-GPU batch size? I imagine it has to do with the input image sizes.
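
For context, a sketch of why this matters (assuming the training script splits the global batch across devices in the usual data-parallel fashion, e.g. with gluon.utils.split_and_load): with batch_size = 4 on 4 GPUs, each device holds only one image, which is consistent with the single-GPU limit observed above.

import mxnet as mx
from mxnet import gluon

ctx = [mx.gpu(i) for i in range(4)]
data = mx.nd.ones((4, 3, 512, 512))              # global batch of 4 images
shards = gluon.utils.split_and_load(data, ctx)   # one shard per device
print([s.shape for s in shards])                 # 4 shards of (1, 3, 512, 512)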

seekFire commented 3 years ago

@szha Yes, you're right. The input image size is 512×512 and each GPU has 12 GB of memory.
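
As a rough sanity check (back-of-the-envelope arithmetic only, not HRNet-specific numbers): a single float32 activation map at full 512×512 resolution is already large, and a deep segmentation network keeps many such maps alive for backpropagation, so memory use grows quickly with batch size.

# Size in MiB of one float32 activation tensor of shape (N, C, H, W).
def tensor_mib(n, c, h, w, bytes_per_elem=4):
    return n * c * h * w * bytes_per_elem / 2**20

# e.g. a hypothetical 256-channel feature map at 512x512, batch size 4:
print(tensor_mib(4, 256, 512, 512))  # 1024.0 MiB for this single tensor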

seekFire commented 3 years ago

@szha Well, I think the question above is not that important for me, as long as the model can be trained with at least one image per batch... BTW, when I run a script like the one below in the shell, it works OK:

>>> import mxnet as mx
>>> x = mx.nd.ones((2, 3, 4, 5))
>>> y = x.transpose((0, 3, 1, 2)).reshape(0, -3, -1)
>>> y.shape    # (2, 15, 4)

But the same reshape operation in my custom metric function for the segmentation task raises an error during evaluation: labels = labels.transpose((0, 2, 3, 1)).reshape(0, -3, -1).argmax(-1)

The error message is shown below: ValueError: can only specify one unknown dimension

And I don't think reshape(0, -3, -1) is ambiguous for a 4-dimensional tensor (in MXNet's reshape, 0 copies the corresponding input dimension, -3 merges the next two dimensions, and -1 infers the remainder); furthermore, the documentation of mx.nd.NDArray.reshape contains a similar example...

When I change reshape(0, -3, -1) to reshape(0, -3, 0), the error changes as below (labels has shape (1, 2, 512, 512)): ValueError: cannot reshape array of size 524288 into shape (0,newaxis,0)

What do you think is causing this?

leezu commented 3 years ago

Have you enabled the NumPy-compatible mode (use_np or set_np)?

seekFire commented 3 years ago

@leezu No, I haven't. But the type of the tensors being manipulated is mxnet.ndarray.ndarray.NDArray rather than numpy.ndarray, and the special values like 0, -1, -3 are defined for the reshape function of mxnet.ndarray.ndarray.NDArray. When I used an older version of MXNet (e.g. 1.5.0), there was no such error... So I wonder whether this is something that needs to be improved.

seekFire commented 3 years ago

@leezu I think I may have found the cause of the error: when I use the class mx.metric.CustomMetric to wrap my custom metric function, the input tensors (label & pred) of this function are automatically converted from mxnet.ndarray.ndarray.NDArray to numpy.ndarray, which produces this error. The validation is as follows, the same script as above except with numpy in place of mxnet.ndarray:

>>> import numpy as np
>>> x = np.ones((2, 3, 4, 5))
>>> y = x.transpose((0, 3, 1, 2)).reshape(0, -3, -1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: can only specify one unknown dimension

The error is the same as the one mentioned above. I think the class mx.metric.CustomMetric in the new version (1.7.0) differs from the one in older versions, because I used to wrap the same custom metric function with this class and it ran OK.
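
A possible workaround, sketched under the diagnosis above (i.e. that mx.metric.CustomMetric now passes the metric function numpy arrays): compute the reshape with explicit dimensions instead of MXNet's special codes 0/-3/-1, which NumPy does not understand. flatten_labels is a hypothetical helper, not part of the original thread.

import numpy as np

def flatten_labels(labels):
    # labels assumed NCHW, e.g. (1, 2, 512, 512) as in this thread
    n, c, h, w = labels.shape
    # NCHW -> NHWC, merge the spatial dims, argmax over channels;
    # equivalent to transpose((0, 2, 3, 1)).reshape(0, -3, -1).argmax(-1)
    # in MXNet NDArray reshape notation
    return labels.transpose((0, 2, 3, 1)).reshape(n, h * w, c).argmax(-1)

print(flatten_labels(np.zeros((1, 2, 512, 512))).shape)  # (1, 262144)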