Closed AnhPC03 closed 3 years ago
Got the same problem. I have a server with a 2080 Ti and a 2070 Super. Training does not work on multiple GPUs, and using only one GPU (either of the two) gives the same error as stated by @nguyentienanh2303.
I have the same problem.
Traceback (most recent call last):
File "train.py", line 105, in
I found the problem: TensorFlow reserves a large amount of CUDA memory. We can comment out these lines:
logger = Logger("logs")
logger.list_of_scalars_summary(tensorboard_log, batches_done)
logger.list_of_scalars_summary(evaluation_metrics, epoch)
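If you want to keep the TensorFlow-backed logger instead of commenting it out, another option is to stop TensorFlow from pre-allocating nearly the whole GPU at startup. A minimal sketch, assuming TF >= 1.14 (where this environment variable is honored); it must run before TensorFlow is imported anywhere in the process:

```python
import os

# Make TensorFlow allocate GPU memory on demand instead of reserving
# almost the whole card up front. Must be set before `import tensorflow`.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
```

With this set, PyTorch and the TF logger can share the GPU instead of TF starving the training loop.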
I tried this and it still didn't work for me. This came out:
Traceback (most recent call last):
File "train-kaist_all.py", line 161, in <module>
loss, outputs = model(imgs, targets)
File "/home/dlsj/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/dlsj/Documents/PyTorch-YOLOv3/models_mod.py", line 254, in forward
x = module(x)
File "/home/dlsj/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/dlsj/.local/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home/dlsj/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/dlsj/.local/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 83, in forward
exponential_average_factor, self.eps)
File "/home/dlsj/.local/lib/python3.6/site-packages/torch/nn/functional.py", line 1697, in batch_norm
training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 0; 7.79 GiB total capacity; 6.45 GiB already allocated; 26.75 MiB free; 191.87 MiB cached)
Does requires_grad matter? Because when I set requires_grad to False in the optimizer, it worked.
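To make the "set requires_grad to False" idea concrete: freeze the parameters you don't want trained, then build the optimizer over only the remaining trainable ones, so no gradients (or gradient memory) are kept for the frozen layers. A minimal PyTorch sketch; the small Sequential model is a stand-in for the repo's Darknet, not its actual code:

```python
import torch
import torch.nn as nn

# Stand-in model; in the repo this would be Darknet(opt.model_def)
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.Conv2d(8, 4, 3))

# Freeze the first conv layer: no gradients are computed or stored for it
for p in model[0].parameters():
    p.requires_grad = False

# Hand the optimizer only the parameters that still require gradients
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
```

This reduces memory somewhat (fewer gradient buffers and optimizer states), but note that frozen layers stop learning, so it is a trade-off, not a general OOM fix.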
I found the problem: TensorFlow reserves a large amount of CUDA memory. We can comment out
logger = Logger("logs")
# logger.list_of_scalars_summary(tensorboard_log, batches_done)
# logger.list_of_scalars_summary(evaluation_metrics, epoch)
tensorboard_log, batches_done, and evaluation_metrics are not defined.
You can try to change requires_grad.
Have you tried subdivision?

Commenting out the TensorFlow logger calls works. THX
Check dataloader. And set pin_memory to False.
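For reference, pin_memory is a DataLoader flag: when True, each batch is copied into page-locked host memory for faster host-to-GPU transfer, at the cost of extra pinned RAM. A minimal sketch of disabling it; the random TensorDataset is just a placeholder for the repo's real ListDataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; in the repo this would be ListDataset(train_path)
dataset = TensorDataset(torch.randn(16, 3, 32, 32), torch.randint(0, 2, (16,)))

# pin_memory=False avoids allocating page-locked host buffers for each batch
loader = DataLoader(dataset, batch_size=4, shuffle=True,
                    pin_memory=False, num_workers=0)
```

Note pin_memory affects host (CPU) memory, not the GPU allocation the traceback complains about, so this helps mainly when the machine is also short on RAM.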
@LinXiLuo What exactly do you mean by pin_memory? And what is the purpose of require_grad?
Hi @promach, any luck resolving this OOM issue?
How do I set requires_grad to False in the optimizer? The other methods above don't work for me.
I closed this issue due to inactivity. Feel free to reopen for further discussion.
I'm getting CUDA out of memory even though I've tried batch sizes of 1, 2, 4, 8, and 16. Can anyone help me, please?
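If OOM persists even at batch size 1, gradient accumulation often helps: run several small micro-batches and step the optimizer once, trading compute for memory (the same idea as darknet's subdivisions setting). A minimal sketch with a toy linear model standing in for YOLOv3; the shapes and hyperparameters here are illustrative only:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # toy stand-in for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

accumulation_steps = 4  # like cfg "subdivisions": 4 micro-batches = 1 effective batch
optimizer.zero_grad()
for step in range(8):
    imgs, targets = torch.randn(2, 10), torch.randn(2, 1)  # micro-batch that fits in memory
    loss = nn.functional.mse_loss(model(imgs), targets)
    (loss / accumulation_steps).backward()  # scale so accumulated grads average out
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()   # one weight update per effective batch
        optimizer.zero_grad()
```

Also make sure any validation/evaluation pass runs under `with torch.no_grad():`, otherwise the autograd graph for the whole eval forward pass is kept on the GPU.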