IDKiro / DehazeFormer

[IEEE TIP] Vision Transformers for Single Image Dehazing
MIT License

RuntimeError: CUDA out of memory. #7

Closed ArraryChen closed 2 years ago

ArraryChen commented 2 years ago

When I tried to train on RESIDE-6k, I ran into a problem.

"python train.py --model dehazeformer-b --dataset RESIDE-6k --exp reside6k"

```
/usr/local/lib/python3.7/dist-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2228.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:490: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
==> Start training, current model name: dehazeformer-b
  0% 0/1001 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 127, in <module>
    loss = train(train_loader, network, criterion, optimizer, scaler)
  File "train.py", line 44, in train
    output = network(source_img)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/drive/MyDrive/DehazeFormer-main/models/dehazeformer.py", line 486, in forward
    feat = self.forward_features(x)
  File "/content/drive/MyDrive/DehazeFormer-main/models/dehazeformer.py", line 466, in forward_features
    x = self.layer2(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/drive/MyDrive/DehazeFormer-main/models/dehazeformer.py", line 308, in forward
    x = blk(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/drive/MyDrive/DehazeFormer-main/models/dehazeformer.py", line 266, in forward
    x = identity + x
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 14.76 GiB total capacity; 12.17 GiB already allocated; 10.75 MiB free; 12.29 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

I tried reducing the batch size to 4; however, it gets stuck at

"==> Start training, current model name: dehazeformer-b 0% 0/1001 [00:04<?, ?it/s]"

Can somebody help me? It would be appreciated!
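
For anyone hitting the same thing: the error text itself points at two knobs worth checking first, the allocated-versus-reserved gap and the `max_split_size_mb` allocator option. Below is a minimal sketch of how one might check both; the 128 MiB split size is only an illustrative value, not a tuned recommendation.

```python
import os

# PYTORCH_CUDA_ALLOC_CONF has to be set before CUDA is initialized, i.e.
# before the first CUDA call in train.py. 128 MiB is an illustrative value.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch

if torch.cuda.is_available():
    # Compare allocated vs. reserved memory, as the error message suggests.
    print(torch.cuda.get_device_name(0))
    print(f"allocated: {torch.cuda.memory_allocated(0) / 2**30:.2f} GiB")
    print(f"reserved:  {torch.cuda.memory_reserved(0) / 2**30:.2f} GiB")
```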

IDKiro commented 2 years ago

What GPU are you using? My guess is that it is not powerful enough.

DehazeFormer-B with batch size 4 may take about 8 days to train on a single RTX 2080 Ti, which works out to roughly 12 minutes per epoch.

Maybe you can try training DehazeFormer-T first. If the problem persists, please let me know.
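
For a sense of the size difference between the two variants, one can count parameters directly. A rough sketch, assuming `models/dehazeformer.py` exposes `dehazeformer_t()` and `dehazeformer_b()` builders matching the `--model` names with `-` replaced by `_`:

```python
import torch

# Assumes these builders exist in models/dehazeformer.py, matching the
# --model names with '-' replaced by '_'.
from models.dehazeformer import dehazeformer_t, dehazeformer_b

def count_params(model: torch.nn.Module) -> float:
    """Number of trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

for name, build in [("dehazeformer-t", dehazeformer_t), ("dehazeformer-b", dehazeformer_b)]:
    print(f"{name}: {count_params(build()):.2f} M parameters")
```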

ArraryChen commented 2 years ago

After changing the batch size to 4, it works. However, it is very slow: one epoch takes more than an hour.

I am training on Google Colab with a Tesla T4 GPU.


Is there a way to make training faster?

IDKiro commented 2 years ago

I have not used Colab, but the Tesla T4 (Turing, 75 W, 8.1 TFLOPS) is a low-power GPU with noticeably weaker performance than the RTX 2080 Ti. Based on the information available so far, I think it is likely that the hardware performance is simply insufficient.
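
One way to check whether the T4 itself is the bottleneck (rather than, say, data loading) is to time a bare forward/backward pass and extrapolate to an epoch. A rough sketch under the same assumptions as above (assumed `dehazeformer_b` builder, batch size 4, 256x256 crops, and a model output of the same shape as its input):

```python
import time
import torch

from models.dehazeformer import dehazeformer_b  # assumed builder, as above

device = torch.device("cuda")
model = dehazeformer_b().to(device)
x = torch.randn(4, 3, 256, 256, device=device)   # batch size 4, 256x256 crops
target = torch.randn_like(x)
criterion = torch.nn.L1Loss()

# One warm-up iteration so kernel launches/caching don't skew the timing.
criterion(model(x), target).backward()
torch.cuda.synchronize()

start = time.time()
for _ in range(10):
    model.zero_grad(set_to_none=True)
    criterion(model(x), target).backward()
torch.cuda.synchronize()

per_iter = (time.time() - start) / 10
print(f"{per_iter * 1000:.0f} ms per forward+backward (no data loading)")
```

If an epoch takes far longer than (iterations per epoch) times this number, the GPU is probably not the main problem; if it roughly matches, the hardware really is the limit.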

IDKiro commented 2 years ago
> UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.

I see this. It seems that Colab only provides very limited CPU resources (4 threads). You can decrease num_workers in train.py (16 -> 4 or fewer):

parser.add_argument('--num_workers', default=4, type=int, help='number of workers')
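
A slightly more defensive variant clamps the worker count to whatever the machine actually has, so the same script behaves on Colab and on a bigger box. A small self-contained sketch (the dummy tensors are placeholders; in train.py the loader wraps the RESIDE pair dataset instead):

```python
import argparse
import os

import torch
from torch.utils.data import DataLoader, TensorDataset

parser = argparse.ArgumentParser()
parser.add_argument('--num_workers', default=16, type=int, help='number of workers')
args = parser.parse_args()

# Clamp the requested worker count to the CPU threads actually available
# (Colab typically exposes only 4) instead of trusting the default of 16.
num_workers = min(args.num_workers, os.cpu_count() or 1)

# Dummy data just to keep the snippet self-contained.
dataset = TensorDataset(torch.randn(32, 3, 256, 256), torch.randn(32, 3, 256, 256))
loader = DataLoader(dataset, batch_size=4, shuffle=True,
                    num_workers=num_workers, pin_memory=True)
print(f"using {num_workers} dataloader workers")
```
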
ArraryChen commented 2 years ago

Thanks!