Closed: 17328-wu closed this issue 2 years ago
About 40 GB of GPU memory is required to run the training code (4× RTX 2080 Ti or 8× RTX 3080 were used in our experiments).
You can lower the batch size in the config file and reduce the learning rate accordingly to fit your GPUs.
Take configs/indoor/dehazeformer-t.json as an example and change
{
    "batch_size": 32,
    "patch_size": 256,
    "valid_mode": "test",
    "edge_decay": 0,
    "only_h_flip": false,
    "optimizer": "adamw",
    "lr": 4e-4,
    "epochs": 300,
    "eval_freq": 1
}
to
{
    "batch_size": 8,
    "patch_size": 256,
    "valid_mode": "test",
    "edge_decay": 0,
    "only_h_flip": false,
    "optimizer": "adamw",
    "lr": 1e-4,
    "epochs": 300,
    "eval_freq": 1
}
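For reference, here is a minimal sketch of how such a config might be consumed. The inline JSON mirrors the reduced settings above; the placeholder model and the optimizer wiring are assumptions for illustration, not the repository's actual train.py.

import json
import torch

# The reduced config from above, embedded inline so the sketch is self-contained;
# in the repository it would be read from configs/indoor/dehazeformer-t.json.
cfg = json.loads("""
{
    "batch_size": 8, "patch_size": 256, "valid_mode": "test",
    "edge_decay": 0, "only_h_flip": false, "optimizer": "adamw",
    "lr": 1e-4, "epochs": 300, "eval_freq": 1
}
""")

# batch_size scales activation memory roughly linearly, which is why halving
# it is the first knob to turn; the learning rate is usually reduced with it.
model = torch.nn.Linear(8, 8)  # placeholder for the DehazeFormer network
optimizer = torch.optim.AdamW(model.parameters(), lr=cfg['lr'])
print(cfg['batch_size'], optimizer.defaults['lr'])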
About 40 GB of GPU memory is required to run the training code
I modified it as you suggested, but it still fails. Is it because I don't have enough GPU memory?
What is your environment (CUDA, PyTorch, ...)?
Such an unbalanced GPU allocation does not look like a common mistake.
Can you show the complete nvidia-smi output?
Does every GPU need to have roughly the same amount of memory?
Did you run multiple experiments at the same time? The PIDs of these Python programs are not the same. PyTorch does not automatically allocate memory based on each GPU's remaining memory; the batch is generally split evenly across the GPUs.
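As an illustration of that even split, here is a minimal sketch assuming a DataParallel-style setup (an assumption about the training script, not something confirmed by it):

import torch
import torch.nn as nn

# nn.DataParallel scatters the input batch along dim 0, one equal chunk per
# visible GPU, so every card needs roughly the same free memory regardless
# of what else is already running on it.
model = nn.Linear(256, 256)

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model.cuda())   # replicates across all visible GPUs
    x = torch.randn(32, 256).cuda()         # a batch of 32 ...
    y = model(x)                            # ... becomes 8 samples per GPU on 4 cards

# To leave a busy card out entirely, restrict visibility before CUDA starts, e.g.
#   CUDA_VISIBLE_DEVICES=0,1,3 python train.py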
Does every GPU need to have roughly the same amount of memory?
Yes
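One way to verify this is to check free versus total memory per GPU from PyTorch itself; a small sketch, assuming PyTorch >= 1.10 for torch.cuda.mem_get_info:

import torch

# Quick per-GPU sanity check (roughly the same numbers nvidia-smi reports):
# if one card has much less free memory than the others, the evenly split
# batch will still run out of memory on that card.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f'GPU {i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB')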
Thanks, the program is up and running.
Thank you, the program is up and running, but it is slower now. It takes a long time to train one epoch. Is this normal?
For 4× RTX 2080 Ti: DehazeFormer-T for indoor may take 12 hours to train, and DehazeFormer-L needs 1 week. I recommend increasing the batch size as much as possible and avoiding running multiple programs on a single GPU.
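A rough way to find the largest batch size that still fits is to probe with a single forward/backward pass and back off on out-of-memory; the sketch below uses a small placeholder network, not DehazeFormer itself, so substitute the real model and patch size to get a usable number.

import torch
import torch.nn as nn

# Probe for the largest per-step batch a single GPU can hold: run one
# forward/backward pass and treat an out-of-memory RuntimeError as "too big".
def fits(model, batch, patch=256):
    try:
        x = torch.randn(batch, 3, patch, patch, device='cuda')
        model(x).mean().backward()
        return True
    except RuntimeError as e:
        if 'out of memory' not in str(e):
            raise
        return False
    finally:
        model.zero_grad(set_to_none=True)
        torch.cuda.empty_cache()

if torch.cuda.is_available():
    net = nn.Conv2d(3, 3, 3, padding=1).cuda()  # placeholder model
    for b in (32, 16, 8, 4):
        if fits(net, b):
            print('largest batch that fits:', b)
            break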
DehazeFormer-T for indoor may take 12 hours to train.
DehazeFormer-L needs 1 week.
OK, thank you for your response.
Glad to help you!
The following error occurred when running the train.py file:
RuntimeError: CUDA error: out of memory. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
After adding the code os.environ['CUDA_LAUNCH_BLOCKING'] = '1', the same error is still reported: RuntimeError: CUDA error: out of memory
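Note that CUDA_LAUNCH_BLOCKING is only a debugging aid: it makes kernel launches synchronous so the stack trace points at the real failing call, but it frees no memory, so an out-of-memory error persists until the batch size (or model) is made smaller. A minimal sketch of setting it early enough, before CUDA is initialised:

import os

# Set the variable before PyTorch touches CUDA; setting it after the first
# .cuda() call has no effect.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch  # imported only after the environment variable is set

x = torch.randn(4, 3, 256, 256)
if torch.cuda.is_available():
    x = x.cuda()  # any CUDA error now surfaces synchronously at this line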