IDKiro / DehazeFormer

[IEEE TIP] Vision Transformers for Single Image Dehazing
MIT License

An error occurred running train.py, how can this be resolved? #6

Closed 17328-wu closed 2 years ago

17328-wu commented 2 years ago

The following error occurred running the train.py file:

RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

After adding the line "os.environ['CUDA_LAUNCH_BLOCKING'] = '1'", the same error is still reported: RuntimeError: CUDA error: out of memory
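
For reference, CUDA_LAUNCH_BLOCKING only takes effect if it is set before the CUDA context is created, so it has to go at the very top of train.py (or be exported in the shell before launching). A minimal sketch of the safe pattern, not taken from the repo:

# Set the variable before anything touches the GPU, so kernel launches run
# synchronously and the stack trace points at the real failing call.
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch  # import torch (and call .cuda()) only after the variable is set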

IDKiro commented 2 years ago

40 GB of GPU memory is required to run the training code (4x RTX 2080 Ti or 8x RTX 3080 were used in our experiments).

You can lower the batch size in the config file and reduce the learning rate to match your GPU.

Take configs/indoor/dehazeformer-t.json for example:

{
    "batch_size": 32,
    "patch_size": 256,
    "valid_mode": "test",
    "edge_decay": 0,
    "only_h_flip": false,
    "optimizer": "adamw",
    "lr": 4e-4,
    "epochs":300,
    "eval_freq": 1
}

to

{
    "batch_size": 8,
    "patch_size": 256,
    "valid_mode": "test",
    "edge_decay": 0,
    "only_h_flip": false,
    "optimizer": "adamw",
    "lr": 1e-4,
    "epochs":300,
    "eval_freq": 1
}
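
One way to read this change, which the comment above implies but does not state, is the linear-scaling heuristic: reduce the learning rate roughly in proportion to the batch size. A small sketch with a hypothetical helper:

def scaled_lr(base_lr, base_batch, new_batch):
    # Linear-scaling rule: keep lr / batch_size roughly constant.
    return base_lr * new_batch / base_batch

print(scaled_lr(4e-4, 32, 8))  # 1e-4, matching the edited config above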
17328-wu commented 2 years ago

40 GB of GPU memory is required to run the training code

I modified it according to your suggestion, but it still fails. Is it because I don't have enough GPU memory?

[screenshot of the error attached]

IDKiro commented 2 years ago

Your environment (CUDA, PyTorch, ...)?

Such an unbalanced GPU allocation does not look like a common mistake.

Can you show the complete nvidia-smi information?
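
If attaching a screenshot is inconvenient, the same numbers can be dumped as text; the query fields below are standard nvidia-smi options, and the snippet is only an illustration:

import subprocess

# Print per-GPU memory usage in CSV form, similar to the nvidia-smi table.
print(subprocess.run(
    ['nvidia-smi', '--query-gpu=index,name,memory.used,memory.total',
     '--format=csv'],
    capture_output=True, text=True).stdout)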

17328-wu commented 2 years ago

Your environment (CUDA, PyTorch, ...)?

Such an unbalanced GPU allocation does not look like a common mistake.

Can you show the complete nvidia-smi information?

[nvidia-smi screenshots attached]

17328-wu commented 2 years ago

Your environment (CUDA, PyTorch, ...)?

Such an unbalanced GPU allocation does not look like a common mistake.

Can you show the complete nvidia-smi information?

Does every GPU need to have roughly the same amount of free memory?

IDKiro commented 2 years ago

Did you run multiple experiments at the same time? The PIDs of these Python programs are not the same. PyTorch does not automatically allocate memory based on each GPU's remaining memory; it is generally allocated evenly across GPUs.
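
A minimal sketch of why the split is even, assuming DataParallel-style multi-GPU training (the toy model and device ids below are illustrative, not taken from train.py):

import os
# Pin the job to specific GPUs before torch creates a CUDA context.
os.environ.setdefault('CUDA_VISIBLE_DEVICES', '0,1')

import torch
import torch.nn as nn

model = nn.Linear(256, 256)  # stand-in for the real network
if torch.cuda.device_count() > 1:
    # DataParallel shards each batch evenly across all visible GPUs,
    # so every device needs roughly the same amount of free memory.
    model = nn.DataParallel(model).cuda()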

IDKiro commented 2 years ago

Your environment (CUDA, PyTorch, ...)? Such an unbalanced GPU allocation does not look like a common mistake. Can you show the complete nvidia-smi information?

Does every GPU need to have roughly the same amount of free memory?

Yes

17328-wu commented 2 years ago

Your environment (CUDA, PyTorch, ...)? Such an unbalanced GPU allocation does not look like a common mistake. Can you show the complete nvidia-smi information?

Does every GPU need to have roughly the same amount of free memory?

Yes

Thank you, the program is up and running, but it's slow: it takes a long time to train one epoch. Is this normal?

IDKiro commented 2 years ago

With 4x RTX 2080 Ti: DehazeFormer-T on the indoor set may take about 12 hours to train, and DehazeFormer-L needs about 1 week. I recommend increasing the batch size as much as possible and avoiding running multiple programs on a single GPU.
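
A rough way to find the largest batch size that fits on one GPU is to double it until a forward/backward pass runs out of memory; this probing loop is only an illustration, not part of train.py:

import torch

def probe_batch_size(model, patch=256, start=1, limit=64):
    # Double the batch size until a 3 x patch x patch forward/backward pass OOMs.
    model = model.cuda()
    batch = start
    while batch <= limit:
        try:
            x = torch.randn(batch, 3, patch, patch, device='cuda')
            model(x).mean().backward()
            model.zero_grad(set_to_none=True)
            torch.cuda.empty_cache()
            batch *= 2
        except RuntimeError:  # CUDA OOM surfaces as a RuntimeError
            torch.cuda.empty_cache()
            return batch // 2  # last size that fit (0 means even `start` OOMs)
    return limit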

17328-wu commented 2 years ago

DehazeFormer-T on the indoor set may take about 12 hours to train.

DehazeFormer-L needs about 1 week.

OK, thank you for your response.

IDKiro commented 2 years ago

Glad to help you!