Issue with Training on one GPU

cnulab / RealNet

Official implementation of "RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection (CVPR 2024)"

MIT License

Issue with Training on one GPU #31

Open MadiElHadj opened 5 months ago

MadiElHadj commented 5 months ago

Hello, thank you for making the code available. While trying to run the training code, I have been stuck on the distributed training part. I have only one GPU available, and the code requires DDP-related configuration. The error occurs at "rank = int(os.environ["RANK"])". Is there any way to run it on a single GPU? Thank you in advance.

cnulab commented 5 months ago

Hello! Normally, our code can run on a single GPU by setting nproc_per_node=1, for example:

$ python -m torch.distributed.launch --nproc_per_node=1 train_diffusion.py --dataset MVTec-AD

If you still can't run it with the above command, please try to provide more error information.
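
For readers hitting the same "RANK" error: the launcher is what sets the RANK and WORLD_SIZE environment variables, so starting the script with plain "python train_diffusion.py" leaves them undefined and the lookup fails. Below is a minimal sketch of reading these variables with single-process fallbacks. The helper name read_dist_env and its defaults are assumptions for illustration, not part of the RealNet code.

```python
import os


def read_dist_env(env=None):
    """Return (rank, world_size, local_rank) from a mapping of environment variables.

    Hypothetical helper: falls back to a single-process layout
    (rank 0, world size 1, local rank 0) when the launcher did not set them.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    local_rank = int(env.get("LOCAL_RANK", 0))
    return rank, world_size, local_rank
```

Note that this only avoids the KeyError. If the script also initializes a process group from environment variables, MASTER_ADDR and MASTER_PORT are needed as well, which is why using the launcher with nproc_per_node=1 is the simpler route.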

MadiElHadj commented 5 months ago

I am actually getting this error: [screenshot: Capture d'écran 2024-05-31 104155]

along with this message: [screenshot: Capture d'écran 2024-05-31 104337]

cnulab commented 5 months ago

I'm not sure if this issue is caused by the GPU and DDP. If you have defined your own dataset, you need to add your custom dataset to the choices in the arguments. Alternatively, you can paste the complete error message here.

MadiElHadj commented 5 months ago

I am not using my own dataset for training. I was just trying to run the eval code, but there are no weights for the RealNet network, so I tried to run the training script to obtain them. This is the complete error.

cnulab commented 5 months ago

Sorry, I don't know the exact cause of your error; it could be due to various reasons. You may refer to the following: https://github.com/Vision-CAIR/MiniGPT-4/issues/237

Downgrade to a lower version of torch. I haven't tested the code on torch 2.x.
Reduce the batch size, although I'm not sure this will help.

MadiElHadj commented 4 months ago

Hello, thank you for the insights. I have resolved that issue, but I am now getting another error: CUDA out of memory. Is there a minimum GPU requirement?

cnulab commented 4 months ago

Training diffusion models requires a relatively large amount of GPU memory. A GPU with 48 GB or more is recommended, which allows a batch size of 5. With only a 24 GB GPU, the batch size must be set to 1. If you only want to reproduce the experimental results, you can use the checkpoints I provided.
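
The guidance above can be written down as a small lookup. This is only a sketch: the function names are hypothetical, the two thresholds are the ones quoted in this comment, and nothing is claimed for GPUs below 24 GB. The detection helper rounds the reported memory to whole GB because cards usually report slightly less than their nominal size.

```python
def suggest_batch_size(gpu_memory_gb):
    """Map nominal GPU memory (GB) to the diffusion batch size quoted in the thread."""
    if gpu_memory_gb >= 48:
        return 5
    if gpu_memory_gb >= 24:
        return 1
    return None  # below 24 GB: the thread gives no guidance


def detected_gpu_memory_gb():
    """Return the nominal memory of GPU 0 in GB, or None if it cannot be determined."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    return round(total_bytes / 1024**3)
```

For example, suggest_batch_size(24) returns 1, matching the advice for a 24 GB card.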

MadiElHadj commented 4 months ago

The checkpoints for the diffusion model and the classifier are available, but when I tried to reproduce the RealNet results, I got a missing-checkpoint error: [Errno 2] No such file or directory: 'experiments/MVTec-AD/realnet_checkpoints/bottle/ckpt_best.pth.tar'. So I decided to train the RealNet network with the following command: $ python -m torch.distributed.launch --nproc_per_node=1 train_realnet.py --dataset MVTec-AD --class_name bottle. That is where I got the CUDA memory error.

cnulab commented 4 months ago

The training of RealNet only requires 24GB of GPU memory.

wang20001220 commented 1 month ago

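Hello, I ran into the same problem as you, with torch version 2.3 and CUDA version 11.8. Did you solve it by downgrading torch? If so, which version should I downgrade to?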

cnulab commented 1 month ago

This code has not been tested on torch 2.x. I use torch 1.11.
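
A quick way to see whether an environment falls into the untested range is to parse the installed version string. The sketch below uses hypothetical helper names and only encodes what is stated in this thread: torch 1.11 is what the author uses, and 2.x is untested.

```python
def torch_major_minor(version):
    """Parse a version string such as '1.11.0+cu113' into (1, 11)."""
    core = version.split("+")[0]  # drop the local build tag, e.g. '+cu113'
    major, minor = core.split(".")[:2]
    return int(major), int(minor)


def is_untested_torch(version):
    """True for torch 2.x and later, which the author reports not having tested."""
    return torch_major_minor(version)[0] >= 2
```

In practice the input would be torch.__version__, for example is_untested_torch(torch.__version__).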