Hlings / AsyFOD

(CVPR 2023) The PyTorch implementation of "AsyFOD: An Asymmetric Adaptation Paradigm for Few-Shot Domain Adaptive Object Detection".
MIT License

The same issue always occurs when I run train.py on different GPUs #6

Closed ljx93 closed 9 months ago

ljx93 commented 10 months ago

RuntimeError: CUDA out of memory. Tried to allocate 938.00 MiB (GPU 0; 4.00 GiB total capacity; 1.81 GiB already allocated; 931.70 MiB free; 1.83 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

ljx93 commented 10 months ago

I also ran it on Colab; the result never changed: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.35 GiB. GPU 0 has a total capacity of 14.75 GiB of which 210.81 MiB is free. Process 9980 has 14.53 GiB memory in use. Of the allocated memory 13.64 GiB is allocated by PyTorch, and 24.84 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
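Both tracebacks suggest the same allocator knob. A minimal sketch of applying it before PyTorch makes its first CUDA allocation (the 128 MB split size here is an assumption to tune, not a value from this repo):

```python
import os

# PYTORCH_CUDA_ALLOC_CONF must be set before the first CUDA allocation,
# i.e. before `import torch` touches the GPU. max_split_size_mb limits
# how large cached blocks can be split, which reduces fragmentation.
# The 128 MB value is an assumption -- tune it for your GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

The same setting can also be passed as an environment variable on the command line instead of in code.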

Hlings commented 10 months ago

Hi, could you share the config and training command? I think the problem may lie in the batch size settings for the source and target data. I implement the method with a relatively small target batch size (bs_t) in every batch, and the total batch size is the source batch size plus the target batch size (bs_s + bs_t).
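The arithmetic above can be sketched as follows (the variable names are illustrative assumptions, not the repo's actual option names; bs_t = 1 matches the minimum mentioned later in this thread):

```python
# The effective per-step batch that must fit in GPU memory is the
# source batch plus the target batch.
bs_s = 2  # source batch size, e.g. the value passed via --batch
bs_t = 1  # small target batch size used in every step
total_bs = bs_s + bs_t

print(total_bs)  # -> 3
```

So lowering --batch alone does not shrink the footprint below bs_s + bs_t images per step.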

ljx93 commented 10 months ago

Thanks for your reply.

train_source: [D:\AsyFOD-main\VOCdevkit\images\train, D:\AsyFOD-main\VOCdevkit\images\train, D:\AsyFOD-main\VOCdevkit\images\train, D:\AsyFOD-main\VOCdevkit\images\train]
train_target: D:\AsyFOD-main\VOCdevkit\images\train

Training command: python train.py --img 640 --batch 2 --epochs 300 --data ./data/eg/city_and_foggy8_3.yaml --cfg ./models/yolov5x.yaml --hyp ./data/hyp_aug/m1.yaml --weights ' ' --name "test"

I have changed the batch size, but it didn't work. Maybe I didn't understand "setting the batch size of source and target data"; please tell me where to find it.

Hlings commented 10 months ago

Hi, can you share your GPU's type and memory for this experiment?

ljx93 commented 10 months ago

Do you mean this: Using torch 1.10.0+cu102 CUDA:0 (GeForce GTX 1650, 4096MB), CUDA out of memory. Tried to allocate 938.00 MiB (GPU 0; 4.00 GiB total capacity; 1.81 GiB already allocated; 931.70 MiB free; 1.83 GiB reserved in total by PyTorch). I also tried it on an RTX 4060 Ti and got the same problem.

Hlings commented 10 months ago

Yeah. I think the GTX 1650 is hard to use for running YOLOv5-X, with only 4 GB (or even 8 GB) of memory. The minimum total batch size is more than 8, so 8 GB of memory is insufficient. You can use a smaller detector like YOLOv5 L/M/S and change the dimension in the Ranker class (you can refer to the other issues for this modification).

ljx93 commented 10 months ago

Thank you for your advice, but when I use yolov5s or yolov5m to run train.py, I get the same problem, just with a different memory allocation for PyTorch. Here are the results:

yolov5s: RuntimeError: CUDA out of memory. Tried to allocate 1.31 GiB (GPU 0; 4.00 GiB total capacity; 1.41 GiB already allocated; 1.31 GiB free; 1.43 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

yolov5m: RuntimeError: CUDA out of memory. Tried to allocate 1.24 GiB (GPU 0; 4.00 GiB total capacity; 1.49 GiB already allocated; 1.24 GiB free; 1.50 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Hlings commented 10 months ago

Hello, I rechecked the code, and the target batch size is set to 1 (see here), the minimum value for experiments. So, if the experiments still can't run with a total batch size of 2, I think 8 GB (or 4 GB) of memory is really not enough. Also, you can refer to the official yolov5 repo and check whether you can run this YOLOv5 model there; then you can share your experience here and let me check the problem :)

ljx93 commented 10 months ago

Thank you for your advice. First, I tried yolov5s.yaml on an RTX 4060 (8 GB), and the result is:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.97 GiB (GPU 0; 8.00 GiB total capacity; 3.06 GiB already allocated; 3.45 GiB free; 3.09 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This time I find my memory is enough, but it still didn't work.

Hlings commented 10 months ago

Can you run the official yolov5 repo I listed above on the RTX 4060?

Hlings commented 10 months ago

Also, this information is somewhat strange: the free memory (3.45 GB) is larger than the required memory (2.97 GB), so it should be able to run.
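One possible explanation for an OOM despite apparently sufficient free memory is fragmentation: the free bytes exist, but not as one contiguous block. A toy pure-Python illustration (the region sizes are made up for the example, not measured from this run):

```python
# Toy model of a fragmented memory pool: free regions (in MiB) scattered
# between live allocations. The total free space can exceed a request
# while no single contiguous region is large enough to satisfy it.
free_regions = [900, 800, 700, 600, 530]  # illustrative sizes only

total_free = sum(free_regions)          # 3530 MiB reported as "free"
largest_contiguous = max(free_regions)  # 900 MiB usable in one piece

request = 3041  # ~2.97 GiB expressed in MiB

print(total_free >= request)           # -> True: looks like it should fit
print(largest_contiguous >= request)   # -> False: no region is big enough
```

This is exactly the scenario the max_split_size_mb hint in the error message is meant to mitigate.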

ljx93 commented 10 months ago

> Can you run the official yolov5 repo I list above on RTX4060?

Before this, I ran YOLOv7 on a GTX 1650 and it worked well. I will find time to run yolov5 and tell you the result.

ljx93 commented 10 months ago

> Also, this information is somewhat strange, like the free memory (3.45GB) > required memory (2.97GB), but it can run.

I also can't understand it. Just now I tried it on an RTX 4090, which had the same problem.

ljx93 commented 10 months ago

> Also, this information is somewhat strange, like the free memory (3.45GB) > required memory (2.97GB), but it can run.

Do you have any experience with this problem?

Hlings commented 10 months ago

> > Can you run the official yolov5 repo I list above on RTX4060?
>
> Before this, I ran YOLOv7 on a GTX 1650 and it worked well. I will find time to run yolov5 and tell you the result.

Hi, could you share more details about the successful run of YOLOv7? Like the model size, the batch size, etc.

Hlings commented 10 months ago

If convenient, you can share your personal contact (WeChat or Twitter?) via my email "gaoyp23@mail2.sysu.edu.cn", so we can discuss more closely :)

ljx93 commented 10 months ago

> If convenient, you can share your personal contact (like wechat or twitter?) to my email "gaoyp23@mail2.sysu.edu.cn". So we can discuss closely :)

Ok.