I hope you can pay attention to this issue. I see that other people have raised similar questions, but there has been no reply.
Although I have only been training on a single GPU, I do know that this error pops up when your GPU runs out of memory, which largely depends on the batch size you chose. So, in your command line (the python train.py... one), if you lower your batch size (I see you set it to 64), you may stop getting this error. Let me know if that worked!
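For example, something along these lines (the batch size here is just an illustration; keep the rest of your own flags):

    python train.py --workers 8 --device 0 --batch-size 32 --data data/coco.yaml --img 416 416 --cfg cfg/training/yolov7-tiny.yaml --weights weights/yolov7-tiny.pt --hyp data/hyp.scratch.tiny.yaml

If 32 still runs out of memory, keep halving it until the error disappears.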
Thank you very much for your reply! I hope my detailed follow-up can help you improve the repository!
Let me give an overview of the current situation first, and then go into the details.

Overview:

(1) This is a real problem: I have 4 GPUs, but their memory usage differs wildly during training.
(2) batch-size=1 is a bad suggestion when running yolov7-tiny at img-size 416x416 on GPUs with 40 GB of memory each; even so, it still fails, and batch-size=1 means training takes a very long time to complete.
(3) I will try the following: a. yolov5 + my Objects365 dataset (the 2019 release), to verify whether there is a problem with my dataset; b. yolov7-tiny + the Objects365 dataset (the 2020 release).
(4) I can almost confirm that the problem is in the data loading, because sometimes I can train a full epoch and sometimes I cannot. It seems some code has to be modified to make the repository more robust.
(5) This combination works well: yolov7-tiny + COCO 2017 + any suitable batch size.
Now I will describe the problem more carefully. First, I trained on the COCO 2017 dataset like this:
python -m torch.distributed.launch --nproc_per_node 4 --master_port 9527 train.py --workers 16 --device 0,1,2,3 --sync-bn --batch-size 1000 --data data/coco.yaml --img 416 416 --cfg cfg/training/yolov7-tiny.yaml --weights weights/yolov7-tiny.pt --name yolov7tiny_coco --hyp data/hyp.scratch.tiny.yaml
I have to admit, this works well! Even when I train for more than 10 epochs, "CUDA out of memory" never reappears, and the memory usage of the 4 GPUs stays stable and equal.
After that, I downloaded the Objects365 dataset (the 2019 release, via https://github.com/lidc1004/Object-detection-converts), which ships as JSON annotations. I started training after converting it to the YOLO format. This is the conversion code:
# -*- coding: UTF-8 -*-
import json
import os

jsonfile1 = "/ssd/xiedong/datasets/objects365/Annotations/val/val.json"
jsonfile2 = "/ssd/xiedong/datasets/objects365/Annotations/train/train.json"

for jsonfile in [jsonfile1, jsonfile2]:
    saveDstPath = os.path.dirname(jsonfile)
    with open(jsonfile, 'r', encoding="utf-8") as f:
        datas = json.load(f)
    id_names = {imt["id"]: imt["name"] for imt in datas["categories"]}
    imageid_hw_dict = {}
    for d in datas["images"]:
        imageid_hw_dict[d["id"]] = [d["width"], d["height"], d["file_name"]]
    annotations_imageid_idbox = {}
    for d in datas["annotations"]:
        if d["image_id"] not in annotations_imageid_idbox:
            annotations_imageid_idbox[d["image_id"]] = []
        annotations_imageid_idbox[d["image_id"]].append([d["bbox"], d["category_id"]])
    # convert to YOLO format
    for imageid in annotations_imageid_idbox:
        hw = imageid_hw_dict[imageid]
        w = hw[0]
        h = hw[1]
        filename = hw[2]
        with open(os.path.join(saveDstPath, filename.replace(".jpg", ".txt")), "w") as f:
            res_str = []
            for box1 in annotations_imageid_idbox[imageid]:
                box = box1[0]
                x_yolo = min((box[0] + box[2] / 2) / w, 1.0)
                y_yolo = min((box[1] + box[3] / 2) / h, 1.0)
                w_yolo = min(box[2] / w, 1.0)
                h_yolo = min(box[3] / h, 1.0)
                res_str.append(
                    "{} {} {} {} {}".format(box1[1] - 1, round(x_yolo, 6), round(y_yolo, 6), round(w_yolo, 6),
                                            round(h_yolo, 6)))  # subtract 1 to map class ids 1..365 to 0..364
            f.write("\n".join(res_str))
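One thing worth double-checking here: min(..., 1.0) only clamps the upper end, and Objects365 boxes can stick out past the image border, so negative coordinates can survive into the .txt files. A minimal sanity-check sketch over the converted labels (the directory is just where my script writes; adapt as needed):

    import glob
    import os

    label_dir = "/ssd/xiedong/datasets/objects365/Annotations/train"  # where the script above writes .txt files
    bad = 0
    for txt in glob.glob(os.path.join(label_dir, "*.txt")):
        with open(txt) as fh:
            for line in fh:
                parts = line.split()
                if len(parts) != 5:
                    continue
                x, y, w, h = map(float, parts[1:])
                # flag anything a YOLO-style loader cannot digest:
                # out-of-range centers or non-positive / oversized boxes
                if not (0.0 <= x <= 1.0) or not (0.0 <= y <= 1.0) or not (0.0 < w <= 1.0) or not (0.0 < h <= 1.0):
                    bad += 1
                    print(txt, line.strip())
    print("suspicious label lines:", bad)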
At the beginning, I could not even finish one epoch of training! So I rewrote the YOLO labels to remove small targets and targets that are too close to the edge. This is the code:
import os
from tqdm import tqdm

def listPathAllfiles(dirname):
    result = []
    for maindir, subdir, file_name_list in os.walk(dirname):
        for filename in file_name_list:
            apath = os.path.join(maindir, filename)
            result.append(apath)
    return result

path = r"/ssd/xiedong/datasets/objects365/labels"
files = listPathAllfiles(path)
for file in tqdm(files):
    with open(file, "r") as f:
        lines = f.read().splitlines()
    lines_new = []
    for line in lines:
        if len(line) < 2:
            continue
        cid, x0, y0, w, h = list(map(float, line.split(" ")))
        # fix up the coordinates
        if x0 >= 0.99 or y0 >= 0.99:  # drop data too close to the edge
            continue
        if x0 < 0.01 and y0 < 0.01:  # drop data too close to the edge
            continue
        if w < 0.01 or h < 0.01:  # drop boxes that are too small
            continue
        if x0 + w / 2 > 0.99:
            w = (0.99 - x0) * 2
        if y0 + h / 2 > 0.99:
            h = (0.99 - y0) * 2
        if x0 - w / 2 < 0.01:
            w = (x0 - 0.01) * 2
        if y0 - h / 2 < 0.01:
            h = (y0 - 0.01) * 2
        # the same four clamps again, as a safety net
        if x0 + w / 2 > 0.99:
            w = (0.99 - x0) * 2
        if y0 + h / 2 > 0.99:
            h = (0.99 - y0) * 2
        if x0 - w / 2 < 0.01:
            w = (x0 - 0.01) * 2
        if y0 - h / 2 < 0.01:
            h = (y0 - 0.01) * 2
        str1 = str(int(cid)) + " " + str(round(x0, 6)) + " " + str(round(y0, 6)) + " " + str(
            round(w, 6)) + " " + str(round(h, 6))
        lines_new.append(str1)
    with open(file, "w") as f:
        f.write("\n".join(lines_new))
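One caveat about this fix-up, in case anyone reuses it: the skip condition "x0 < 0.01 and y0 < 0.01" only drops corner boxes, so a box whose center is near just one border still gets through, and w = (x0 - 0.01) * 2 then turns negative. A more compact variant of the same adjustment (just a sketch; the clamp_box name and the 0.01 margin are mine):

    def clamp_box(x0, y0, w, h, margin=0.01):
        # shrink the box symmetrically so both edges stay inside [margin, 1 - margin];
        # returns None when no positive-sized box is left (center outside the margins)
        w = min(w, 2 * (1.0 - margin - x0), 2 * (x0 - margin))
        h = min(h, 2 * (1.0 - margin - y0), 2 * (y0 - margin))
        if w <= 0 or h <= 0:
            return None
        return x0, y0, w, h

    # usage inside the loop above, replacing the eight if-blocks:
    # fixed = clamp_box(x0, y0, w, h)
    # if fixed is None:
    #     continue
    # x0, y0, w, h = fixed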
After this, when I train the model again, I can get through several epochs!
But two problems remain:
(1) As I described at the beginning of this issue, the memory occupied on the 4 GPUs is not the same.
(2) "RuntimeError: CUDA out of memory" still appears after several epochs of training.
I suspect this is caused by numerical problems in the label data, but I don't know how to solve it.
Finally, setting "--batch-size 1" is not a usable workaround: the batch is split across the GPUs under DDP, so with 4 GPUs I have to set at least "--batch-size 4".
When I run:
/ssd/xiedong/miniconda3/envs/py37c/bin/python -m torch.distributed.launch --nproc_per_node 4 --master_port 9527 train.py --workers 16 --device 0,1,2,3 --sync-bn --batch-size 4 --data data/Objects365_2019.yaml --img 416 416 --cfg cfg/training/yolov7-tiny.yaml --weights weights/yolov7-tiny.pt --name yolov7tiny_obj3652019 --hyp data/hyp.scratch.tiny.yaml --resume
You can see from the GPU memory occupation that something is wrong (the figure below shows the state when the dataset progress bar reaches 1%).
When the dataset progress bar shows:
0/299 34.9G 0.0631 0.05328 0.0924 0.2088 19 416: 9%|โโโโโ | 13764/152152 [28:36<4:50:19, 7.94it/s]
you get the uneven memory usage shown in the figure, and I can say with confidence what happens next: RuntimeError: CUDA out of memory.
For the record, I searched the yolov7 issues before opening this one.
In the end, the data processing method was the key; numerical instability in the label values causes this problem.
I switched to the conversion used here: https://github.com/ultralytics/yolov5/blob/master/data/Objects365.yaml and it works for me!
It's a mysterious experience.
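For anyone who lands here: as far as I can tell, the important difference in that conversion is that it clips each box to the image in corner (x1, y1, x2, y2) pixel coordinates before normalizing, instead of clamping the already-normalized center format like my script did. A rough standalone sketch of that idea (my own function, not the yolov5 code):

    def coco_bbox_to_yolo(box, img_w, img_h):
        # box is COCO-style [x_top_left, y_top_left, width, height] in pixels
        x1 = min(max(box[0], 0.0), float(img_w))
        y1 = min(max(box[1], 0.0), float(img_h))
        x2 = min(max(box[0] + box[2], 0.0), float(img_w))
        y2 = min(max(box[1] + box[3], 0.0), float(img_h))
        if x2 <= x1 or y2 <= y1:
            return None  # the box lies entirely outside the image
        # clipped corners -> normalized YOLO (x_center, y_center, w, h)
        return ((x1 + x2) / 2.0 / img_w, (y1 + y2) / 2.0 / img_h,
                (x2 - x1) / img_w, (y2 - y1) / img_h)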
I have also encountered this problem. What batch size did you end up using?
Thank you for your work, but I think there is still much room for improvement in the practicality of the project. There is no problem with the COCO 2017 data, but this problem occurs when I try to train on the 2019 Objects365 data. It can train for some epochs, but then problems appear: the GPU memory occupation is not normal, and it keeps rising and fluctuating during training.