Megvii-BaseDetection / YOLOX

YOLOX is a high-performance anchor-free YOLO detector, exceeding YOLOv3~v5, with MegEngine, ONNX, TensorRT, ncnn, and OpenVINO support. Documentation: https://yolox.readthedocs.io/
Apache License 2.0

multi-gpu training with open image dataset v6 #993

chendongliang87 opened this issue 2 years ago

chendongliang87 commented 2 years ago

I'm using YOLOX-m to train on the Open Images Dataset V6, and the process freezes here (screenshot omitted). After monitoring the server status, I noticed the RAM usage keeps increasing until it is full. My training command: `python detection/yolox/train.py -f detection/yolox/exps/open_image_yolox_m.py -b 32 -d 2 --fp16`

I'm only using 10 workers.

Thanks in advance for any hint.
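
A quick way to narrow down where the memory is going is to watch the resident set size (RSS) of the training process and of its forked DataLoader worker children while the run is stuck. The sketch below is generic, not YOLOX-specific; it assumes `psutil` is installed and that you pass the PID of the main training process.

```python
# Hedged sketch: watch RSS of a training process and its DataLoader workers.
# Assumes `pip install psutil`; pass the PID of the main training process.
import sys
import time

import psutil


def watch_rss(pid: int, interval: float = 5.0) -> None:
    proc = psutil.Process(pid)
    while True:
        workers = proc.children(recursive=True)  # forked DataLoader workers
        main_gb = proc.memory_info().rss / 1e9
        worker_gb = sum(w.memory_info().rss / 1e9 for w in workers)
        print(f"main={main_gb:.2f} GB  workers({len(workers)})={worker_gb:.2f} GB")
        time.sleep(interval)


if __name__ == "__main__":
    watch_rss(int(sys.argv[1]))
```

If the worker processes are the ones growing, the problem is in the data pipeline rather than in the model or optimizer state.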

FateScript commented 2 years ago
  1. Are you using the latest branch?
  2. What if you reduce num_workers ?
chendongliang87 commented 2 years ago
> 1. Are you using the latest branch?
> 2. What if you reduce num_workers ?
  1. I'm using `e1052df - (4 months ago) append trt engine file(*.engine) to .ignore (#443) - Yonghye Kwon (tag: 0.1.1rc0, tag: 0.1.0)`.
  2. I changed `num_workers` to 4, but it doesn't seem to help (screenshot omitted); it froze there for more than 10 minutes. BTW, I noticed the code at this link seems to run twice, as shown by the two progress bars above. Any idea about it? https://github.com/Megvii-BaseDetection/YOLOX/blob/dd5700c24693e1852b55ce0cb170342c19943d8b/yolox/data/datasets/coco.py#L64
FateScript commented 2 years ago

It seems you are using cache. Try removing `--cache` from your CLI, changing `num_workers` to 0, and launching with only 1 GPU.

chendongliang87 commented 2 years ago

> It seems you are using cache. Try removing `--cache` from your CLI, changing `num_workers` to 0, and launching with only 1 GPU.

I just checked: I'm not using cache anywhere. I have run with 1 GPU and it looks like it works well. Any hint about this? Also, the GPU utilization is low no matter how many workers I use; I can see the GPU is often idle. I've preprocessed all the annotations before training, so the data loader only needs to (load image -> mosaic + mixup). I'm not sure why it is still not efficient.
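
To confirm that the (load image -> mosaic + mixup) pipeline really is the bottleneck, you can time how long each iteration spends blocked on the dataloader versus running the training step. A rough sketch, where `loader` and `train_step` are placeholders for your own objects, not YOLOX APIs:

```python
# Hedged sketch: compare time spent waiting on data vs. time spent computing.
# `loader` and `train_step` are placeholders for your own dataloader and step.
import time

import torch


def profile_loader(loader, train_step, max_iters=200, use_cuda=True):
    it = iter(loader)
    data_t = step_t = 0.0
    for _ in range(max_iters):
        t0 = time.perf_counter()
        try:
            batch = next(it)            # blocked on the data pipeline
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)               # forward + backward + optimizer step
        if use_cuda:
            torch.cuda.synchronize()    # make GPU time visible to the wall clock
        t2 = time.perf_counter()
        data_t += t1 - t0
        step_t += t2 - t1
    total = data_t + step_t
    print(f"data wait {data_t:.1f}s, compute {step_t:.1f}s "
          f"({100 * data_t / total:.0f}% of the time waiting on data)")
```

If most of the time is spent in `next(it)`, the GPU idling is expected and the fix is on the data side (faster decoding, fewer or cheaper augmentations, or more workers, memory permitting).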

FateScript commented 2 years ago

Maybe you could try just removing `--cache`. A possible reason for this is OpenImages' huge number of images combined with your limited memory.

chendongliang87 commented 2 years ago

> Maybe you could try just removing `--cache`. A possible reason for this is OpenImages' huge number of images combined with your limited memory.

Thanks for your reply. I didn't specify `--cache` in the training command line: `python tools/train_custom.py -f exps/open_image/open_image_yolox_m.py -b 32 -d 2 --fp16 -c /YOLOX/weights/yolox_m.pth`. If cache were enabled, I would see the following info in the log:

```python
logger.info(
    "Caching images for the first time. This might take about 20 minutes for COCO"
)
```

but I didn't see any log like this.

FateScript commented 2 years ago

`--cache` is not specified, but the dataset is still being cached?

chendongliang87 commented 2 years ago

> `--cache` is not specified, but the dataset is still being cached?

I didn't specify `--cache`.
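
With `--cache` ruled out, one plausible cause of RAM climbing steadily with several forked DataLoader workers on a dataset as large as Open Images is copy-on-write "unsharing": the annotations are millions of small Python objects, and every refcount update in a worker touches (and therefore copies) the memory pages holding them. A common mitigation, sketched below under the assumption that the annotations are a plain list of per-image records, is to serialize them into one numpy byte buffer that the workers can share read-only; this is illustrative, not YOLOX's own code.

```python
# Hedged sketch: pack a large list of per-image annotation records into one
# numpy byte buffer so forked DataLoader workers share it copy-on-write
# without triggering per-object refcount writes. Names are illustrative.
import pickle

import numpy as np


class PackedAnnotations:
    def __init__(self, annotations):
        blobs = [np.frombuffer(pickle.dumps(a, protocol=-1), dtype=np.uint8)
                 for a in annotations]
        self._addr = np.cumsum([len(b) for b in blobs])  # end offset per record
        self._data = np.concatenate(blobs)               # single shared buffer

    def __len__(self):
        return len(self._addr)

    def __getitem__(self, idx):
        start = 0 if idx == 0 else int(self._addr[idx - 1])
        end = int(self._addr[idx])
        return pickle.loads(self._data[start:end].tobytes())
```

Swapping something like this in for the raw annotation list inside the dataset tends to keep per-worker RSS flat, at the cost of one `pickle.loads` per sample.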

abalikhan commented 2 years ago

@chendongliang87 did you resolve your issue? I am facing the same issue.

chendongliang87 commented 2 years ago

> @chendongliang87 did you resolve your issue? I am facing the same issue.

No progress on my end; I haven't had time to look into it deeply.

abalikhan commented 2 years ago

> @chendongliang87 did you resolve your issue? I am facing the same issue.
>
> No progress on my end; I haven't had time to look into it deeply.

Try keeping `num_workers = 0`. This should do the trick. At least it worked for me.
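
If zero workers is what makes it run, the cleanest place to set that in YOLOX is usually the experiment file rather than the dataloader call itself. The sketch below assumes the base `Exp` exposes a `data_num_workers` attribute (it does in the versions I've looked at, but verify against your checkout of `yolox/exp/yolox_base.py`):

```python
# Hedged sketch: pin dataloader workers to 0 from a custom experiment file.
# `data_num_workers` matches the attribute name in YOLOX's base Exp as far as
# I can tell; check your own version before relying on it.
from yolox.exp import Exp as BaseExp


class Exp(BaseExp):
    def __init__(self):
        super().__init__()
        self.depth = 0.67           # YOLOX-m multipliers
        self.width = 0.75
        self.data_num_workers = 0   # single-process loading avoids the freeze
```

Note that single-process loading will make the GPU-idle problem discussed above worse, so treat this as a workaround for the hang rather than a fix for throughput.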