AILab-CVC / YOLO-World

[CVPR 2024] Real-Time Open-Vocabulary Object Detection
https://www.yoloworld.cc
GNU General Public License v3.0

Some questions about training a custom dataset #398

Open Le0v1n opened 1 week ago

Le0v1n commented 1 week ago

If I want to train my own dataset, the situation of the dataset is as follows:

| Dataset Name   | Number of Categories | Category Name | BBoxes | Captions |
| -------------- | -------------------- | ------------- | ------ | -------- |
| pedestrian_det | 1                    | person        | ✓      | ×        |

It is clear that my dataset is a traditional object detection dataset. I now want to fine-tune YOLO-World on it, and I would like to confirm whether my planned steps are correct:

  1. First convert the dataset to COCO format using the third_party/mmyolo/tools/dataset_converters/yolo2coco.py script.
  2. Add a class-text file under the data/texts/ folder. Since my dataset has only one category (person), the content of this JSON file is `[["person"]]` (see the sketch after this list).
  3. Find a suitable configuration file for fine-tuning. I have read the documentation you provide carefully, and because my dataset only has bounding-box annotations and no captions, I should use Normal Fine-tuning. The configuration file I wanted to use is configs/finetune_coco/yolo_world_v2_s_bn_2e-4_80e_8gpus_mask-refine_finetune_coco.py, but unfortunately I could not find its pre-trained weights on Hugging Face, so I used configs/finetune_coco/yolo_world_v2_s_vlpan_bn_2e-4_80e_8gpus_mask-refine_finetune_coco.py instead.
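
For concreteness, this is roughly how I generate the class-text file and what I plan to override in the config. The file name `custom_class_texts.json` is my own choice, and the override names (`num_classes`, `class_text_path`) are just copied from what I see in the COCO fine-tune configs, so please correct me if these are not the right knobs:

```python
import json
from pathlib import Path

# One inner list per category; my dataset has a single "person" class.
class_texts = [["person"]]

# File name is my own choice; any path under data/texts/ should work.
texts_path = Path("data/texts/custom_class_texts.json")
texts_path.parent.mkdir(parents=True, exist_ok=True)
texts_path.write_text(json.dumps(class_texts), encoding="utf-8")

print(texts_path.read_text(encoding="utf-8"))  # -> [["person"]]

# In the fine-tune config I then intend to change (tentatively):
#   num_classes = 1
#   class_text_path = 'data/texts/custom_class_texts.json'
# plus data_root / ann_file pointing at the COCO-format annotations
# produced by yolo2coco.py.
```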

During training, the following warning appears in the terminal:

```
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
```

I am not sure whether this warning will affect the training.
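
For now I just silence it by setting the environment variable before the tokenizer gets imported; a minimal sketch of what I do (the launch line is simply how I start training, adjust to your setup):

```python
import os

# Must be set before `transformers` / `tokenizers` is first imported,
# otherwise the fork warning may still show up in worker processes.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Equivalent shell form before launching training, e.g.:
#   export TOKENIZERS_PARALLELISM=false
#   ./tools/dist_train.sh configs/finetune_coco/<config>.py 1
```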

After that, during training I found that the loss seems quite large. It does decrease as the number of epochs increases: after 77 epochs, `loss: 372.6489 loss_cls: 243.9510 loss_bbox: 63.6398` had dropped to `loss: 109.2642 loss_cls: 24.2213 loss_bbox: 38.7833`. I would like to ask whether this loss magnitude is normal.

My understanding of VL-PAN is that it is used to link (fuse) the text features with the image features; I am not sure whether this understanding is correct.

For easy reading, I have summarized my questions as follows:

1. Will the huggingface/tokenizers fork warning shown above affect training?
2. Is the loss magnitude (dropping from about 372 to about 109 after 77 epochs) normal for Normal Fine-tuning?
3. Is my understanding of VL-PAN (linking/fusing text features and image features) correct?
4. Are the three preparation steps I listed above (COCO conversion, class-text JSON, choice of config) the right way to fine-tune on a bbox-only custom dataset?

Thank you very much for answering my questions😊!

GUWOGANSHOU commented 5 days ago

I used the config (yolo_world_v2_l_vlpan_bn_sgd_1e-3_40e_8gpus_finetune_coco.py) to fine-tune a custom dataset with only one class, and I encountered similar problems: the loss stays large and does not decrease.

```
grad_norm: nan  loss: 194.9646  loss_cls: 69.6331  loss_bbox: 56.4941  loss_dfl: 68.8373
coco/bbox_mAP: 0.0020  coco/bbox_mAP_50: 0.0130  coco/bbox_mAP_75: 0.0000  coco/bbox_mAP_s: -1.0000  coco/bbox_mAP_m: -1.0000  coco/bbox_mAP_l: 0.0020
```
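
Since `grad_norm` is `nan`, I am wondering whether the learning rate is too high for a single-class dataset or whether gradient clipping is needed. This is the kind of `optim_wrapper` override I am thinking of trying (just a sketch using MMEngine's standard `clip_grad` option; the numbers are my own guesses, not recommendations). Is this the right direction?

```python
# Sketch of an optim_wrapper override in the fine-tune config.
# `clip_grad` is MMEngine's standard gradient-clipping option;
# the lr / max_norm values here are guesses for my experiment.
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(
        type='SGD',
        lr=1e-4,            # lower than the 1e-3 in the config name
        momentum=0.937,
        weight_decay=0.0005,
        nesterov=True),
    clip_grad=dict(max_norm=10.0, norm_type=2))
```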