Some questions about training a custom dataset

If I want to train my own dataset, the situation of the dataset is as follows:

Dataset Name	Category	Category Name	BBOX	Captions
pedestrian_det	1	person	√	×

It is clear that my dataset is a traditional object detection dataset. Now I want to fine-tune it with YOLO-World, and I want to confirm whether my thinking (steps) is correct:

I first convert the dataset to the COCO format, using the third_party/mmyolo/tools/dataset_converters/yolo2coco.py script here.
Add a description text in the data/texts/ folder, since my dataset has only one person category, the content of this json file is: [["person"]]
I want to find a suitable configuration file for fine-tuning. I have read the documents you provided carefully, and because my dataset only has BBox coordinates and does not have corresponding captions, I should use Normal Finetuning. The configuration file I want to use is: configs/finetune_coco/yolo_world_v2_s_bn_2e-4_80e_8gpus_mask-refine_finetune_coco.py, but unfortunately, I did not find this pre-trained weight on HF, so I used the configuration file configs/finetune_coco/yolo_world_v2_s_vlpan_bn_2e-4_80e_8gpus_mask-refine_finetune_coco.py.

During the training process, there will be some prompts in the terminal:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

I am not sure whether this warning will affect the training.

After that, during the training process, I found that the Loss is a bit too large, as the number of epochs increases (after 77 epochs), loss: 372.6489 loss_cls: 243.9510 loss_bbox: 63.6398 became loss: 109.2642 loss_cls: 24.2213 loss_bbox: 38.7833. I would like to ask if this loss is normal?

My understanding of VL-PAN is that "VL-PAN is used to handle the linking of text data and image data," I do not know if my understanding is correct.

For easy reading, I have summarized my questions as follows:

Q1: Is the step of my fine-tuning correct?
Q2: Does the HF warning affect the training?
Q3: Is the Loss normal?
Q4: Is VL-PAN used to handle the linking of text data and image data?
Q5: How to obtain the weight file in configs/finetune_coco/yolo_world_v2_s_bn_2e-4_80e_1gpu_finetune_coco128.py, specifically: load_from = '../FastDet/output_models/pretrain_yolow-v8_s_clipv2_frozen_te_noprompt_t2i_bn_2e-3adamw_scale_lr_wd_32xb16-100e_obj365v1_goldg_cc3mram250k_train_lviseval-e3592307_rep_conv.pth'. I did not find it in HF.
Q6: The pre-trained weight used has mask-refine, but my dataset does not have mask annotation, can fine-tuning be completed like this?
Q7: What is the difference between Dual VL-PAN and VL-PAN?

Thank you very much for answering my questions😊!

AILab-CVC / YOLO-World

Some questions about training a custom dataset #398