Johnathan-Xie / ZSD-YOLO

GNU General Public License v3.0

How to train my own datasets? #3

Open zhouzheng1123 opened 1 year ago

zhouzheng1123 commented 1 year ago

I would like to ask how to train on my own dataset. What should I do? Looking forward to your answer!

Johnathan-Xie commented 1 year ago

Thank you for your interest in our work.

I had originally not planned to release the data processing code, as the process is somewhat complicated. However, given the interest in data processing in other issues as well, I have recently added the notebooks that were used to generate the labels.

It is important to note that the data processing pipeline will vary depending on the dataset, but the general process is:

  1. Process your object detection dataset into YOLOv5 format. It may have changed since the creation of this repo, but this repository should help with conversion: https://github.com/ultralytics/JSON2YOLO (there is a short label-format sketch after this list).
  2. Generate instance-wise labels and embeddings in the ZSD-YOLO format using the data-generation.ipynb notebook (see the per-box embedding sketch after this list). Note that the annot_folder argument when creating the data loader is where you pass the name of the label folder (which should sit adjacent to the image folder). This generation notebook also requires a regularly trained YOLO detection model for the self-labeling step, which we describe in our paper. Though our paper used a version specially trained on a COCO subset for benchmarking, simply using the official YOLOv5 release checkpoints will likely work better (we could not use those checkpoints for our paper due to benchmarking fairness).
  3. Generate text embeddings using the text-embedding-generation.ipynb notebook (see the text-embedding sketch after this list).
  4. Create a metadata YAML file that contains the information necessary for training the model; examples of these YAML files can be found in the Kaggle release as well.
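For step 1, the YOLOv5 label format is one text file per image with one line per object: a class index followed by the normalized box center and size. A minimal sketch of converting a COCO-style pixel box into such a line is below (the helper name is mine, not from the repo or from JSON2YOLO):

```python
# Convert a COCO-style pixel box [x_min, y_min, width, height] into a YOLOv5
# label line: "class x_center y_center width height", all normalized to [0, 1].
def coco_box_to_yolo_line(class_id, box, img_w, img_h):
    x_min, y_min, w, h = box
    x_c = (x_min + w / 2) / img_w
    y_c = (y_min + h / 2) / img_h
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w / img_w:.6f} {h / img_h:.6f}"

# Example: a 100x50 box at (200, 150) in a 640x480 image, class index 3.
print(coco_box_to_yolo_line(3, [200, 150, 100, 50], 640, 480))
# -> "3 0.390625 0.364583 0.156250 0.104167"
```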
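For step 2, the data-generation.ipynb notebook is the reference; the sketch below only illustrates the core idea of embedding each labeled box crop with a frozen CLIP image tower (using the OpenAI CLIP package, https://github.com/openai/CLIP). The file layout, crop handling, and the exact embedding storage format that ZSD-YOLO expects are assumptions here, so check them against the notebook:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # frozen CLIP image tower

@torch.no_grad()
def embed_boxes(image_path, yolo_label_path):
    """Embed each labeled box crop with CLIP; returns one embedding per label line."""
    image = Image.open(image_path).convert("RGB")
    img_w, img_h = image.size
    embeddings = []
    for line in open(yolo_label_path):
        cls, xc, yc, w, h = map(float, line.split())
        # Denormalize the YOLO coordinates back to pixel corners before cropping.
        left = int((xc - w / 2) * img_w)
        top = int((yc - h / 2) * img_h)
        right = int((xc + w / 2) * img_w)
        bottom = int((yc + h / 2) * img_h)
        crop = preprocess(image.crop((left, top, right, bottom))).unsqueeze(0).to(device)
        feat = model.encode_image(crop)
        embeddings.append(feat / feat.norm(dim=-1, keepdim=True))  # unit-normalize
    return torch.cat(embeddings) if embeddings else torch.empty(0, 512)
```

Self-labeling with a standard detector slots in before this, e.g. running a pretrained YOLOv5 checkpoint over each image and appending its confident predictions to the label file; whether those pseudo-labels need extra filtering to match the repo's format is something to verify against the notebook.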
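For step 3, the text-embedding-generation.ipynb notebook is authoritative; roughly, the class names are encoded with the CLIP text tower. In the sketch below the prompt template and output path are assumptions:

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

class_names = ["person", "bicycle", "car"]  # replace with your dataset's class names
prompts = [f"a photo of a {name}" for name in class_names]  # prompt template is an assumption

with torch.no_grad():
    tokens = clip.tokenize(prompts).to(device)
    text_emb = model.encode_text(tokens)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)  # unit-normalize

torch.save(text_emb.cpu(), "class_text_embeddings.pt")  # illustrative output path
```

Using a larger released CLIP checkpoint such as "ViT-L/14" only requires changing the model name here, though the embedding dimension grows from 512 to 768, so the image and text embeddings must be regenerated together.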

Note that these processing notebooks may not work perfectly with the current code release or with the dataset you have in mind, as I have not re-tested them recently.

Also, if you intend to use this for detection in the wild (either for deployment or just your own experimentation), I recommend finding as large and diverse a training dataset as possible. Additionally, OpenAI has released better CLIP checkpoints since this work was originally conducted, so I suggest using a stronger encoder for improved results.

Let me know if you have any other questions!

hhaAndroid commented 1 year ago

Hello author, I would like to ask why the image embeddings can be computed offline in advance. Training applies data augmentation to the images, so shouldn't the embedding of each bbox change at that point?

Lz-Sadada commented 1 year ago

I would like to ask how to train my own dataset and what I should do? Looking forward to your answer!

Hello, may I ask how your training went? If possible, I would like to get your contact information.

Johnathan-Xie commented 1 year ago

@hhaAndroid The image encoder used to generate the image embeddings is a frozen CLIP image tower; those weights do not change over the course of training, which is why the embeddings can be computed offline. The weights that do change are the YOLOv5 weights, which learn to estimate the CLIP image embeddings by aligning a predicted semantic vector with the vector produced by the CLIP image tower (alongside the language alignment task).
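For anyone else reading this, here is a minimal sketch of what that alignment can look like: one term pulls the detector's predicted semantic vector toward the precomputed CLIP image embedding, and one term classifies it against the class text embeddings. The exact distance, term weighting, and box matching used by ZSD-YOLO may differ from this sketch; the loss code in the repo is authoritative.

```python
import torch
import torch.nn.functional as F

def alignment_loss(pred_embed, clip_image_embed, text_embeds, target_cls, temperature=0.01):
    """Sketch of the two alignment objectives described above.

    pred_embed:       (N, D) semantic vectors predicted by the detector for N matched boxes
    clip_image_embed: (N, D) precomputed (offline) frozen CLIP image-tower embeddings
    text_embeds:      (C, D) CLIP text embeddings for the C class names
    target_cls:       (N,)   ground-truth class indices
    """
    pred = F.normalize(pred_embed, dim=-1)
    img_tgt = F.normalize(clip_image_embed, dim=-1)
    txt = F.normalize(text_embeds, dim=-1)

    # 1) image alignment: push the predicted vector toward the frozen CLIP image embedding
    image_align = (1 - (pred * img_tgt).sum(dim=-1)).mean()

    # 2) language alignment: classify by similarity to the class text embeddings
    logits = pred @ txt.t() / temperature
    lang_align = F.cross_entropy(logits, target_cls)

    return image_align + lang_align
```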