zhouzheng1123 opened 1 year ago
Thank you for your interest in our work.
I had originally not planned to release the data processing code, as the process is somewhat involved. However, given the interest in data processing in other issues as well, I have recently added the notebooks that were used to generate the labels.
It is important to note that the data processing pipeline will vary depending on the dataset, but the general process will be:
Note that the current state of these processing notebooks may not work perfectly with the current code release or the dataset you have in mind, as I have not tested them again recently.
Also, if you intend to use this for detection in the wild (either for deployment or just your own experimentation), I recommend finding as large and diverse a training dataset as possible. Additionally, OpenAI has released checkpoints for better CLIP models since this work was originally conducted, so I suggest using a newer encoder for improved results.
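Since the embeddings come from a frozen encoder, they can be computed once before training and cached. Below is a minimal sketch of that offline precomputation pattern; the `encode_crop` function here is a deterministic stand-in for a real CLIP image tower (which you would load separately and run over bbox crops in batches), and all names are illustrative, not from the repository.

```python
from typing import Dict, List, Tuple

# Stand-in for a frozen CLIP image tower. In practice you would load a
# real encoder and run each ground-truth bbox crop through it; the point
# here is only the cache structure, not the encoder itself.
def encode_crop(image_id: str, bbox: Tuple[int, int, int, int]) -> List[float]:
    # Deterministic placeholder "embedding" derived from the bbox geometry.
    x1, y1, x2, y2 = bbox
    return [float(x1), float(y1), float(x2 - x1), float(y2 - y1)]

def precompute_embeddings(
    annotations: Dict[str, List[Tuple[int, int, int, int]]],
) -> Dict[Tuple[str, int], List[float]]:
    """Run every ground-truth bbox through the frozen encoder exactly once,
    before training, and cache the result keyed by (image_id, bbox index)."""
    cache: Dict[Tuple[str, int], List[float]] = {}
    for image_id, boxes in annotations.items():
        for i, bbox in enumerate(boxes):
            cache[(image_id, i)] = encode_crop(image_id, bbox)
    return cache

# Hypothetical annotations: one image with two labeled boxes.
anns = {"img_001": [(10, 20, 110, 220), (30, 40, 90, 160)]}
cache = precompute_embeddings(anns)
print(len(cache))  # one cached embedding per labeled bbox
```

During training, the target embedding for a bbox is then a dictionary lookup rather than a forward pass through the CLIP tower.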
Let me know if you have any other questions!
Hello author, I would like to ask why the image encoder outputs can be computed offline in advance. The training process applies data augmentation to the images, so the image crop (and therefore the embedding) for each bbox should change, right?
I would like to ask how to train on my own dataset and what I should do. Looking forward to your answer!
Hello, may I ask you about your training setup? If possible, I would like to exchange contact information.
@hhaAndroid The image encoder used to generate image embeddings is a frozen CLIP image tower. There is no change to those weights over the period of training. The weights that do change are the YOLOv5 weights which attempt to estimate the CLIP weights by aligning a semantic vector output with the vector produced by the CLIP image tower (and the language alignment task as well).