Hi @Seongwoong-sk, thanks for your interest in YOLO-World! As a detection task, the labels (categories or noun phrases) and boxes are necessary. For fine-tuning YOLO-World on your custom dataset, you need to provide the category label for each object (bounding box). However, fine-tuning YOLO-World differs from fine-tuning a supervised object detector in two main aspects: (1) YOLO-World provides better pre-trained models, which are pre-trained on large-scale datasets and offer better generalization ability and language understanding capabilities. (2) YOLO-World supports more types of text inputs, e.g., fixed categories, captions, or noun phrases.
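For concreteness, here is a rough sketch of what "a label per box" can look like when the labels are free-form texts rather than fixed class names. The COCO-style layout and the file names below are assumptions for illustration only; check the repo's data-preparation docs for the exact format the training configs expect.

```python
# Illustrative only: a minimal COCO-style annotation dict in which each box
# points to a category, and the category "names" are free-form texts
# (fixed categories, noun phrases, or caption-like descriptions).
# The exact format expected by the YOLO-World training configs may differ.
custom_annotations = {
    "images": [
        {"id": 0, "file_name": "street_001.jpg", "width": 1280, "height": 720},
    ],
    "categories": [
        {"id": 0, "name": "person"},                        # fixed category
        {"id": 1, "name": "red car"},                       # noun phrase
        {"id": 2, "name": "a child wearing a blue shirt"},  # caption-like text
    ],
    "annotations": [
        {"id": 0, "image_id": 0, "category_id": 1, "bbox": [100, 200, 180, 90]},
        {"id": 1, "image_id": 0, "category_id": 0, "bbox": [400, 150, 60, 160]},
    ],
}
```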
@wondervictor I really appreciate your help!! By the way, could I ask a couple more questions, if you don't mind?
I have two questions. [Q1] In relation to aspect (1), I have a question about the pre-trained YOLO-World model. If I were to fine-tune YOLO-World solely on my custom dataset starting from the pre-trained YOLO-World model, would catastrophic forgetting be a concern? If so, is it acceptable to fine-tune this model only on my custom dataset, or should I also include the large-scale dataset during fine-tuning? I am not entirely clear on this matter.
[Q2] Regarding the captions you mentioned in aspect (2), I'm still uncertain about the difference between the text inputs (e.g., fixed categories, captions, or noun phrases) of YOLO-World and those of the original supervised YOLO model. I mean, as far as I know, when providing a category label, the original supervised YOLO model can also use captions or noun phrases as label names. So, what would be the difference between YOLO-World and a normal YOLO in terms of category labeling?
Thank you in advance!! YOLO-World is a remarkable contribution to the field of Computer Vision. 😎
Hi @Seongwoong-sk, it's my pleasure!
Regarding [Q1]: Fine-tuning on custom datasets will affect the zero-shot ability of YOLO-World. As you said, the catastrophic forgetting problem exists. Fine-tuning on your custom dataset aims to improve YOLO-World's performance on that task. How severe the catastrophic forgetting is also depends on the custom dataset: if you use data from general scenarios (consistent with the pre-training datasets), the zero-shot ability will be maintained. If the data comes from a narrow domain or deviates significantly from general scenarios, the zero-shot ability may be reduced. If you hope to efficiently fine-tune YOLO-World while keeping its zero-shot ability, we suggest fine-tuning only a few layers for fewer epochs, or trying to incorporate LoRA. Using large-scale datasets for fine-tuning requires large amounts of computational resources, which is not recommended.
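A minimal sketch of the "fine-tune only a few layers" suggestion, assuming a generic PyTorch module whose detection-head parameters contain "head" in their names (adjust the keyword to the real module names in your checkpoint). LoRA adapters, e.g. via the `peft` library, would be the other option mentioned above.

```python
import torch

def freeze_all_but(model: torch.nn.Module, keyword: str = "head"):
    """Freeze every parameter whose name does not contain `keyword`.

    Returns the parameters left trainable, ready to hand to an optimizer.
    """
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = keyword in name
        if param.requires_grad:
            trainable.append(param)
    return trainable

# Hypothetical usage -- `model` is your loaded YOLO-World checkpoint:
# trainable_params = freeze_all_but(model, keyword="head")
# optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)
# ...then train for only a few epochs to limit catastrophic forgetting.
```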
Regarding [Q2]:
Firstly, you can define your custom vocabulary including normal categories (e.g., car, person, dog), noun phrases (red car, green car, black dog, white dog), and captions (a child wearing a blue shirt).
Secondly,
(a) fine-tuning a normal detector: you need to treat the vocabulary as a set of categories, such as `[person, red car, green car, black dog, white dog, a child wearing a blue shirt]`, which is a 6-class classification task. Indeed, you can detect those 6 classes after you fine-tune a normal detector, e.g., YOLOv8. However, there are two problems:
- (1) the normal detector has no zero-shot ability: it cannot detect `a grey dog` or `a blue car`, which are beyond the pre-defined 6-class classification task.
- (2) the normal detector has no language understanding. It just treats `red car` and `green car` as different categories but does not know the differences and connections between the two similar classes. We believe a language model (text encoder) empowers YOLO with language understanding. In addition, the two classes `red car` and `green car` are similar, which makes the classification branch of a normal detector hard to optimize.
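To make the "language understanding" point concrete, here is a small sketch with a standalone CLIP text encoder (the public `openai/clip-vit-base-patch32` weights via `transformers`, not YOLO-World's own encoder): related phrases such as `red car` and `green car` land close together in embedding space, while `black dog` sits farther away, which is exactly the kind of relationship a closed-set classifier never sees.

```python
# Sketch: related phrases get related text embeddings.
# Uses a public CLIP text encoder from Hugging Face `transformers`,
# not YOLO-World's own weights.
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

name = "openai/clip-vit-base-patch32"
tokenizer = CLIPTokenizer.from_pretrained(name)
text_encoder = CLIPTextModelWithProjection.from_pretrained(name)

phrases = ["red car", "green car", "black dog"]
inputs = tokenizer(phrases, padding=True, return_tensors="pt")
with torch.no_grad():
    embeds = text_encoder(**inputs).text_embeds        # shape (3, 512)
embeds = torch.nn.functional.normalize(embeds, dim=-1)
print(embeds @ embeds.T)  # "red car" vs "green car" is much closer than either vs "black dog"
```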
However, those problems do not exist in YOLO-World:
(b) fine-tuning YOLO-World: you can define your vocabulary or directly assign each box an arbitrary text (a category, noun phrase, or caption):
- (1) the fine-tuned YOLO-World still has zero-shot ability: trained with `[person, red car, green car, black dog, white dog, a child wearing a blue shirt]`, it can, for example, still detect objects with `[a red shirt, a black car, a blue car]` (see the quick sketch below).
- (2) YOLO-World has language understanding. It does not rely on a fixed classification head and can separate `green car` and `red car` without ambiguity.
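If you just want to poke at this behaviour interactively, the Ultralytics wrapper around YOLO-World (a separate integration from this repo's training pipeline) exposes a `set_classes` call; the weight file and image path below are placeholders.

```python
# Quick zero-shot check of the custom-vocabulary behaviour via the
# Ultralytics YOLO-World wrapper (separate from this repo).
from ultralytics import YOLOWorld

model = YOLOWorld("yolov8s-worldv2.pt")                          # pre-trained open-vocabulary weights
model.set_classes(["a red shirt", "a black car", "a blue car"])  # arbitrary texts, not trained classes
results = model.predict("street_001.jpg")                        # placeholder image path
results[0].show()
```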
@wondervictor I would like to express my sincere gratitude for your helpful and precise answer.
Thanks to you, I now have a much clearer understanding of my questions.
I'll give it a try following your advice!! (I might come back to ask more later 😝)
Hope you have a nice day!
Thanks for your interest 😄. If you have any questions about YOLO-World in the future, you're welcome to open a new issue.
As you said: (1) the fine-tuned YOLO-World still has zero-shot ability: trained with `[person, red car, green car, black dog, white dog, a child wearing a blue shirt]`, it can, for example, detect objects with `[a red shirt, a black car, a blue car]`.
So, can the fine-tuned YOLO-World detect `[airplane, cellphone]`?
Hello. I've got a question regarding the article about fine-tuning on a custom dataset.
I would like to fine-tune YOLO-World on my custom dataset.
Here, I am just wondering whether I have to name my custom classes for fine-tuning.
If so, how would it differ from ordinary supervised object detection, which also requires labeling custom classes?
Could anyone provide assistance with this? Thank you in advance :)