XieKaiwen opened this issue 5 months ago
Hi @XieKaiwen, I'll reply to you as soon as possible. Indeed, it's an important issue.
Hi @wondervictor ... Any update regarding this? Thanks!
Hi fellow TIL 2024 participant,
Two things I can point out from the info you provided.
As an add-on, the folks at Ultralytics have implemented YOLO-World v2 training in a much easier-to-use way. You can check that out. This repository is more for academic use, which could explain its general lack of documentation, but it offers much higher flexibility, which you probably don't need yet.
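For reference, the Ultralytics route looks roughly like the sketch below. I'm quoting their API from memory, so treat the weight name, data YAML and train arguments as assumptions and check their docs for the exact, current usage.

```python
# Minimal sketch of the Ultralytics YOLO-World v2 workflow (names/arguments from
# memory; verify against the Ultralytics docs before relying on them).
from ultralytics import YOLOWorld

model = YOLOWorld("yolov8s-worldv2.pt")  # pretrained YOLO-World v2 weights

# Open-vocabulary inference: set the text prompts, then predict as usual.
model.set_classes(["plane", "helicopter", "missile"])
results = model.predict("example.jpg")

# Fine-tuning on a custom dataset described by a standard Ultralytics data YAML.
model.train(data="my_dataset.yaml", epochs=20, imgsz=640)
```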
How do you define your custom dataset and what method do you use to add the caption?
@CZ999-07 I followed the format of the LVIS dataset. For the captions, I merged the names of all the different objects in the image into one sentence, then used the negative tokens and positive tokens to indicate each one in the annotations.
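To make that concrete, here is a rough illustration of what one grounding-style entry can look like. The field names follow the common GoldG/flickr mixed-grounding convention; the exact keys your files need may differ, so treat this as a sketch of the idea, not the definitive schema.

```python
# Sketch of one image entry plus one annotation in a mixed-grounding style COCO JSON.
# "tokens_positive" holds character spans into the caption that name this object;
# the rest of the caption acts as negative text for that annotation.
image_entry = {
    "id": 1,
    "file_name": "000001.jpg",
    "height": 896,
    "width": 1536,
    "caption": "plane . helicopter . missile .",
}

annotation_entry = {
    "id": 10,
    "image_id": 1,
    "bbox": [120.0, 45.0, 300.0, 180.0],  # COCO xywh
    "area": 54000.0,
    "category_id": 1,
    "iscrowd": 0,
    "tokens_positive": [[0, 5]],          # "plane" occupies characters 0..5 of the caption
}
```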
The custom dataset I used for fine-tuning is the traditional YOLOv5 format converted to COCO format; nothing else has changed. I don't have captions for it. I want to give the model some linguistic cues so it is more accurate when reasoning, so I think the dataset should have captions. Is it one caption per category, or one for all targets in a picture? For example, for a picture of an apple tree with a lot of fruit on it, is there just one caption for the apple category, or one for every apple? Also, is there any script needed to add captions to a dataset? I am a novice in this area, thanks for answering.
@CZ999-07 Your captions really depend on how you want to use the model. I merged all the names of the different classes into one sentence, e.g. "plane . helicopter . missile .", because the only thing I was doing was open-set detection and nothing beyond that; I did not have to include any complex reasoning in my captions, e.g. "upside down plane" or anything like that.
So in your apple example, if you want to detect each apple individually, you can follow my format, which is just "apple". But let's say you only want to detect the apples on the tree; then your caption needs to be more complicated, like "apples on the tree" instead of just "apple". In the end it just depends on what you want to do with your model. So if you want a bounding box over every apple in your picture, you just need to use "apple" as the text prompt.
If you want the model to take in more of the linguistic cues, you can also experiment with the text threshold hyperparameter (in one of the config files, if I remember correctly). I think the negative tokens and positive tokens you indicate for each annotation will also affect how the model learns.
As for a script to add captions to a dataset: if your captions follow a very fixed structure, you can consider writing code for it (see the sketch below), but if you don't think that is feasible, then manual annotation might be needed.
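If your captions do follow a fixed structure (class names joined by " . " like mine), a small script can generate both the caption and the positive-token character spans in one pass. A rough sketch, with a hypothetical helper name and the same field-name caveats as the annotation example above:

```python
# Sketch: build a "name . name . name ." caption from per-image class names and
# record the character span of each name for use as tokens_positive.
from typing import List, Tuple

def build_caption(class_names: List[str]) -> Tuple[str, List[List[int]]]:
    caption = ""
    spans = []
    for name in class_names:
        start = len(caption)
        caption += name
        spans.append([start, len(caption)])  # [start, end) span of this name in the caption
        caption += " . "
    return caption.strip(), spans

caption, spans = build_caption(["plane", "helicopter", "missile"])
# caption -> "plane . helicopter . missile ."
# spans   -> [[0, 5], [8, 18], [21, 28]]
```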
I will insert a snippet of my dataset below to show you what the format is like, because I also had problems with it when I was working on this last time. You have to go and find the flickr dataset yourself, because the link in the GitHub repo does not work (only the link on the Ultralytics page worked). You can reference the flickr dataset format to create your dataset.
You can find the flickr dataset under the prepare-datasets section of their docs.
Note: I do not know if open-set detection here has been fixed yet, because it seems there is an issue with it that the repo author is working on right now as well.
Did you use the yolo_world integrated into YOLOv8 (Ultralytics), or the AILab source code?
@CZ999-07 I used the code from this GitHub repo.
Hi, no offense at all to any of the authors of this GitHub page, who have worked very hard to help me with my questions and also worked very hard on this project. I just want to suggest that there should be a more detailed guide / more resources on this page for people who want to train for open-set detection (for closed-set detection the support seems quite sufficient). For example, the example fine-tuning config file given is for the COCO dataset, which is usually used for closed-set detection and not for MixedGroundingDatasets. A lot of the settings in the config files in this repo are also not tailored strictly to open-set fine-tuning. This has made my experience of trying to leverage your hard work relatively harder and shakier, because I am uncertain about exactly what to do to shape it to my own usage.
That being said, the reason I am rather confused and lost doing fine-tuning with a MixedGroundingDataset is my strange results (summary below).
Here is an example of the comparison between the pretrained model and my finetuned model on the same image with the same caption.
Pretrained Model:
Finetuned Model:
Original image:
Caption used: "blue and white commercial aircraft . red, white, and blue fighter jet . white, black, and grey missile . white drone . black fighter jet . red and white missile . " - I split the class names at inference time by using "." instead of ",".
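For context, this is roughly how the caption gets turned back into a class list at inference. The exact prompt structure the demo scripts expect (e.g. a list of single-element lists with a trailing padding entry) may vary between scripts, so verify against the one you actually run.

```python
# Sketch: split the "."-separated caption into per-class text prompts for inference.
caption = ("blue and white commercial aircraft . red, white, and blue fighter jet . "
           "white, black, and grey missile . white drone . black fighter jet . "
           "red and white missile . ")

class_names = [c.strip() for c in caption.split(".") if c.strip()]
# -> ['blue and white commercial aircraft', 'red, white, and blue fighter jet', ...]

# Many demo scripts expect a list of lists, often with a trailing padding entry;
# adjust this to whatever the script you run actually expects.
texts = [[name] for name in class_names] + [[" "]]
```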
Problem: It can be seen that the bounding boxes have indeed improved, and their precision has improved. But I am quite confused, because each item in this picture should have a unique label. Instead, some are classified wrongly, and not only wrongly but with a CONFIDENCE of 1.0, which makes it impossible to do any general post-processing either. This fine-tuned model was trained for 5 epochs on my MixedGroundingDataset (a dataset example is below).
My configuration file:
I removed the test and validation code from the configuration file because I do not have a validation dataset. I set the image_scale property to (1536, 896) to match my images (I added padding).
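For reference, the training-dataset part of my config looked roughly like the sketch below. It is modeled on the mixed-grounding entries in the repo's pretraining configs as I remember them; the dataset type name, paths and dataloader fields here are illustrative rather than copied from my actual file, so double-check them against the configs shipped with the repo.

```python
# Sketch of an mmengine-style train dataset entry for a mixed-grounding dataset.
# train_pipeline is assumed to be defined earlier in the config, as in the repo's examples.
img_scale = (1536, 896)  # width, height after padding, matching my images

my_grounding_dataset = dict(
    type='YOLOv5MixedGroundingDataset',
    data_root='data/my_dataset/',
    ann_file='annotations/train.json',   # COCO-style JSON with captions + tokens_positive
    data_prefix=dict(img='images/'),
    filter_cfg=dict(filter_empty_gt=False, min_size=32),
    pipeline=train_pipeline,
)

train_dataloader = dict(
    batch_size=8,
    num_workers=4,
    dataset=my_grounding_dataset,
)
```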
Below are examples of my dataset:
Below are the logs for my training epochs (1st vs 5th)
As can be seen, the loss trends downward, which is a good sign (supported by the bounding boxes being of better quality). But that still does not explain why the confidence of almost every prediction is so close to 1.0, with low-confidence bboxes being very rare in the predictions.
Hence I am wondering whether I messed up my training process, my configuration file, or my dataset, or whether I just have not trained the model enough (I previously did 20-epoch and 30-epoch trainings as well, but the same thing seemed to happen; I used 5 epochs this time because I am experimenting to try to solve this issue).
Summary: Training for various numbers of epochs all leads to questionable results: bounding boxes with suspicious confidence values (even when misclassified). I am asking for help in any way or form to diagnose the root cause of this issue; I personally suspect I wrongly edited something in my configuration file or dataset, or simply didn't use good enough hyperparameters when training. THANKS FOR ANY HELP PROVIDED, IT IS VERY APPRECIATED.