Support for (grounded) image captioning

cvat-ai / cvat

Support for (grounded) image captioning #4046

Open jbohnslav opened 2 years ago

jbohnslav commented 2 years ago

My actions before raising this issue

[x] Read/searched the docs
[x] Searched past issues

Feature request: there has been a lot of recent work on combined image-language models. It would be very convenient to be able to annotate captions for any of the following:

individual images
video frames
bounding boxes (grounded image captioning)

Entering a caption would simply require a free-form text input box.

nmanovic commented 2 years ago

@jbohnslav, thanks for the feature request. This is already possible by adding a text attribute to a label, but I agree that it isn't very convenient. Let's try to optimize the pipeline. Let us know if you can contribute the feature.
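
For readers looking for the text-attribute workaround mentioned above, here is a minimal sketch of what such a label definition might look like. It assumes the JSON shape accepted by CVAT's raw label editor (`name`, `attributes`, `input_type`, `mutable`, `values`), written from memory rather than checked against a specific CVAT version. The helper `make_caption_label` and the names `region` and `caption` are hypothetical.

```python
import json


def make_caption_label(label_name: str, attribute_name: str = "caption") -> dict:
    """Build a label spec with a single free-form text attribute.

    Assumption: the keys below follow the format of CVAT's raw label
    editor; verify them against your CVAT version before relying on them.
    """
    return {
        "name": label_name,  # hypothetical label name chosen by the caller
        "attributes": [
            {
                "name": attribute_name,
                "input_type": "text",  # rendered as a free-form text field
                "mutable": True,       # allow the caption to differ per video frame
                "values": [""],        # empty default caption
            }
        ],
    }


# One label whose shapes (e.g. bounding boxes) each carry a caption.
labels = [make_caption_label("region")]
print(json.dumps(labels, indent=2))
```

The printed JSON could then be pasted into the raw label editor when creating a task. Each bounding box drawn with that label would carry its own caption text, which roughly covers the grounded-captioning case. Whole-image captions could presumably use the same attribute on a tag instead of a box.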