cvat-ai / cvat

Annotate better with CVAT, the industry-leading data engine for machine learning. Used and trusted by teams at any scale, for data of any scale.
https://cvat.ai
MIT License

Support for (grounded) image captioning #4046

Open jbohnslav opened 2 years ago

jbohnslav commented 2 years ago


Feature request: there has been a lot of recent work on combined image-language models. It would be very convenient to be able to annotate captions for any of the following:

  1. individual images
  2. video frames
  3. bounding boxes (grounded image captioning)

A caption would simply be entered through a free-form text input box.

nmanovic commented 2 years ago

@jbohnslav, thanks for the feature request. This is already possible by adding a text attribute to a label, but I agree that it isn't very convenient. Let's try to streamline the workflow. Let us know if you can contribute the feature.
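
For readers who want to try the text-attribute workaround, here is a minimal sketch of what such a label definition might look like. It assumes the field names used by CVAT's raw label editor (`name`, `attributes`, `input_type`, `mutable`, `values`), which may differ between versions. The helper `caption_label` and the label names `image` and `object` are hypothetical and only serve the illustration.

```python
import json


def caption_label(name: str, attribute_name: str = "caption") -> dict:
    """Build one label entry that carries a free-form text attribute.

    The field names are an assumption about CVAT's raw label format;
    check them against the version you are running.
    """
    return {
        "name": name,
        "attributes": [
            {
                "name": attribute_name,
                "input_type": "text",  # free-form text box
                "mutable": True,       # value may change from frame to frame
                "values": [""],        # empty default caption
            }
        ],
    }


# Hypothetical labels: one for whole images or frames, one for boxes.
labels = [caption_label("image"), caption_label("object")]

# JSON that could be pasted into the raw label editor.
print(json.dumps(labels, indent=2))
```

Under these assumptions, the `object` label would be drawn as a bounding box and captioned through its attribute, while the `image` label would be attached to the whole image or frame. This covers the three cases from the request, although each caption still has to be typed into the attribute panel.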