IDEA-Research / Grounded-Segment-Anything

Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment and Generate Anything
https://arxiv.org/abs/2401.14159
Apache License 2.0

Fine-tune with bounding box data only? #447


jamesheatonrdm commented 8 months ago

I have a large dataset of which only a small portion is labelled.

I wish to use Grounded-SAM to speed up the labelling process through automated labelling.

I want bounding box data, so I have been attempting to use text prompts to obtain the relevant bounding boxes within each image.

However, out of the box, the performance with text prompts is quite poor. For example, I have images of metal containers on trailers. Prompts such as 'metal container' or 'trailer bed' produce a number of outputs, all of which are incorrect: either the entire container + trailer combination is simply labelled as the trailer, or the trailer is missed and only the container is labelled, or the container is mislabelled as a trailer.
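For reference, this is roughly the call I am making (a minimal sketch using the groundingdino inference helpers shipped with this repo; the paths, thresholds, and prompt text are placeholders, not my exact values):

```python
# Minimal sketch of the text-prompt -> boxes step using the groundingdino
# inference helpers. Paths and threshold values are placeholders.
from groundingdino.util.inference import load_model, load_image, predict

model = load_model(
    "GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "groundingdino_swint_ogc.pth",
)
image_source, image = load_image("container_on_trailer.jpg")

# Prompt format matters: lowercase phrases separated by " . ".
# Lowering the thresholds returns more candidate boxes to filter manually.
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="metal container . trailer bed .",
    box_threshold=0.30,
    text_threshold=0.20,
)
print(phrases, boxes)  # boxes are cxcywh, normalized to [0, 1]
```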

I have tried the demo here (https://segment-anything.com/demo), and the segmentation itself is fine: the model can clearly differentiate between the objects as abstract segments. It only fails when I provide text prompts.

Now I was wondering whether it is possible to fine-tune the model using the existing labelled data I have. Is it possible to train a model such as this using only bounding-box annotations rather than pixel masks?
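For context on why box-only annotations might suffice: SAM can already turn a ground-truth box prompt into a mask without any fine-tuning, so I believe only the Grounding DINO side would need training. A minimal sketch of box-prompted SAM (the checkpoint path and box coordinates are placeholders):

```python
# Sketch: SAM generates a mask from an existing bounding-box annotation,
# with no fine-tuning needed on the segmentation side. Placeholder paths.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("container_on_trailer.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# One of my existing bounding-box annotations, in xyxy pixel coordinates.
box = np.array([100, 150, 620, 480])
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
print(masks.shape, scores)  # (1, H, W) boolean mask and its quality score
```

If that is right, the question reduces to whether Grounding DINO can be fine-tuned on my vocabulary with box supervision alone.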