Thanks for the great code. I ran into an issue when using GroundingDINO (or maybe this is expected behavior?).
If I use a long word such as 'pottedplant', it is tokenized into several sub-words.
When the output bounding boxes are generated, some of those sub-words are dropped, so the generated label is incomplete. My guess is that cross-attention operates at the token level, so the scores of some sub-words fall below the text threshold.
For example, 'pottedplant' is split into 'pot', 'ted', 'pl', 'ant', and some boxes get incomplete labels such as 'potted' or 'pottedpl'.
Is there any solution for this?
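
One workaround I can imagine, as a minimal sketch and not anything taken from the repository: treat each prompt phrase as a unit and emit the full phrase whenever any of its sub-word tokens scores above the text threshold. The function name `snap_to_phrases`, the `phrase_spans` mapping, and the scores below are all hypothetical and made up for illustration; they are not part of the library's API.

```python
from typing import Dict, List, Sequence, Tuple


def snap_to_phrases(
    token_scores: Sequence[float],
    phrase_spans: Dict[str, Tuple[int, int]],
    text_threshold: float,
) -> List[str]:
    """Return every phrase that has at least one sub-word token above the threshold.

    token_scores: per-token scores for a single box (hypothetical input).
    phrase_spans: phrase -> (start, end) token indices, end exclusive (hypothetical input).
    """
    labels = []
    for phrase, (start, end) in phrase_spans.items():
        # Emit the whole phrase instead of joining only the surviving sub-words.
        if any(score > text_threshold for score in token_scores[start:end]):
            labels.append(phrase)
    return labels


# Made-up scores for the tokens 'pot', 'ted', 'pl', 'ant', '.', 'cat'.
scores = [0.40, 0.30, 0.10, 0.05, 0.00, 0.02]
spans = {"pottedplant": (0, 4), "cat": (5, 6)}
print(snap_to_phrases(scores, spans, text_threshold=0.25))
```

With these made-up scores only 'pot' and 'ted' pass the threshold, yet the box is labeled with the full phrase instead of 'potted'. Using "any token passes" is just one possible rule; taking the mean or maximum score over the span would be alternatives.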