Can Grounding DINO be used for image conditioned one-shot or few-shot detection like this?
The image is from OWL-ViT, which can be prompted with either text or an image patch. I feel that DINO should be able to do the same thing, and likely perform better?
Please check out NIDS-Net. It uses Grounding DINO to handle these tasks and can detect objects from just one or a few template images. Moreover, it may not need any training.
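To give a rough idea of how training-free, template-based detection can work in general, here is a minimal sketch. It is not NIDS-Net's actual code or API. It assumes you already have an embedding vector for each candidate box (for example, from crops of a detector's proposals) and for each template image; the function names, the toy vectors, and the 0.5 threshold are all made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_proposals(proposal_embs, template_embs, threshold=0.5):
    """Label each proposal with its most similar template (hypothetical helper).

    proposal_embs: list of vectors, one per candidate box.
    template_embs: dict mapping label -> list of template vectors (one or a few).
    Returns one (label_or_None, best_score) pair per proposal.
    """
    results = []
    for p in proposal_embs:
        best_label, best_score = None, -1.0
        for label, temps in template_embs.items():
            # With several templates per object, keep the closest one.
            score = max(cosine(p, t) for t in temps)
            if score > best_score:
                best_label, best_score = label, score
        # Reject proposals that do not resemble any template well enough.
        if best_score < threshold:
            best_label = None
        results.append((best_label, best_score))
    return results


# Toy embeddings standing in for real image features (assumed values).
templates = {
    "mug": [[1.0, 0.0, 0.0]],
    "stapler": [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],
}
proposals = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
matches = match_proposals(proposals, templates)
```

Nothing here is learned: adding a new object only means adding its template embeddings, which is why this style of matching can work with one or a few examples. In practice the quality depends almost entirely on the proposals and the embedding model.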