This adds (roughly) the image-conditioned detection feature from the original OWL-ViT repo, in which you provide example images of the objects you want to detect.
The difference from the original OWL-ViT feature is that here you also include one or more text prompts with each query image, and the model uses them to pick the correct embedding for that query image. The original OWL-ViT had utility functions with heuristics to find the best embedding automatically; I tried incorporating those, but found this method much more reliable.
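To make the selection step concrete, here is a rough sketch of the idea using the Hugging Face OwlViT classes: score the query image's patch embeddings against the text prompt(s) and keep the best-matching patch embedding as the query. The helper name (`select_query_embedding`) and the exact fields used are illustrative assumptions, not necessarily this PR's API.

```python
# Sketch: use text prompt(s) to pick the query embedding from a query image.
# Assumes the Hugging Face OwlViT implementation; names here are illustrative.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

def select_query_embedding(query_image: Image.Image, text_prompts: list[str]) -> torch.Tensor:
    """Return the query-image patch embedding that best matches the text prompts."""
    inputs = processor(text=[text_prompts], images=query_image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # logits: (1, num_patches, num_prompts); class_embeds: (1, num_patches, dim)
    scores = outputs.logits[0].max(dim=-1).values  # best prompt score per patch
    best_patch = scores.argmax()                   # patch most aligned with the prompts
    return outputs.class_embeds[0, best_patch]     # use this embedding as the image query
```

The selected embedding can then be scored against the target image's patch embeddings, much like the existing image-guided detection path does with its heuristically chosen embedding.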