We have extended the original Grounding DINO repository (https://github.com/IDEA-Research/GroundingDINO) with the ability to train the model for image-to-text grounding. This capability is essential in applications where textual descriptions must align with regions of an image. For instance, given the caption "a cat on the sofa," the model should localize both the "cat" and the "sofa" in the image.
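To make the caption example concrete, here is a minimal sketch of how phrases in a caption can be tied to character spans, which is the text side of a grounding annotation. The helper `phrase_spans` is hypothetical and is not part of this repository or of Grounding DINO; it only illustrates the idea of aligning phrases with a caption.

```python
def phrase_spans(caption, phrases):
    """Return {phrase: (start, end)} character spans of each phrase in the caption.

    Hypothetical helper for illustration only; the matching is a simple
    case-insensitive substring search for the first occurrence.
    """
    spans = {}
    lowered = caption.lower()
    for phrase in phrases:
        start = lowered.find(phrase.lower())
        if start == -1:
            raise ValueError(f"phrase {phrase!r} not found in caption")
        spans[phrase] = (start, start + len(phrase))
    return spans


spans = phrase_spans("a cat on the sofa", ["cat", "sofa"])
print(spans)  # {'cat': (2, 5), 'sofa': (13, 17)}
```

In a grounding dataset, each such span would be paired with the bounding box of the region it describes.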
See the original repository for installation instructions; its prerequisites must be installed first. Then train the model:
python train.py
Visualize the training results on test images:
python test.py
For the input text "peduncle.fruit." and an input test image:
Initially, the model detects the wrong category and misses the peduncle (the green part) of the fruits.
After fine-tuning, the model detects the correct object categories with high confidence and finds all the fruit parts named in the text.
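The text prompt in this example lists its categories separated by periods. The sketch below shows one way to build and split such a prompt. The helpers `build_prompt` and `split_prompt` are hypothetical names introduced here for illustration, and the lowercasing and whitespace trimming are assumptions rather than behavior taken from this repository.

```python
def build_prompt(categories):
    """Join category names into a period-separated prompt such as 'peduncle.fruit.'.

    Hypothetical helper: lowercasing and trimming are assumptions made
    for this sketch.
    """
    cleaned = [c.strip().lower() for c in categories if c.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty category is required")
    return ".".join(cleaned) + "."


def split_prompt(prompt):
    """Split a period-separated prompt back into its category names."""
    return [part.strip() for part in prompt.split(".") if part.strip()]


prompt = build_prompt(["peduncle", "fruit"])
print(prompt)                # peduncle.fruit.
print(split_prompt(prompt))  # ['peduncle', 'fruit']
```

Keeping the category list in one place and generating the prompt from it helps keep the training and test prompts consistent.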
Feel free to open issues, suggest improvements, or submit pull requests. If you found this repository useful, consider giving it a star to make it more visible to others!