javiabellan opened this issue 6 months ago
Hey, that's a really cool idea! Would love to hear any updates on your progress. Feel free to just email us directly, in case you don't want to broadcast.
One thing that we discovered in the past month is that RADIO currently has a weird behavior where it operates differently at low resolution and high resolution. The switch appears to occur at around 720px, though it's not a hard threshold. Below 720px, the CLIP and DINO heads work as expected but SAM does not, and above 720px the opposite happens. We are calling it "mode switching", and we briefly discuss it in the latest arXiv version.
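For anyone who wants to see the switch for themselves, here is a minimal probing sketch. It assumes the `torch.hub` entry point from the RADIO README (`radio_model`, a `version` string, and a forward pass returning `(summary, spatial_features)`); the exact version tag and signature may differ from whatever release you have installed.

```python
# Minimal sketch: run RADIO at a low and a high resolution and compare the
# global summaries. Assumes the torch.hub API from the NVlabs/RADIO README;
# the version tag below is an assumption and may need updating.
import torch
import torch.nn.functional as F

model = torch.hub.load('NVlabs/RADIO', 'radio_model',
                       version='radio_v1', progress=True)
model.eval()

img = torch.rand(1, 3, 1024, 1024)  # stand-in for a real image, values in [0, 1]

with torch.no_grad():
    # Below ~720px the features behave like CLIP/DINO...
    lo = F.interpolate(img, size=(432, 432), mode='bilinear')
    summary_lo, spatial_lo = model(lo)
    # ...and above ~720px they align more with SAM.
    summary_hi, spatial_hi = model(img)

# A large drop in similarity between the two summaries is one symptom of the switch.
print(F.cosine_similarity(summary_lo, summary_hi, dim=-1))
```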
We think we have figured out how to address it, so I expect we'll release a new version of the model in the near future.
In the interim, if it's giving you trouble, shoot me an email and we'll try to get you sorted out.
That switch makes sense; as I remember from the paper, CLIP and DINO were distilled at resolutions around 224~448 and SAM at 1024px.
Back to my current interest: I want to figure out tree-based detection/segmentation. If I find something, I will let you know.
PS: I don't like the Grounded-SAM approach because it is not end-to-end: it runs in 2 phases, and in the later phase SAM has to segment with only the box as input, not the original text.
I'm very curious to hear how this is going to go. I have been exploring Grounding DINO heavily for producing pseudo 'class probabilities' due to its open-set nature. Basically, I'm using Grounding DINO to produce a class probability distribution for each mask given any set of input classes. These can then be used for downstream tasks.
I wonder whether RADIO can be used to do the same, but in an integrated end-to-end manner.
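For what it's worth, here is a rough sketch of what that could look like with RADIO's CLIP-aligned features instead of Grounding DINO: pool the per-pixel CLIP-space features inside each mask and softmax the similarities against the text embeddings of the candidate classes. The tensors below are stand-ins; in practice `pixel_feats` would come from RADIO's CLIP adaptor (upsampled to image size) and `text_embs` from the matching CLIP text encoder.

```python
# Sketch of "class distribution per mask" using CLIP-space pixel features.
# All inputs are placeholders; names and shapes are assumptions.
import torch
import torch.nn.functional as F

def class_probs_per_mask(pixel_feats: torch.Tensor,   # [H, W, C] CLIP-space features
                         masks: torch.Tensor,          # [N, H, W] boolean masks
                         text_embs: torch.Tensor,      # [K, C] one embedding per class prompt
                         temperature: float = 0.01) -> torch.Tensor:
    """Return an [N, K] class probability distribution for each mask."""
    H, W, C = pixel_feats.shape
    flat = F.normalize(pixel_feats.reshape(H * W, C), dim=-1)
    text = F.normalize(text_embs, dim=-1)

    probs = []
    for m in masks:
        sel = flat[m.reshape(-1)]                      # pixels inside the mask
        if sel.numel() == 0:
            probs.append(torch.full((text.shape[0],), 1.0 / text.shape[0]))
            continue
        pooled = F.normalize(sel.mean(dim=0), dim=-1)  # average-pool, then re-normalize
        logits = pooled @ text.T / temperature         # cosine similarity per class
        probs.append(logits.softmax(dim=-1))
    return torch.stack(probs)

# Smoke test with random tensors standing in for real features, masks, and prompts.
feats = torch.randn(64, 64, 1024)
masks = torch.rand(3, 64, 64) > 0.7
texts = torch.randn(5, 1024)   # e.g. embeddings of 5 class prompts
print(class_probs_per_mask(feats, masks, texts).shape)  # torch.Size([3, 5])
```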
I want to explore this idea of end-to-end, text-prompted (open-vocabulary) detection and segmentation.
These are, a priori, tasks not directly related to RADIO, but RADIO has all the ingredients (CLIP + DINO + SAM) to solve this problem.
If we look at SOTA methods, we find Grounded-SAM. That is a 2-step process (see the sketch below):
1. An open-set detector (Grounding DINO) turns the text prompt into bounding boxes.
2. SAM segments each box, without ever seeing the original text.
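To make the 2-phase structure (and the "SAM never sees the text" limitation) concrete, here is a rough sketch. `detect_boxes_from_text` is a hypothetical stand-in for the Grounding DINO step; the SAM calls follow the `segment_anything` package, but double-check its docs and the checkpoint path for your setup.

```python
# Rough sketch of the 2-step Grounded-SAM pipeline.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def detect_boxes_from_text(image: np.ndarray, prompt: str) -> np.ndarray:
    """Hypothetical step 1: run an open-set detector (e.g. Grounding DINO) on the
    text prompt and return boxes as an [N, 4] xyxy array in pixels.
    Returns a dummy box here so the sketch runs."""
    return np.array([[100, 100, 400, 400]], dtype=np.float32)

image = np.zeros((1024, 1024, 3), dtype=np.uint8)        # stand-in for a real RGB image
boxes = detect_boxes_from_text(image, "tree . trunk . branch")

# Step 2: SAM segments each box. Only the box geometry is passed in; the original
# text prompt is gone, which is the limitation called out above.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # path to a downloaded SAM checkpoint
predictor = SamPredictor(sam)
predictor.set_image(image)
masks = [predictor.predict(box=b, multimask_output=False)[0] for b in boxes]
```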
I also find the tree/nested detection of the NanoOWL work very interesting. I think (because it is based on OWL-ViT) it only uses the CLIP vision and text encoders.
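The nested part can be sketched independently of the detector: detect the parent labels first, then re-run the detector only inside each parent's box for its children. `owl_detect` below is a hypothetical placeholder for an OWL-ViT-style call (image + text labels → boxes); NanoOWL's actual API differs.

```python
# Rough sketch of tree/nested detection; owl_detect is a hypothetical placeholder.
from typing import Dict, List, Tuple
import numpy as np

Box = Tuple[int, int, int, int]  # x0, y0, x1, y1 in pixel coordinates

def owl_detect(image: np.ndarray, labels: List[str]) -> List[Tuple[str, Box]]:
    """Hypothetical: return (label, box) pairs for the given text labels."""
    return []

def detect_tree(image: np.ndarray, tree: Dict[str, List[str]], roots: List[str]):
    """Walk a label tree such as {"tree": ["trunk", "canopy"]}, detecting
    children only inside their parent's box."""
    results = []
    for label, (x0, y0, x1, y1) in owl_detect(image, roots):
        results.append((label, (x0, y0, x1, y1)))
        children = tree.get(label, [])
        if children:
            crop = image[y0:y1, x0:x1]
            for child, (cx0, cy0, cx1, cy1) in owl_detect(crop, children):
                # Child boxes come back in crop coordinates; shift back to image coords.
                results.append((child, (cx0 + x0, cy0 + y0, cx1 + x0, cy1 + y0)))
    return results

image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a real image
print(detect_tree(image, {"tree": ["trunk", "canopy"]}, roots=["tree"]))
```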
I would like to explore these (and more) ideas to find more use cases for the RADIO model :)