Hi @mranzinger,
Thanks, and we think RADIO is great work!
Input: Your usage is correct. We use Hugging Face's input processor, and it accepts:

- `list[np.ndarray]` or `list[PIL.Image]`: channel-last, uint8
- `torch.Tensor` in `(*, H, W, C)` or `(*, C, H, W)`: uint8

All of these take 0-255 values.
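For example, a minimal sketch covering both formats (the checkpoint id and the `forward_feature` entry point below are illustrative assumptions; adapt them to the checkpoint and API you actually use):

```python
import numpy as np
import torch
from transformers import AutoModel

# Hypothetical checkpoint id for illustration; trust_remote_code loads the
# custom Theia model class from the Hub.
model = AutoModel.from_pretrained(
    "theaiinstitute/theia-tiny-patch16-224-cddsv", trust_remote_code=True
)

# list[np.ndarray] (or list[PIL.Image]): channel-last uint8, values in 0-255.
images = [np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)]
features = model.forward_feature(images)  # forward_feature: assumed entry point

# torch.Tensor: uint8, shaped (*, H, W, C) or (*, C, H, W), values in 0-255.
batch = torch.randint(0, 256, (2, 224, 224, 3), dtype=torch.uint8)
features = model.forward_feature(batch)
```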
Resolution: The interpolation approach you found is correct (it comes from Hugging Face's implementation). However, it may not work ideally, since we didn't train on resolutions larger than 224; we target robot learning tasks rather than dense prediction tasks. Thanks for sharing the results on ADE20K. As the RADIO paper shows, CPE may give improvements on segmentation tasks. More training images could also help.
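For reference, the usual recipe (a generic sketch of the common Hugging Face/ViT approach, not Theia's exact code) reshapes the patch position embeddings into their 2D grid and resizes them bicubically:

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid) -> torch.Tensor:
    """pos_embed: (1, 1 + N, D), with a leading CLS token and N = grid * grid patches."""
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    grid = int(patch_pos.shape[1] ** 0.5)
    dim = patch_pos.shape[-1]
    # (1, N, D) -> (1, D, grid, grid) so it can be treated as a 2D feature map
    patch_pos = patch_pos.reshape(1, grid, grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=new_grid, mode="bicubic", align_corners=False)
    # back to (1, H' * W', D), then re-attach the CLS embedding
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid[0] * new_grid[1], dim)
    return torch.cat([cls_tok, patch_pos], dim=1)

# e.g. 224px with 16px patches is a 14x14 grid; 512px is 32x32
pos_512 = interpolate_pos_embed(torch.randn(1, 1 + 14 * 14, 768), (32, 32))
```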
BibTeX: We would be happy to update the BibTeX entry, and congrats on the CVPR publication! Our draft was finished before the CVPR conference.
Okay, great, thank you. Given that the mIoU was roughly similar at both resolutions, your model seems reasonably resilient to changes in resolution. I'll keep playing with it.
Something I definitely learned from your work is that we should have considered a regular ViT as a teacher. I really didn't expect it to be so important, but you proved how valuable it is. When we started RADIO, we had no idea how difficult SAM was going to be for us. We figured "hey, it should help with segmentation," and then spent the next while trying to figure out how to integrate it without it poisoning the model.
Based on the format of your paper, are you targeting ICLR?
Thanks so much for sharing your valuable findings! I think we found something similar about poisoning: in some preliminary analysis, SAM features turned out to be pretty easy to predict from the other teachers' features. SAM also took a lot of our compute and storage for distillation, and in the end didn't contribute much improvement. ViT is interesting; our motivation was simply that it's a classification model, which is different from all the other teachers we considered.
Some good news to share: Theia has been accepted to CoRL this year :) Sorry for not replying earlier while the paper was under review :)
Congrats on CoRL!
Hi @mranzinger, I am also working on related tasks. Could you provide some code details or examples for using different resolutions? That would be greatly appreciated.
Hello, excellent work!
In the README, I don't see any reference to how inputs need to be transformed before usage. Crawling through the code, I found this: https://github.com/bdaiinstitute/theia/blob/main/src/theia/models/backbones.py#L337-L338
So, it suggests to me that the right way to use the model is to pass it an input tensor with values between 0 and 255. Is that the correct usage?
Also, do you have any studies on the resolution interpolation ability of your model? I'm testing it out in an ADE20K semantic segmentation linear-probe harness with the following:
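(the exact values below are an illustrative stand-in, using standard mmseg-style ADE20K sliding-window settings)

```python
# Illustrative stand-in, not the exact harness config: standard mmseg-style
# ADE20K settings for 512px sliding-window evaluation.
crop_size = (512, 512)
test_cfg = dict(mode='slide', crop_size=(512, 512), stride=(341, 341))
```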
just so that it matches our settings for AM-RADIO. I've also tried it with 224px resolution. In both cases, I'm using a sliding window.
My results:

- 224px: 35.61 mIoU
- 512px: 35.58 mIoU
Also, would you be willing to update the BibTeX entry for your AM-RADIO reference to the following?
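(the entry below is reconstructed from the CVPR 2024 proceedings listing)

```bibtex
@InProceedings{Ranzinger_2024_CVPR,
  author    = {Ranzinger, Mike and Heinrich, Greg and Kautz, Jan and Molchanov, Pavlo},
  title     = {AM-RADIO: Agglomerative Vision Foundation Model - Reduce All Domains Into One},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2024},
}
```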
I nearly missed your paper because it didn't show up in my "Cited By" section (I think because the citation wasn't complete), and I was thrilled to see your work building in the agglomerative direction.