[Open] UCASHurui opened this issue 11 months ago
Hi,
Sorry about the delay regarding finetuned checkpoints, I'll try to release them soon.
Regarding the usage of mixed-res ViTs for segmentation, I have two ideas; let me know if they're helpful:
For each token embedding, predict a segmentation mask that is the size of the original patch – 16x16, 32x32, or 64x64 pixels.
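The first idea could be sketched roughly as follows (a minimal sketch with hypothetical names, not code from this repo): one linear head per supported patch size, each token predicting class logits over its own patch area.

```python
import torch
import torch.nn as nn


class PerTokenMaskHead(nn.Module):
    """Hypothetical head: each token predicts a class-logit mask
    covering its own patch (16x16, 32x32, or 64x64 pixels)."""

    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        # One linear head per supported patch size (ModuleDict keys must be strings).
        self.heads = nn.ModuleDict({
            str(p): nn.Linear(embed_dim, num_classes * p * p)
            for p in (16, 32, 64)
        })
        self.num_classes = num_classes

    def forward(self, token: torch.Tensor, patch_size: int) -> torch.Tensor:
        # token: [batch, embed_dim] -> logits: [batch, num_classes, patch, patch]
        logits = self.heads[str(patch_size)](token)
        return logits.view(-1, self.num_classes, patch_size, patch_size)


head = PerTokenMaskHead(embed_dim=768, num_classes=21)
mask = head(torch.randn(2, 768), patch_size=32)
print(mask.shape)  # torch.Size([2, 21, 32, 32])
```

The per-size masks can then be pasted into a full-resolution map at each patch's image coordinates.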
Alternatively, perform a "reverse Quadtree" operation at the end of the ViT, which creates a token volume of size [image_size/16, image_size/16, embed_dim] where each token embedding is duplicated to completely fill its corresponding space, e.g. tokens that represent a 64x64 patch are duplicated 16 times to create a 4x4 token grid. I have JAX code that does this; I can convert it to PyTorch and release it if it's useful to you.
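A rough sketch of that duplication step (not the released implementation; it assumes each token carries a hypothetical (row, col, span) box in units of the base 16px patch grid):

```python
import torch


def reverse_quadtree(tokens: torch.Tensor, boxes, grid_size: int) -> torch.Tensor:
    """Scatter mixed-resolution tokens into a dense token volume.

    tokens: [num_tokens, embed_dim]
    boxes:  list of (row, col, span) in base-patch units, e.g. a 64x64
            patch over a 16px base grid has span 4 (it fills a 4x4 block).
    Returns a [grid_size, grid_size, embed_dim] volume where each token
    embedding is duplicated over its whole block.
    """
    embed_dim = tokens.shape[1]
    volume = tokens.new_zeros(grid_size, grid_size, embed_dim)
    for tok, (r, c, span) in zip(tokens, boxes):
        # Broadcasting copies the [embed_dim] token over the [span, span] block.
        volume[r:r + span, c:c + span] = tok
    return volume


# Toy example: a 4x4 base grid covered by one 2x2 token and twelve 1x1 tokens.
tokens = torch.randn(13, 8)
boxes = [(0, 0, 2)] + [(r, c, 1) for r in range(4) for c in range(4)
                       if not (r < 2 and c < 2)]
volume = reverse_quadtree(tokens, boxes, grid_size=4)
print(volume.shape)  # torch.Size([4, 4, 8])
```

The resulting volume has the same [image_size/16, image_size/16, embed_dim] shape as a vanilla ViT's patch-token grid, so downstream segmentation heads can consume it unchanged.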
Thanks for your reply! I will definitely try these two methods. I have tried an approach similar to the "reverse Quadtree", which did not work well, maybe due to an incorrect implementation on my part. I am looking forward to your "reverse Quadtree" code.
Hi Rui, I released an efficient PyTorch implementation of the Reverse Quadtree operation, including a sanity check that validates correctness. Check out mixed_res/quadtree_impl/reverse_quadtree.py and examples/04_reverse_quadtree.ipynb, and please let me know if it's useful :)
Hi, thanks for this elegant work.
I am working on a weakly supervised semantic segmentation (WSSS) project, and it would be interesting to substitute the default patch-embed module with the mixed-resolution ViT. WSSS methods usually depend on the CAM generated by the model, where the patch tokens have to be reshaped back to the shape of the image features (at patch scale) if ViT-based models are employed. Since a token of the mixed-resolution ViT represents a patch of arbitrary size, it is unnatural to perform such a flattening process. I am wondering if you could share some experience on addressing this issue.
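To be concrete, the flattening I mean is the standard reshape that CAM pipelines apply to a vanilla ViT's patch tokens (a minimal sketch):

```python
import torch

# Vanilla ViT: patch tokens come out as a flat [B, N, D] sequence, and
# WSSS/CAM pipelines reshape them into a [B, D, H/16, W/16] feature map.
B, D, H, W = 2, 768, 224, 224
num_patches = (H // 16) * (W // 16)      # 196 for a 224x224 image
tokens = torch.randn(B, num_patches, D)  # patch tokens (CLS token excluded)
feat = tokens.transpose(1, 2).reshape(B, D, H // 16, W // 16)
print(feat.shape)  # torch.Size([2, 768, 14, 14])
```

With mixed-resolution tokens, N varies per image and each token covers a different area, so this reshape is undefined, which is exactly my difficulty.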
By the way, it would be appreciated if the fine-tuned checkpoints could be shared. Thanks in advance!