TomerRonen34 / mixed-resolution-vit


Questions about applying mixed-resolution-vit to other downstream tasks such as semantic segmentation. #2

Open UCASHurui opened 11 months ago

UCASHurui commented 11 months ago

Hi, thanks for this elegant work.

I am working on a weakly supervised semantic segmentation (WSSS) project, and it would be interesting to substitute the default patch embedding module with mixed-resolution-vit. WSSS methods usually depend on the CAM generated by the model, where the patch tokens have to be reshaped back into the shape of the image features (at patch scale) if ViT-based models are used. Since a token in mixed-resolution-vit represents a patch of arbitrary size, it is unnatural to perform such a reshaping process. I am wondering if you could share some experience with addressing this issue.

By the way, it would be appreciated if the fine-tuned checkpoints could be shared. Thanks in advance!

TomerRonen34 commented 11 months ago

Hi,

Sorry about the delay regarding finetuned checkpoints, I'll try to release them soon.

Regarding the usage of mixed-res ViTs for segmentation, I have 2 ideas; let me know if they're helpful:

  1. For each token embedding, predict a segmentation mask that is the size of the original patch – 16x16, 32x32 or 64x64 pixels.

  2. Alternatively, perform a "reverse Quadtree" operation at the end of the ViT, which creates a token volume of size [image_size/16, image_size/16, embed_dim] where each token embedding is duplicated to completely fill its corresponding space, e.g. tokens that represent a 64x64 patch will be duplicated 16 times to create a 4x4 token grid. I have code that does that in JAX, I can convert it to PyTorch and release it if it's useful to you.
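To make option 2 concrete, here is a rough, unvectorized PyTorch sketch of the idea (this is not the repo's actual API; the function name and the (top, left, size) box format are placeholders I'm assuming for illustration):

```python
import torch

def reverse_quadtree_fill(token_embeds: torch.Tensor,
                          patch_boxes: torch.Tensor,
                          image_size: int,
                          base_patch: int = 16) -> torch.Tensor:
    """Duplicate each mixed-res token over the base-patch cells its patch covers.

    token_embeds: [num_tokens, embed_dim] token embeddings from the ViT.
    patch_boxes:  [num_tokens, 3] rows of (top, left, size) in pixels,
                  where size is 16, 32 or 64 (assumed format, for illustration).
    Returns a dense grid of shape [image_size/16, image_size/16, embed_dim].
    """
    grid_len = image_size // base_patch
    grid = token_embeds.new_zeros(grid_len, grid_len, token_embeds.shape[1])
    for embed, (top, left, size) in zip(token_embeds, patch_boxes.tolist()):
        n = size // base_patch                 # 1, 2 or 4 grid cells per side
        r, c = top // base_patch, left // base_patch
        grid[r:r + n, c:c + n] = embed         # broadcast embed over the n x n block
    return grid
```

A 64x64 patch fills a 4x4 block of the grid, so its embedding appears 16 times, matching the duplication described above; the resulting [image_size/16, image_size/16, embed_dim] volume can then be treated like a standard ViT feature map for CAM generation or a segmentation head.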

UCASHurui commented 11 months ago

Thanks for your reply! I will definitely try these two methods. I have tried an approach similar to the "reverse Quadtree", which did not work well, maybe due to an incorrect implementation on my part. I am looking forward to your "reverse Quadtree" code.

TomerRonen34 commented 10 months ago

Hi Rui, I released an efficient torch implementation of the Reverse Quadtree operation, including a sanity check that validates its correctness. Check out mixed_res/quadtree_impl/reverse_quadtree.py and examples/04_reverse_quadtree.ipynb, and please let me know if it's useful :)
