Ranjitkm2007 / SwinPLT

SwinPLT: Swin Transformer with Part-Level Tokenization for Occluded Person Re-identification

Consulting about the complete code #1

Closed shenhai911 closed 2 months ago

shenhai911 commented 3 months ago

Dear author, thank you for your excellent work. I would like to inquire when you plan to make all your code publicly available. I am looking forward to your reply. Thank you!

Ranjitkm2007 commented 3 months ago

Thank you for your enquiry.

As per my PhD supervisor's suggestion, I will make the code publicly available as soon as the paper is accepted for publication. However, if you have any particular query, I will be glad to answer it.

Regards, Ranjit


shenhai911 commented 3 months ago

Thank you very much for your prompt reply! Recently, I have been conducting research based on the Swin Transformer and ran into some difficulties while trying to visualize attention maps. Since the Swin Transformer uses a local, window-based attention mechanism, I would like to ask your advice on how to recover global attention from its local attention, in order to achieve the effect shown in the figure below. Would you be willing to share your code or insights on this part? Thank you very much!

[attached figure]

Ranjitkm2007 commented 3 months ago

Can you please share details of how you are trying to capture the attention weights during the forward pass?

shenhai911 commented 2 months ago

Hello, I am very sorry for the late reply. My question is this: the standard Swin Transformer contains four stages, and I want to obtain the global attention after each stage and superimpose it on the input image to visualize how the attended region changes from stage to stage. However, for an input image of shape (3, 224, 224), the output tensors of the four stages have shapes (96, 56, 56), (192, 28, 28), (384, 14, 14) and (768, 7, 7). I have tried several ways to recover global attention from the local window attention, but have not been able to solve the problem. Could you please advise me on how to achieve the effect shown in your paper, as in the figure below?

[attached figure]

Ranjitkm2007 commented 2 months ago

The approach is to extract the attention scores from each block (shape: (num_windows × batch_size, num_heads, N, N), where N is the number of tokens per window), which represent the attention within each window. Aggregate the attention maps of all windows to approximate a stage-wise attention map on that stage's feature grid, then use bilinear interpolation to resize each stage's aggregated map to (224, 224):

- Stage 1 attention map (56, 56) → upsample to (224, 224)
- Stage 2 attention map (28, 28) → upsample to (224, 224)
- Stage 3 attention map (14, 14) → upsample to (224, 224)
- Stage 4 attention map (7, 7) → upsample to (224, 224)

This ensures that the attention maps from different stages can be overlaid on the original input image of size (224, 224).
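For reference, here is a minimal sketch of that procedure, assuming timm's `swin_tiny_patch4_window7_224` rather than the SwinPLT code itself. The hook target `attn.attn_drop`, the helper `window_attn_to_map`, and the block indices are assumptions based on timm's module naming and swin-tiny's depths of (2, 2, 6, 2), and the aggregation here is a simple head/query average, not necessarily the exact aggregation used in the paper:

```python
import torch
import torch.nn.functional as F
import timm

# Recent timm versions may use fused attention, in which case the explicit
# attention matrix is never materialized and the hook below would not fire.
# set_fused_attn is only available in newer timm releases; skip it otherwise.
try:
    timm.layers.set_fused_attn(False)
except AttributeError:
    pass

model = timm.create_model('swin_tiny_patch4_window7_224', pretrained=True).eval()

# Hook the attention-dropout module inside each WindowAttention block: in eval
# mode it is an identity, so its output is the softmaxed per-window attention.
captured = []  # one tensor per block, shape (num_windows * B, heads, N, N)

def save_attn(module, inputs, output):
    captured.append(output.detach())

hooks = [m.register_forward_hook(save_attn)
         for n, m in model.named_modules() if n.endswith('attn.attn_drop')]

img = torch.randn(1, 3, 224, 224)  # replace with a real, ImageNet-normalized image
with torch.no_grad():
    _ = model(img)
for h in hooks:
    h.remove()

def window_attn_to_map(attn, feat_hw, window=7):
    """Average heads and query positions to get the attention each token receives,
    then tile the (non-shifted) windows back onto the (feat_hw, feat_hw) grid."""
    rows = feat_hw // window                 # windows per row/column
    saliency = attn.mean(dim=1).mean(dim=1)  # (num_windows, N); batch size 1 assumed
    saliency = saliency.reshape(rows, rows, window, window)
    saliency = saliency.permute(0, 2, 1, 3).reshape(feat_hw, feat_hw)
    return saliency

# Feature-map sizes per stage for a 224x224 input, and the index of the first
# (non-shifted) block of each stage given swin-tiny depths (2, 2, 6, 2).
stage_sizes = [56, 28, 14, 7]
first_block = [0, 2, 4, 10]

for size, idx in zip(stage_sizes, first_block):
    stage_map = window_attn_to_map(captured[idx], size)
    stage_map = F.interpolate(stage_map[None, None], size=(224, 224),
                              mode='bilinear', align_corners=False)[0, 0]
    stage_map = (stage_map - stage_map.min()) / (stage_map.max() - stage_map.min() + 1e-6)
    # stage_map is now a (224, 224) heat map that can be overlaid on the input image.
```

The shifted-window blocks are skipped here because their windows are cyclically shifted before partitioning; including them would require undoing the shift before tiling the windows back onto the feature grid.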