Jiawei-Yang / Denoising-ViT

This is the official code release for our work, Denoising Vision Transformers.
MIT License

Feature Map Generation #12

Open derow0208 opened 3 weeks ago

derow0208 commented 3 weeks ago

Excellent work!!! But I'm a little confused about how the feature maps (before PCA visualization) are generated in this project. Are they obtained by extracting the attention score matrix of each layer?

Jiawei-Yang commented 3 weeks ago

Thanks for your interest in our work!

The feature maps here refer to the output tokens from a Transformer block. Recall that a ViT is a stack of Transformer blocks: (architecture diagram)

We reshape the 1D token sequence output by Block-i back into a 2D grid to obtain the feature maps.
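A minimal sketch of this reshape, assuming a standard ViT configuration (224×224 input, 16×16 patches, embedding dimension 768, with a [CLS] token); the exact tensor names and shapes in the repo may differ:

```python
import numpy as np

H = W = 14   # patch grid: 224 / 16 = 14 patches per side (assumed config)
D = 768      # token embedding dimension (assumed, ViT-B default)

# Output of Transformer Block-i: one token per patch, plus a [CLS] token.
tokens = np.random.randn(1 + H * W, D)   # shape: (197, 768)

# Drop the [CLS] token, then reshape the 1D patch-token sequence
# back into a 2D feature map of shape (H, W, D).
patch_tokens = tokens[1:]                # (196, 768)
feature_map = patch_tokens.reshape(H, W, D)
print(feature_map.shape)                 # (14, 14, 768)
```

PCA can then be applied over the D channels of this (H, W, D) map to produce the 3-channel visualizations.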

derow0208 commented 3 weeks ago

So you mean that at first every image patch is projected to a token of shape 1×D, and you then reshape the output tokens from the 1D sequence back into the image's patch grid — am I right?