Open derow0208 opened 3 weeks ago
Thanks for your interest in our work!
The feature maps here refer to the output tokens of a Transformer block. Recall that a ViT is a stack of Transformer blocks:
We reshape the 1D output from Block-i back to 2D to obtain feature maps.
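A minimal sketch of that reshaping step, assuming a 224×224 input with 16×16 patches (a 14×14 patch grid), embedding dimension 768, and a CLS token prepended to the sequence (these sizes are illustrative, not taken from the repo):

```python
import numpy as np

# Assumed sizes: 224x224 image, 16x16 patches -> 14x14 patch grid.
grid_h, grid_w, dim = 14, 14, 768

# Output tokens from Block-i: one CLS token plus N patch tokens,
# each a D-dimensional vector, so the shape is (1 + N, D).
tokens = np.random.randn(1 + grid_h * grid_w, dim)

# Drop the CLS token and reshape the remaining patch tokens back
# into the 2D patch grid, giving an (H, W, D) feature map that can
# then be flattened per-pixel and fed to PCA for visualization.
feature_map = tokens[1:].reshape(grid_h, grid_w, dim)
print(feature_map.shape)  # (14, 14, 768)
```

In other words, the feature maps come from the block outputs themselves, not from the attention score matrices.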
So you mean that each image patch is first projected to a token of shape 1×D, and you then reshape the output tokens from 1×D back into patches of the image, is that right?
Excellent work! But I'm a little confused about how the feature maps (before PCA visualization) are generated in this project. Are they obtained by extracting the attention score matrix of each layer?