Use image_features instead of patch_tokens

ByChelsea / VAND-APRIL-GAN

[CVPR 2023 Workshop] VAND Challenge: 1st Place on Zero-shot AD and 4th Place on Few-shot AD

Hi, thanks for contributing nice work. Here I have a question for discussion.

Question: How can we use image_features (train.py, line 112) instead of patch_tokens with the ResNet50 backbone? Do you have any suggestions on how to achieve this?

In the original code (with the ResNet50 backbone), patch_tokens at several scales are multiplied with text_features (a batched matrix product, judging by the shapes):

- (B, 9216, 768) and (B, 768, 2) => (B, 9216, 2)
- (B, 2304, 768) and (B, 768, 2) => (B, 2304, 2)
- (B, 576, 768) and (B, 768, 2) => (B, 576, 2)

The results are then reshaped, interpolated to the target anomaly map size, and so on.
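To make the shape bookkeeping concrete, here is a minimal NumPy sketch of that multi-scale step. It is not the repository's PyTorch code: the function name `patch_anomaly_map`, the output size of 96, and the choice of index 1 as the "abnormal" text class are all assumptions for illustration, and nearest-neighbour upsampling stands in for whatever interpolation the original uses. It also assumes square token grids (576 = 24 x 24, 2304 = 48 x 48).

```python
import numpy as np

def patch_anomaly_map(patch_tokens, text_features, out_size):
    """Hypothetical helper: (B, N, C) tokens and (B, C, 2) text features
    -> (B, out_size, out_size) per-pixel anomaly probabilities."""
    B, N, _ = patch_tokens.shape
    side = int(round(N ** 0.5))
    if side * side != N:
        raise ValueError(f"N={N} is not a square token grid")
    if out_size % side != 0:
        raise ValueError("out_size must be a multiple of the grid side")

    # Batched matrix product: (B, N, C) @ (B, C, 2) -> (B, N, 2)
    logits = patch_tokens @ text_features

    # Softmax over the two text classes (stabilised by subtracting the max).
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Assumption: index 1 is the "abnormal" class. Restore the spatial grid.
    anomaly = probs[..., 1].reshape(B, side, side)

    # Nearest-neighbour upsampling to the target size (a stand-in for
    # the interpolation used in the real pipeline).
    scale = out_size // side
    return anomaly.repeat(scale, axis=1).repeat(scale, axis=2)

rng = np.random.default_rng(0)
B, C = 2, 768
text_features = rng.standard_normal((B, C, 2))

# One map per scale, then summed into a single anomaly map.
maps = [
    patch_anomaly_map(rng.standard_normal((B, n, C)), text_features, out_size=96)
    for n in (576, 2304)
]
anomaly_map = np.sum(maps, axis=0)
print(anomaly_map.shape)  # (2, 96, 96)
```

The key point is that the token axis N carries the spatial layout, so every scale can be folded back into a 2D grid before upsampling.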

But image_features has shape (B, 768) and text_features has shape (B, 768, 2). How should we modify the remaining steps so that we can still train the linear layers and generate anomaly maps at inference?

If you have any questions, feel free to ask. Thanks!

Hi, I'm glad you're interested in our work.

image_features contains the global information of an image, which aggregates all the local detailed information. Anomaly segmentation usually requires the specific detail at each position in the image, which _cannot be directly reflected through image_features alone_.
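A small NumPy sketch of why the global vector has no positions to map back to. The shapes follow the question; the use of mean pooling to build `image_features` is an assumption made only for illustration (the backbone's actual pooling may differ), and all data here is random.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, C = 2, 576, 768
patch_tokens = rng.standard_normal((B, N, C))
text_features = rng.standard_normal((B, C, 2))

# Assumption: mean pooling stands in for the backbone's global pooling.
image_features = patch_tokens.mean(axis=1)           # (B, C)

# Global feature vs. text: one pair of logits per image, no spatial axis.
image_logits = np.einsum("bc,bck->bk", image_features, text_features)
print(image_logits.shape)                            # (2, 2)

# Patch tokens vs. text: one pair of logits per position.
patch_logits = patch_tokens @ text_features
print(patch_logits.shape)                            # (2, 576, 2)

# Shuffling the patch positions leaves the pooled vector unchanged,
# so the pooled vector cannot say where an anomaly is located.
shuffled = patch_tokens[:, rng.permutation(N), :]
same_global = np.allclose(shuffled.mean(axis=1), image_features)
print(same_global)                                   # True
```

Under this assumption, the (B, 2) result can support an image-level decision, but there is nothing to reshape into an anomaly map.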

Therefore, I believe it is not possible to obtain anomaly maps from image_features and text_features alone. However, image_features does contain valuable information, so you could consider using it in addition to patch_tokens rather than discarding patch_tokens completely.
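One possible way to act on this suggestion, sketched in NumPy: keep the pixel-level map from patch_tokens and use image_features only for the image-level score. This is an illustrative assumption, not the method used in the repository; the function `image_level_score`, the blending weight `alpha`, and the choice of index 1 as the "abnormal" class are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def image_level_score(image_features, text_features, anomaly_map, alpha=0.5):
    """Hypothetical fusion of a global and a local anomaly score.

    image_features: (B, C), text_features: (B, C, 2),
    anomaly_map: (B, H, W) with values in [0, 1]. Returns (B,).
    """
    # Global score: probability of the assumed "abnormal" class (index 1).
    logits = np.einsum("bc,bck->bk", image_features, text_features)
    global_score = softmax(logits)[:, 1]

    # Local score: the strongest response anywhere in the anomaly map.
    local_score = anomaly_map.reshape(anomaly_map.shape[0], -1).max(axis=1)

    # Convex combination; alpha is an illustrative hyperparameter.
    return alpha * global_score + (1.0 - alpha) * local_score

rng = np.random.default_rng(0)
B, C = 2, 768
image_features = rng.standard_normal((B, C))
text_features = rng.standard_normal((B, C, 2))
anomaly_map = rng.random((B, 96, 96))  # stand-in for a patch-token map

scores = image_level_score(image_features, text_features, anomaly_map)
print(scores.shape)  # (2,)
```

With this split, segmentation still depends entirely on patch_tokens, while the global feature only adjusts the per-image decision.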

Use image_features instead of patch_tokens #5