Closed: 1amrutesh closed this issue 1 year ago
I'm not sure whether your issue has been resolved. Switching the backbone used for feature extraction is a good idea, but we have only run experiments on CNN-based models. If you want to experiment with Swin Transformer V2, I suggest also combining features from different layers. As for which specific layers to use, that will require more experimentation on your part; we suspect that features from intermediate layers should be more effective for anomaly detection.
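For example, here is a rough sketch of how features from two intermediate stages of a timm Swin V2 model could be captured with forward hooks. The layer names ("layers.1", "layers.2") are only illustrative guesses at timm's stage naming, not something we have validated for SoftPatch:

```python
import timm
import torch

# Rough sketch only: layer names are illustrative guesses at timm's
# SwinTransformerV2 stage names, not validated for SoftPatch.
model = timm.create_model("swinv2_large_window12_192_22k", pretrained=False)  # name as used in this issue
model.eval()

features = {}

def make_hook(name):
    def hook(module, inputs, output):
        features[name] = output.detach()
    return hook

modules = dict(model.named_modules())
for name in ("layers.1", "layers.2"):       # two intermediate stages
    modules[name].register_forward_hook(make_hook(name))

with torch.no_grad():
    model(torch.randn(1, 3, 192, 192))

for name, feat in features.items():
    # Depending on the timm version the stage output is a (B, L, C) token
    # sequence or a (B, H, W, C) map; either way it has to be mapped back to a
    # spatial grid before SoftPatch-style pooling and patching.
    print(name, tuple(feat.shape))
```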
Based on the code you provided, it seems that the _embed method has not been changed much. Could you provide more details, including where exactly the error occurs? Modifications to the PatchMaker need to stay consistent with its sliding-window behaviour; its main function is to divide the feature map into patches.
Based on your error message, it looks like a straightforward tensor shape mismatch: the total number of elements does not match the target shape. I suspect this is caused by the window mechanism of the Swin Transformer, where a single feature extracted at an intermediate layer covers only part of the image, so it cannot be reshaped to the scale of the full-image feature map.
Description: I am trying to modify the SoftPatch implementation to use a vision-transformer backbone such as Swin Transformer V2 instead of WideResNet-50 for the feature extraction step. I have run into some challenges and have questions about how the SoftPatch class needs to be modified so that feature extraction works correctly with the Swin Transformer V2 model.
Background:
In the original implementation, the WideResNet-50 model is used as the feature extractor. I want to replace it with the Swin Transformer V2 model to leverage the benefits of vision transformers for feature extraction. I have imported the Swin Transformer V2 model and modified the SoftPatch class, but I am not certain whether the modifications are logically correct and compatible with the Swin Transformer V2 model.
Errors and Questions:
In the load method, I have set the feature aggregator to the Swin Transformer V2 model. I am not sure if this is the correct way to integrate the model into the SoftPatch class. Are there any other changes required to properly use the Swin Transformer V2 model as a feature extractor in the SoftPatch class?
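In outline, the integration I am attempting looks roughly like the following simplified sketch (identifiers are illustrative, not the exact code from my class):

```python
import timm

# Simplified, illustrative outline of the load-time change (not the exact code
# from my modified class): the WideResNet-50 backbone is replaced by a timm
# Swin V2 model and everything else is left as it was.
backbone = timm.create_model(
    "swinv2_large_window12_192_22k",        # variant mentioned in this issue
    pretrained=True,
)
layers_to_extract_from = ("patch_embed.proj",)  # my current choice, questioned below

# The SoftPatch feature aggregator is then built from this backbone and these
# layer names, as in the original load() flow; nothing else in load is touched.
```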
The original implementation uses the WideResNet-50 model for feature extraction, and the features are extracted from certain layers. I have set the layers_to_extract_from parameter to ("patch_embed.proj",), which is a Swin Transformer V2 specific layer. Is this the correct layer to extract features from, or should I use a different layer or a combination of layers for feature extraction?
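For context, here is a generic way to inspect candidate layers; this is just a diagnostic snippet, not part of SoftPatch:

```python
import timm

# Generic diagnostic, not SoftPatch code: list the top-level Swin V2 blocks so
# that alternatives to "patch_embed.proj" can be identified.
backbone = timm.create_model("swinv2_large_window12_192_22k", pretrained=False)
for name, module in backbone.named_children():
    print(name, type(module).__name__)
# Typically prints something like: patch_embed, layers, norm, head.
# "patch_embed.proj" is only the initial 4x4 patch projection; the stages under
# "layers" ("layers.0" ... "layers.3") hold the intermediate features that the
# maintainers suggest combining.
```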
In the _embed method, I have adjusted the feature handling for Swin Transformer V2 patch embeddings. Is this the right way to handle features extracted from the Swin Transformer V2 model? Are there any additional steps required to process the features before passing them to the preprocessing module and the preadapt_aggregator?
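For reference, this is the kind of conversion I believe is needed here, sketched under the assumption that the hooked Swin V2 output is a (B, L, C) token sequence:

```python
import math
import torch

# Sketch, assuming the hooked Swin V2 output is a (B, L, C) token sequence.
# SoftPatch's downstream pooling and patching expect CNN-style (B, C, H, W)
# maps, so the tokens are first laid back out on their spatial grid.
def tokens_to_feature_map(tokens: torch.Tensor) -> torch.Tensor:
    b, l, c = tokens.shape
    h = w = int(math.sqrt(l))               # assumes a square token grid
    assert h * w == l, "token count is not a square grid"
    return tokens.permute(0, 2, 1).reshape(b, c, h, w)

# Some timm versions return (B, H, W, C) instead; then tokens.permute(0, 3, 1, 2)
# is all that is needed.
tokens = torch.randn(2, 48 * 48, 192)        # e.g. the 48x48 grid from the 4x4 patch embed at 192x192 input
print(tokens_to_feature_map(tokens).shape)   # torch.Size([2, 192, 48, 48])
```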
The SoftPatch class uses the PatchMaker class for patching and unpatching images. I have modified the PatchMaker class to use the Swin Transformer V2's patch embedding. Is this the correct way to patch and unpatch images using the Swin Transformer V2 model, or are there additional modifications required in the PatchMaker class?
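To make the question concrete, here is my (simplified) understanding of what the original PatchMaker does; it is a sketch of the idea, not code copied from the repository:

```python
import torch
import torch.nn.functional as F

# Simplified sketch of my understanding of the original PatchMaker: it unfolds
# an NCHW feature map into overlapping 3x3 neighbourhoods, independently of the
# backbone, so swapping in Swin's patch embedding changes its input/output contract.
def patchify(features: torch.Tensor, patchsize: int = 3) -> torch.Tensor:
    b, c, h, w = features.shape
    padding = (patchsize - 1) // 2
    unfolded = F.unfold(features, kernel_size=patchsize, padding=padding, stride=1)
    # (B, C * k * k, H * W) -> (B, H * W, C, k, k)
    unfolded = unfolded.reshape(b, c, patchsize, patchsize, h * w)
    return unfolded.permute(0, 4, 1, 2, 3)

fmap = torch.randn(2, 192, 48, 48)           # CNN-style map, e.g. from the reshape above
print(patchify(fmap).shape)                  # torch.Size([2, 2304, 192, 3, 3])
```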
Are there any other changes or modifications needed in the SoftPatch class or any other related classes to ensure compatibility with the Swin Transformer V2 model for feature extraction?
I have provided the modified SoftPatch class below for reference:
I get the error RuntimeError: shape '[8, 36, 1536, 36, 3, 3]' is invalid for input of size 3981312. I am using the swinv2_large_window12_192_22k variant from the timm library, and I have resized the images to 192×192 to match the Swin architecture. I am using the MVTec-AD dataset. I sincerely appreciate any help or guidance you can provide in modifying the SoftPatch class to use the Swin Transformer V2 model for feature extraction.
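For what it's worth, a quick check on the element counts in the error (not part of my modified class) suggests that exactly one spatial factor of 36 is missing:

```python
# Quick check on the numbers in the error (not part of my modified class).
target_shape = (8, 36, 1536, 36, 3, 3)
needed = 1
for d in target_shape:
    needed *= d
available = 3981312
print(needed)               # 143327232
print(available)            # 3981312
print(needed // available)  # 36 -> exactly one factor of 36 is missing,
# which suggests the extracted features cover only a 36-element window/grid
# rather than the full 36x36 spatial layout the reshape expects.
```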