davide-coccomini / Combining-EfficientNet-and-Vision-Transformers-for-Video-Deepfake-Detection

Code for Video Deepfake Detection model from "Combining EfficientNet and Vision Transformers for Video Deepfake Detection" presented at ICIAP 2021.
https://dl.acm.org/doi/abs/10.1007/978-3-031-06433-3_19
MIT License

How to train crossvit on b5 or higher? #27

Closed winterfell2021 closed 2 years ago

winterfell2021 commented 2 years ago

I simply changed the EfficientNet name in the code, and it loads the B5 pretrained weights successfully. But during training it gets stuck at x = self.to_patch_embedding(x) in the forward function. Thanks for your great work and help!

davide-coccomini commented 2 years ago

Hi @winterfell2021! To make the models work with a specific type of EfficientNet, we adapted the number of patches and their size fed into the Vision Transformer based on the outputs of the intermediate layers of the EfficientNet. We did this for the version with the classic Vision Transformer using both EfficientNet B0 and EfficientNet B7, and for the Cross Vision Transformer only with EfficientNet B0.

If you want to make it work with an EfficientNet B5 you inevitably have to change the internal model code so that all dimensions match.

winterfell2021 commented 2 years ago

Thanks for your reply! Since I am a beginner, could you give some advice, examples, or related resources on how to change the internal model code so that all dimensions match? Thanks again!

davide-coccomini commented 2 years ago

The different types of EfficientNet have a different number of layers and transform the input image into feature maps of different sizes and channel counts. The first thing to do is to figure out the sizes of the intermediate feature maps produced by the EfficientNet you want to use. You can find many details in this simple article: https://towardsdatascience.com/complete-architectural-details-of-all-efficientnet-models-5fd5b736142

The deeper you go into the network, the more channels there are and the smaller the feature maps become. Translated into transformer terms: the deeper the layer from which we extract the features, the more patches we get, but each patch is smaller.

For example, if I am using EfficientNet B0 I can have 56x56 patches by extracting features at stage 4 as seen in the table in the Medium article. If I wanted to have smaller patches, say 14x14, I could extract them at stage 7.

If you can't find the architectural details of the network you want to use, you can also simply do some code tests.
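For example, a minimal shape probe like this one will tell you the resolution and channel count of every block. This is only a sketch, assuming the standard efficientnet_pytorch package (the one this repo vendors); the 224x224 input size is just an example, use whatever size you feed the network during training.

```python
# Sketch: probe the intermediate feature-map sizes of EfficientNet-B5.
import torch
from efficientnet_pytorch import EfficientNet

model = EfficientNet.from_pretrained('efficientnet-b5')
model.eval()

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    # Stem, as in EfficientNet.extract_features
    x = model._swish(model._bn0(model._conv_stem(x)))
    # Walk the MBConv blocks one by one and print the shape after each,
    # so you can see where the resolution halves and the channels grow.
    for idx, block in enumerate(model._blocks):
        drop_connect_rate = model._global_params.drop_connect_rate
        if drop_connect_rate:
            drop_connect_rate *= float(idx) / len(model._blocks)
        x = block(x, drop_connect_rate=drop_connect_rate)
        print(f"block {idx:2d}: {tuple(x.shape)}")
```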

To extract the features (and therefore the patches) at a given stage, we created a function called "extract_features_at_block", which you can find here: https://github.com/davide-coccomini/Combining-EfficientNet-and-Vision-Transformers-for-Video-Deepfake-Detection/blob/main/cross-efficient-vit/efficient_net/efficientnet_pytorch/model.py
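As a rough usage sketch (check model.py for the exact signature and import path, both are assumptions here), the idea is to ask the EfficientNet for the feature map at a chosen block instead of the final one:

```python
# Sketch only: assumes extract_features_at_block(inputs, block_num) runs the stem
# and the blocks up to block_num and returns that intermediate feature map.
# The import path depends on where you run this relative to cross-efficient-vit/.
import torch
from efficient_net.efficientnet_pytorch import EfficientNet

efficient_net = EfficientNet.from_pretrained('efficientnet-b5')
dummy = torch.randn(1, 3, 224, 224)

# Block index 16 is just an example value; pick it from the shape probe above.
features = efficient_net.extract_features_at_block(dummy, 16)
print(features.shape)  # the channels and resolution you must match in the transformer
```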

Once you know which block you want to extract the patches from, just specify it as we did on lines 268 and 269 of the Cross Efficient Vision Transformer (https://github.com/davide-coccomini/Combining-EfficientNet-and-Vision-Transformers-for-Video-Deepfake-Detection/blob/main/cross-efficient-vit/cross_efficient_vit.py) and call extract_features_at_block in the ImageEmbedder.
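For the dimension changes themselves, here is a hedged sketch (not the repo's actual code; the names and constructor arguments in cross_efficient_vit.py differ) of how the numbers have to line up once you know the feature-map shape of your chosen block. If patch_dim does not match the Linear layer's input size, forward() fails exactly at to_patch_embedding, which is the behaviour you are seeing.

```python
# Sketch, assuming block N of EfficientNet-B5 outputs (1, C, H, W) feature maps.
# All numeric values below are example placeholders, not the repo's settings.
import torch
from torch import nn
from einops.layers.torch import Rearrange

feature_channels = 176   # C printed by the shape probe for your chosen block
feature_size     = 14    # H = W printed by the shape probe
patch_size       = 7     # must divide feature_size evenly
dim              = 1024  # transformer embedding dimension

num_patches = (feature_size // patch_size) ** 2
patch_dim   = feature_channels * patch_size * patch_size

to_patch_embedding = nn.Sequential(
    # Split the CxHxW feature map into (H/p * W/p) flattened patches ...
    Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=patch_size, p2=patch_size),
    # ... and project each flattened patch to the transformer dimension.
    nn.Linear(patch_dim, dim),
)

features = torch.randn(1, feature_channels, feature_size, feature_size)
tokens = to_patch_embedding(features)
print(tokens.shape)  # (1, num_patches, dim)
```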

This process can also be done with the Efficient Vision Transformer, but it will require more changes to our code because, in that case, we did not need to extract features from a specific block, only from the last one, so the code was not designed to work that way.

I hope my suggestions will be useful for your research.

winterfell2021 commented 2 years ago

@davide-coccomini Thanks for your kind suggestions, they help me a lot! I wonder if you have any sponsorship links; I would like to buy you a coffee.