> [!CAUTION]
> This repo is under development. No hyperparameter tuning is presented here yet; hence, the current architecture is not optimal for deepfake detection.
This repo is the implementation of the paper *2D3MF: Deepfake Detection using Multi Modal Middle Fusion*.
Repository structure:

```
.
├── assets                # Images for README.md
├── LICENSE
├── README.md
├── MODEL_ZOO.md
├── CITATION.cff
├── .gitignore
├── .github

# below is for the PyPI package marlin-pytorch
├── src                   # Source code for marlin-pytorch and audio feature extractors
├── tests                 # Unittest
├── requirements.lib.txt
├── setup.py
├── init.py
├── version.txt

# below is for the paper implementation
├── configs               # Configs for experiment settings
├── TD3MF                 # 2D3MF model code
├── preprocess            # Preprocessing scripts
├── dataset               # Dataloaders
├── utils                 # Utility functions
├── train.py              # Training script
├── evaluate.py           # Evaluation script
└── requirements.txt
```
Install 2D3MF from PyPI:

```bash
pip install 2D3MF
```
Sample code snippet for feature extraction:

```python
from TD3MF.classifier import TD3MF

ckpt = "ckpt/celebvhq_marlin_deepfake_ft/last-v72.ckpt"
model = TD3MF.load_from_checkpoint(ckpt)
features = model.feature_extraction("2D3MF_Datasets/test/SampleVideo_1280x720_1mb.mp4")
```
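If you want to cache the extracted features for later experiments, you could save them with `torch.save`. This is a minimal sketch that assumes `feature_extraction` returns a `torch.Tensor` (the snippet above does not specify the return type) and reuses the same placeholder paths:

```python
import torch

from TD3MF.classifier import TD3MF

# Same placeholder checkpoint and clip as in the snippet above.
ckpt = "ckpt/celebvhq_marlin_deepfake_ft/last-v72.ckpt"
model = TD3MF.load_from_checkpoint(ckpt)
features = model.feature_extraction("2D3MF_Datasets/test/SampleVideo_1280x720_1mb.mp4")

torch.save(features, "sample_features.pt")   # cache the features on disk
reloaded = torch.load("sample_features.pt")  # reload them in a later run
```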
We provide some pretrained MARLIN checkpoints and configurations [here]().
Requirements:

Install PyTorch from the official website. Then clone the repo and install the requirements:

```bash
git clone https://github.com/aiden200/2D3MF
cd 2D3MF
pip install -e .
```
We recommend using the following unified dataset structure:

```
2D3MF_Dataset/
├── DeepfakeTIMIT
│   ├── audio/*.wav
│   └── video/*.mp4
├── DFDC
│   ├── audio/*.wav
│   └── video/*.mp4
├── FakeAVCeleb
│   ├── audio/*.wav
│   └── video/*.mp4
├── Forensics++
│   ├── audio/*.wav
│   └── video/*.mp4
└── RAVDESS
    ├── audio/*.wav
    └── video/*.mp4
```
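Before preprocessing, it can help to sanity-check that each dataset folder follows this layout. Below is a minimal sketch using only the Python standard library; the root path and dataset names mirror the tree above, so adjust them to your setup:

```python
from pathlib import Path

ROOT = Path("2D3MF_Dataset")  # dataset root as shown above; change if yours differs
DATASETS = ["DeepfakeTIMIT", "DFDC", "FakeAVCeleb", "Forensics++", "RAVDESS"]

for name in DATASETS:
    audio = sorted((ROOT / name / "audio").glob("*.wav"))
    video = sorted((ROOT / name / "video").glob("*.mp4"))
    print(f"{name}: {len(audio)} audio clips, {len(video)} video clips")

    # Assumes audio/video pairs share a filename stem; this pairing rule is an
    # assumption for illustration, not something the README specifies.
    unmatched = {v.stem for v in video} - {a.stem for a in audio}
    if unmatched:
        print(f"  warning: {len(unmatched)} video clips have no matching .wav file")
```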
Crop the face region from the raw video. Run:

```bash
python3 preprocess/preprocess_clips.py --data_dir [Dataset_Dir]
```
Extract the audio and video features. Run:

```bash
python preprocess/extract_features.py --data_dir /path/to/data --video_backbone [VIDEO_BACKBONE] --audio_backbone [AUDIO_BACKBONE]
```

`[VIDEO_BACKBONE]` can be replaced with one of the following: `efficientface`, `marlin`.

`[AUDIO_BACKBONE]` can be replaced with one of the following: `MFCC`, `eat`, `xvectors`, `resnet`, `emotion2vec`.
Optionally, add the `--Forensics` flag at the end if Forensics++ is the dataset being processed.

In our paper, we found that `eat` works best as the audio backbone.
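If you want to compare several audio backbones, one option is to wrap the command above in a small driver script. This is just a convenience sketch: it only uses the flags documented here, and the data directory is a placeholder:

```python
import subprocess

DATA_DIR = "2D3MF_Datasets"  # placeholder; point this at your dataset root
VIDEO_BACKBONE = "marlin"
AUDIO_BACKBONES = ["MFCC", "eat", "xvectors", "resnet", "emotion2vec"]

for audio_backbone in AUDIO_BACKBONES:
    cmd = [
        "python", "preprocess/extract_features.py",
        "--data_dir", DATA_DIR,
        "--video_backbone", VIDEO_BACKBONE,
        "--audio_backbone", audio_backbone,
        # Append "--Forensics" here if you are processing Forensics++.
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # raise if extraction fails for a backbone
```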
Split the train, validation, and test sets. Run:

```bash
python preprocess/gen_split.py --data_dir /path/to/data --test 0.1 --val 0.1 --feat_type [AUDIO_BACKBONE]
```
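For intuition, `--test 0.1 --val 0.1` reserves roughly 10% of the clips for testing and 10% for validation, leaving about 80% for training. The toy sketch below only illustrates that ratio arithmetic; it is not the actual logic inside `gen_split.py`:

```python
import random

clips = [f"clip_{i:04d}" for i in range(1000)]  # stand-ins for preprocessed clips
random.seed(0)
random.shuffle(clips)

n_test = int(0.1 * len(clips))
n_val = int(0.1 * len(clips))
test = clips[:n_test]
val = clips[n_test:n_test + n_val]
train = clips[n_test + n_val:]
print(len(train), len(val), len(test))  # -> 800 100 100
```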
Note that the pre-trained `video_backbone` and `audio_backbone` checkpoints can be downloaded from [MODEL_ZOO.md](MODEL_ZOO.md).
Train and evaluate the 2D3MF model. Please use the configs in `config/*.yaml` as the config file.
```bash
python evaluate.py \
    --config /path/to/config \
    --data_path /path/to/CelebV-HQ \
    --num_workers 4 \
    --batch_size 16
```
```bash
python evaluate.py \
    --config /path/to/config \
    --data_path /path/to/dataset \
    --num_workers 4 \
    --batch_size 8 \
    --marlin_ckpt pretrained/marlin_vit_base_ytf.encoder.pt \
    --epochs 300
```
For example:

```bash
python evaluate.py --config config/celebvhq_marlin_deepfake_ft.yaml --data_path 2D3MF_Datasets --num_workers 4 --batch_size 1 --marlin_ckpt pretrained/marlin_vit_small_ytf.encoder.pt --epochs 300
```
Optionally, add `--skip_train --resume /path/to/checkpoint` to skip training.
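For example, an evaluation-only run that reuses the example config above might look like this (the checkpoint path is a placeholder for your own trained checkpoint):

```bash
python evaluate.py \
    --config config/celebvhq_marlin_deepfake_ft.yaml \
    --data_path 2D3MF_Datasets \
    --num_workers 4 \
    --batch_size 1 \
    --skip_train \
    --resume /path/to/checkpoint
```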
Set a configuration file based on your hyperparameters and backbones. You can find an example config file under `config/`.
Explanation:

- `training_datasets` - list; can contain one or more datasets among `"DeepfakeTIMIT"`, `"RAVDESS"`, `"Forensics++"`, `"DFDC"`, `"FakeAVCeleb"`
- `eval_datasets` - list; can contain one or more datasets among `"DeepfakeTIMIT"`, `"RAVDESS"`, `"Forensics++"`, `"DFDC"`, `"FakeAVCeleb"`
- `learning_rate` - float, e.g. `1.00e-3`
- `num_heads` - int, number of attention heads
- `fusion` - str, choice of fusion type: `"mf"` for middle fusion and `"lf"` for late fusion
- `audio_positional_encoding` - bool, add audio positional encoding
- `hidden_layers` - int, number of hidden layers
- `lp_only` - bool, setting this to true will perform inference from the video features only
- `audio_backbone` - str, select one of the following options: `"MFCC"`, `"eat"`, `"xvectors"`, `"resnet"`, `"emotion2vec"`
- `middle_fusion_type` - str, select one of the following options: `"default"`, `"audio_refuse"`, `"video_refuse"`, `"self_attention"`, `"self_cross_attention"`
- `modality_dropout` - float, modality dropout rate
- `video_backbone` - str, select one of the following options: `"efficientface"`, `"marlin"`
Run:

```bash
tensorboard --logdir=lightning_logs/
```

The dashboard should be available at http://localhost:6006/.
This project is licensed under CC BY-NC 4.0. See [LICENSE](LICENSE) for details.
Please cite our work!
Some of the model code is based on ControlNet/MARLIN. The middle-fusion code is adapted from *Self-attention fusion for audiovisual emotion recognition with incomplete data*.
Our Audio Feature Extraction Models:
Our Video Feature Extraction Models: