This repo contains an official PyTorch implementation of our paper: IML-ViT: Benchmarking Image Manipulation Localization by Vision Transformer.
Ubuntu LTS 20.04.1
CUDA 11.7 + cudnn 8.4.0
Python 3.8
PyTorch 1.11
Currently, you can follow the tutorial to experience the running pipeline of IML-ViT. The only difference from the Colab version is the lack of a playground for testing online images.
pip install -r requirements.txt
First, place the pre-trained checkpoint at ./checkpoints/iml-vit_checkpoint.pth.
Now training code for IML-ViT is released!
First, you may prepare the dataset to fit the protocol of our dataloader for a quick start. Alternatively, you can design your own dataloader and modify the corresponding interfaces. Two dataset classes are provided:
- json_dataset, which gets the input image and the corresponding ground truth from a JSON file with a protocol like this:
[
    [
        "/Dataset/CASIAv2/Tp/Tp_D_NRN_S_N_arc00013_sec00045_11700.jpg",
        "/Dataset/CASIAv2/Gt/Tp_D_NRN_S_N_arc00013_sec00045_11700_gt.png"
    ],
    ......
    [
        "/Dataset/CASIAv2/Au/Au_nat_30198.jpg",
        "Negative"
    ],
    ......
]
where "Negative" represents an all-black ground truth (a fully authentic image), so no mask path is needed.
- mani_dataset, which loads image and ground-truth pairs automatically from a directory with sub-directories named Tp (for input images) and Gt (for ground truths). This class generates the pairs using the sorted os.listdir() function. You can take images\sample_iml_dataset as an example.

The dataset generates an edge_mask when the edge_width parameter is specified. In that case it returns 3 objects (image, GT, edge mask); with edge_width=None it returns only 2. Set if_return_shape=True to have the image shape returned as well.

Thus, you may either arrange your dataset in the mani_dataset layout or generate a JSON file for each dataset you want to train or test on. We have prepared a naive IML transforms class and an edge mask generator class. You can call them directly through json_dataset or mani_dataset in ./utils/datasets.py to check whether your revision is correct.
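If your data already sits in Tp and Gt sub-directories, a JSON file in the protocol above can be generated with a few lines of Python. The sketch below is only an illustration: build_json_index and its arguments are hypothetical names, not part of this repository, and it assumes that the sorted file names in Tp and Gt line up one-to-one, which is the same assumption the sorted os.listdir() pairing relies on.

```python
import json
import os


def build_json_index(tp_dir, gt_dir, authentic_dir=None):
    """Build [image, ground_truth] pairs in the JSON protocol described above.

    Hypothetical helper for illustration; not part of the IML-ViT code base.
    """
    tp_files = sorted(os.listdir(tp_dir))
    gt_files = sorted(os.listdir(gt_dir))
    if len(tp_files) != len(gt_files):
        raise ValueError("Tp and Gt must contain the same number of files")

    # Pair the i-th sorted image with the i-th sorted ground-truth mask.
    pairs = [
        [os.path.join(tp_dir, tp_name), os.path.join(gt_dir, gt_name)]
        for tp_name, gt_name in zip(tp_files, gt_files)
    ]

    # Authentic images need no mask path; the protocol uses "Negative" instead.
    if authentic_dir is not None:
        for name in sorted(os.listdir(authentic_dir)):
            pairs.append([os.path.join(authentic_dir, name), "Negative"])
    return pairs


def write_json_index(pairs, json_path):
    """Write the pairs to disk as a JSON list of two-element lists."""
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(pairs, f, indent=2)


# Example usage (paths are placeholders):
# pairs = build_json_index("/Dataset/CASIAv2/Tp", "/Dataset/CASIAv2/Gt", "/Dataset/CASIAv2/Au")
# write_json_index(pairs, "casiav2.json")
```

Because the pairing is purely positional, it is worth spot-checking a few entries of the generated file before training, especially when the mask file names carry suffixes such as _gt.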
You may follow the instructions to download the Masked Autoencoder pre-trained weights before training. Thanks for their impressive work and their open-source contributions!
The main entry point is main_train.py. You may use the following script to launch training on Linux:
torchrun \
--standalone \
--nnodes=1 \
--nproc_per_node=1 \
main_train.py \
--world_size 1 \
--batch_size 1 \
--data_path "<Your custom dataset path>/CASIA2.0" \
--epochs 200 \
--lr 1e-4 \
--min_lr 5e-7 \
--weight_decay 0.05 \
--edge_lambda 20 \
--predict_head_norm "BN" \
--vit_pretrain_path "<Your path to pre-trained weights >/mae_pretrain_vit_base.pth" \
--test_data_path "<Your custom dataset path>/CASIA1.0" \
--warmup_epochs 4 \
--output_dir ./output_dir/ \
--log_dir ./output_dir/ \
--accum_iter 8 \
--seed 42 \
--test_period 4 \
--num_workers 4 \
2> train_error.log 1>train_log.log
- data_path is the path to the training dataset.
- test_data_path is the path to the dataset used for testing during the training process.
- vit_pretrain_path is the path to the MAE pre-trained ViT weights.

You should replace the paths in <> with your own. The default settings are the generally recommended training parameters, but if you have a more powerful device, you can increase the batch size and adjust the other parameters accordingly.
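When changing the batch size, it can help to keep track of how many samples contribute to each optimizer step. The sketch below assumes that --batch_size is the per-process batch size and that --accum_iter is the number of gradient-accumulation steps, as is common in MAE-style training scripts; this README does not define either flag, so treat both as assumptions and check python main_train.py -h. The function names are hypothetical.

```python
def effective_batch_size(batch_size, accum_iter, world_size):
    """Samples per optimizer step, assuming gradient accumulation across
    accum_iter iterations and world_size data-parallel processes."""
    return batch_size * accum_iter * world_size


def accum_iter_for(target, batch_size, world_size=1):
    """Accumulation steps needed to reach a target effective batch size."""
    per_iteration = batch_size * world_size
    if target % per_iteration != 0:
        raise ValueError("target must be a multiple of batch_size * world_size")
    return target // per_iteration


# Values from the command above: --batch_size 1 --accum_iter 8 --world_size 1
print(effective_batch_size(1, 8, 1))

# Keeping the same effective batch size with --batch_size 4 on one GPU
print(accum_iter_for(8, batch_size=4))
```

Under these assumptions, the example command accumulates to 8 samples per step, so moving to --batch_size 4 on a single GPU would pair with --accum_iter 2 to leave the optimization roughly unchanged.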
Note that the predict_head_norm parameter, i.e. the normalization type of the prediction head, may greatly influence the performance of the model. We tested three types of normalization in the decoder head, and they may yield different results depending on dataset configuration and other factors. Some intuitive conclusions are as follows:
- "LN" -> Layer norm: The fastest convergence, but poor generalization performance.
- "BN" -> Batch norm: When authentic images are included during training, setting batchsize = 2 may give poor performance. If you can train with a larger batch size (e.g. an NVIDIA A40 with 48GB of memory can train with batchsize = 4), it may perform better.
- "IN" -> Instance norm: A form that can definitely converge, equivalent to a batch norm with batchsize = 1. When abnormal behavior is observed with BatchNorm, consider trying Instance Normalization. Note that in this case nn.InstanceNorm2d should be created with track_running_stats and affine set to True, rather than with the default settings in PyTorch.
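The difference between the default nn.InstanceNorm2d and the configuration recommended above can be seen in a short PyTorch sketch. The channel count and input size are illustrative values, not taken from the repository, and this is not the repository's prediction head, only a demonstration of the two constructor flags.

```python
import torch
import torch.nn as nn

channels = 8  # illustrative channel count, not taken from the repository

# PyTorch defaults: affine=False, track_running_stats=False
default_in = nn.InstanceNorm2d(channels)

# Configuration recommended above: learnable scale/shift plus running statistics
tuned_in = nn.InstanceNorm2d(channels, affine=True, track_running_stats=True)

x = torch.randn(2, channels, 16, 16)

tuned_in.train()
y_train = tuned_in(x)  # normalizes per instance and updates the running statistics

tuned_in.eval()
y_eval = tuned_in(x)   # normalizes with the stored running statistics

print(default_in.weight, default_in.running_mean)          # no parameters or buffers
print(tuned_in.weight.shape, tuned_in.running_mean.shape)  # one entry per channel
```

With the defaults, the layer has no learnable parameters and behaves identically in training and evaluation; with both flags set to True, it learns a per-channel scale and shift and switches to running statistics at evaluation time.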
We sincerely welcome reports of other strange or surprising findings about the parameter settings in the issues. These can contribute to a more comprehensive understanding of the inherent properties of IML-ViT in the research community.
For more information, you may use python main_train.py -h to see the full help list of the command arguments.
We recommend you monitor the training process with the following measures:
- Check the train_log.log file. If the training proceeds correctly, you can check the latest status at the end of this file.
- Run TensorBoard under ./output_dir with the command tensorboard --logdir ./ . Then you can view the statistics and graphs in a web browser.

You can use our Colab demo or offline demo to check the performance of our IML-ViT model. The only change needed is to replace the default checkpoint with your own.
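For the log-file check in the monitoring list above, the end of train_log.log can also be read from Python, for example inside a notebook. This is a minimal sketch with a hypothetical helper name (tail); it makes no assumption about the log format and simply returns the last lines of the file.

```python
from collections import deque


def tail(path, n=20):
    """Return the last n lines of a text file without loading it all into memory."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        # A bounded deque keeps only the most recent n lines while iterating.
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]


# Example usage (file name taken from the training command above):
# print("\n".join(tail("train_log.log", n=10)))
```

If training stops unexpectedly, the same helper can be pointed at train_error.log, which receives stderr in the training command above.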
If you want to train this model with the CASIAv2 dataset, we provide a revised version of CASIAv2 that corrects several mistakes in the original dataset provided by its authors. Details can be found in the link shown below:
If you find our work interesting or helpful, please don't hesitate to give us a star and cite our paper! Your support truly encourages us!
@misc{ma2023imlvit,
title={IML-ViT: Benchmarking Image Manipulation Localization by Vision Transformer},
author={Xiaochen Ma and Bo Du and Zhuohang Jiang and Ahmed Y. Al Hammadi and Jizhe Zhou},
year={2023},
eprint={2307.14863},
archivePrefix={arXiv},
primaryClass={cs.CV}
}