
CIFAKE: Comparing Classifiers on FAKE vs. REAL Images

Repository link: https://github.com/MinaAlmasi/CIFAKE-image-classifiers

This repository forms the self-assigned assignment 4 by Mina Almasi (202005465) in the subject Visual Analytics, Cultural Data Science, F2023.

The repository aims to investigate the utility of artificially generated images as an alternative to data augmentation when training classifiers to predict real-life images. For this purpose, the CIFAKE dataset (Bird & Lotfi, 2023) is used.

Data

The CIFAKE dataset contains 60,000 synthetically generated images designed to be equivalent to the CIFAR-10 dataset, along with the 60,000 original CIFAR-10 images (Krizhevsky, 2009). The synthetic images were created with the text-to-image model Stable Diffusion v1-4. Examples of these artificial images are shown below.

Figure by Bird & Lotfi (2023)

Experimental Pipeline and Motivation

The first step when investigating any cultural project with image analysis is to acquire the data needed to answer our questions. However, for problems such as classification that require an abundance of data, limited access to data becomes problematic. This is usually addressed with data augmentation, i.e., the creation of new, slightly modified versions of existing data (e.g., by rotating, cropping, or flipping the images). With the emergence of generative image models, it is relevant to explore the utility of artificially generated images as an alternative to data augmentation.
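For context, classic data augmentation can be expressed in a few lines of TensorFlow. The snippet below is a minimal sketch; the specific layers and parameters are illustrative and not taken from this repository:

```python
import tensorflow as tf

# Minimal sketch of classic data augmentation in Keras (illustrative only):
# each layer creates slightly modified versions of the input images.
augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),   # mirror images left/right
    tf.keras.layers.RandomRotation(0.1),        # rotate by up to ~36 degrees
    tf.keras.layers.RandomZoom(0.1),            # zoom in/out (a crop-like effect)
])

# Applied on the fly during training, e.g. as the first block of a model:
# model = tf.keras.Sequential([augmentation, ...])
```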

Therefore, this project concretely aims to assess whether the CIFAKE artificial images can be used to train classifiers that would also perform well on the CIFAR-10 images.

For this purpose, two experiments are conducted:

(E1) Training Classifiers on REAL vs FAKE Data

In experiment 1, three classifiers are trained separately on each dataset (FAKE and REAL) using TensorFlow. These classifiers increase in complexity:

  1. Simple Neural Network
  2. CNN with the LeNet architecture (see also Wiki/LeNet; a minimal Keras sketch is shown after this list)
  3. Pre-trained VGG-16.
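
As a point of reference, a LeNet-style CNN for 32×32 CIFAR-type images can be written in Keras as sketched below. The layer sizes follow the classic LeNet-5 layout and are illustrative; the exact architectures trained in this repository may differ:

```python
import tensorflow as tf

# Minimal sketch of a LeNet-style CNN for 32x32x3 images and 10 classes.
# Filter sizes follow the classic LeNet-5 layout; the exact architecture
# used in this repository may differ.
def build_lenet(input_shape=(32, 32, 3), n_classes=10):
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(6, kernel_size=5, activation="relu",
                               padding="same", input_shape=input_shape),
        tf.keras.layers.AveragePooling2D(pool_size=2),
        tf.keras.layers.Conv2D(16, kernel_size=5, activation="relu"),
        tf.keras.layers.AveragePooling2D(pool_size=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(120, activation="relu"),
        tf.keras.layers.Dense(84, activation="relu"),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",  # assumes one-hot labels
                  metrics=["accuracy"])
    return model
```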

(E2) Testing FAKE Classifiers on REAL Test Data

In experiment 2, the best performing FAKE classifier will be evaluated on the REAL test dataset to see whether its performance transfers across datasets.
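
In Keras terms, this cross-dataset evaluation amounts to loading the saved FAKE-trained model and evaluating it on the REAL test split. The sketch below uses hypothetical file and directory names, not the ones used by this repository:

```python
import tensorflow as tf

# Sketch of cross-dataset evaluation (file/directory names are hypothetical):
# a model trained on FAKE images is evaluated on the REAL test split.
model = tf.keras.models.load_model("models/FAKE_LeNet.keras")

real_test = tf.keras.utils.image_dataset_from_directory(
    "images/REAL/test",          # assumed layout: one subfolder per class
    image_size=(32, 32),
    label_mode="categorical",
    shuffle=False,
)

loss, accuracy = model.evaluate(real_test)
print(f"REAL test accuracy of FAKE-trained model: {accuracy:.2f}")
```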

Reproducibility

To reproduce the results, follow the instructions in the Pipeline section.

NB! Be aware that training the models is computationally heavy. Cloud computing (e.g., UCloud) with a high amount of RAM (or a good GPU) is encouraged.

Project Structure

The repository is structured as such:

| File/Folder | Description |
| --- | --- |
| E1_results | Results from experiment 1 (E1): model histories, individual loss/accuracy curves, evaluation metrics. |
| E1_visualisations | Visualisations made from the results of experiment 1 (E1). |
| E2_results | Results from experiment 2 (E2): evaluation metrics of two FAKE classifiers on the REAL test data. |
| E2_visualisations | Visualisations made from the results of experiment 2 (E2). |
| src | Scripts for creating metadata for the dataset, running classifications, creating visualisations, and doing the final evaluation. |
| requirements.txt | Necessary packages to be installed. |
| setup.sh | Run to install requirements.txt within a newly created env. |
| run.sh | Run to reproduce the entire pipeline: creating metadata, running classifications, evaluating classifiers, and making visualisations. |
| run-X.sh | 3 separate bash scripts to run only the model training and evaluation (E1). |

Pipeline

The pipeline has been tested on Ubuntu v22.10, Python v3.10.7 (UCloud, Coder Python 1.77.3). Python's venv needs to be installed for the pipeline to work.

Setup

Prior to running the pipeline, first download the CIFAKE dataset from Kaggle. Ensure that the data follows the structure and naming conventions described in images/README.md.
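
A quick sanity check of the downloaded data can look like the sketch below. The folder layout and file extension assumed here are illustrative; defer to images/README.md for the actual conventions:

```python
from pathlib import Path

# Quick sanity check of the downloaded CIFAKE data (folder layout and the
# .jpg extension are assumed; see images/README.md for the actual conventions).
data_root = Path("images")

for subset in ["FAKE", "REAL"]:
    for split in ["train", "test"]:
        split_dir = data_root / subset / split
        n_images = len(list(split_dir.rglob("*.jpg"))) if split_dir.exists() else 0
        print(f"{subset}/{split}: {n_images} images")
```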

Secondly, create a virtual environment (env) and install necessary requirements by running:

bash setup.sh

Running Experimental Pipeline

To run the entire experimental pipeline, type the following in the terminal:

bash run.sh

Training Models Separately

If you wish to run the model training and evaluation for each model framework separately, you can run the run-X.sh scripts. For instance:

bash run-VGG16.sh

Results

The results are shown below. Please note that the model prefix FAKE or REAL refers to whether the model has been trained on the FAKE or REAL dataset.

(E1) Loss and Accuracy Curves

For the loss and accuracy curves below, note that the six models have not run for the same number of epochs due to a strict early-stopping callback, which stops training if the validation accuracy does not improve for more than 2 epochs.
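
The early-stopping behaviour described above corresponds to a Keras callback along these lines (a sketch; the exact arguments used in the training scripts may differ):

```python
import tensorflow as tf

# Early stopping as described above: halt training if validation accuracy
# does not improve for more than 2 epochs (exact arguments may differ from
# the ones used in this repository's training scripts).
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy",
    patience=2,
    restore_best_weights=True,
)

# passed to model.fit(..., callbacks=[early_stopping])
```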

Neural Network

LeNet

VGG16

In general, the LeNet and NN seem to fit the data well compared to the VGG16, which shows signs of overfitting: its training loss continuously drops while its validation loss increases slightly. Although the REAL LeNet also shows signs of this (with an upward spike in validation loss at the 8th epoch and again at the last epoch), it is less prominent.

(E1) Evaluation Metrics: F1-score

The F1 scores (and the overall accuracy) for all models are shown in the table below. For precision and recall metrics, please check the individual metrics.txt files in the E1_results folder.

| Model | Airplane | Automobile | Bird | Cat | Deer | Dog | Frog | Horse | Ship | Truck | Accuracy | Macro_Avg | Weighted_Avg | Epochs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| REAL VGG16 | 0.65 | 0.69 | 0.52 | 0.48 | 0.54 | 0.57 | 0.67 | 0.65 | 0.72 | 0.68 | 0.62 | 0.62 | 0.62 | 10 |
| FAKE VGG16 | 0.86 | 0.87 | 0.84 | 0.78 | 0.91 | 0.73 | 0.94 | 0.87 | 0.84 | 0.85 | 0.85 | 0.85 | 0.85 | 13 |
| FAKE LeNet | 0.86 | 0.89 | 0.80 | 0.77 | 0.89 | 0.70 | 0.95 | 0.84 | 0.82 | 0.87 | 0.84 | 0.84 | 0.84 | 11 |
| REAL LeNet | 0.68 | 0.75 | 0.47 | 0.48 | 0.58 | 0.48 | 0.72 | 0.71 | 0.74 | 0.69 | 0.63 | 0.63 | 0.63 | 18 |
| REAL NN | 0.36 | 0.45 | 0.29 | 0.21 | 0.32 | 0.34 | 0.36 | 0.41 | 0.46 | 0.46 | 0.37 | 0.37 | 0.37 | 20 |
| FAKE NN | 0.55 | 0.74 | 0.58 | 0.52 | 0.67 | 0.43 | 0.55 | 0.55 | 0.61 | 0.63 | 0.59 | 0.58 | 0.58 | 20 |

Overall, the macro-averaged F1 scores are higher for the models trained and tested on the FAKE dataset. A possible explanation is that the synthetic dataset is less complex and less noisy than the real images.

(E2) Evaluating FAKE Classifiers on REAL Test Data

Since the FAKE LeNet (macro avg F1 = 0.84) and FAKE VGG16 (macro avg F1 = 0.85) performed similarly, both are evaluated on the REAL CIFAR-10 test dataset. The table below shows the F1-scores:

| Model | Airplane | Automobile | Bird | Cat | Deer | Dog | Frog | Horse | Ship | Truck | Accuracy | Macro_Avg | Weighted_Avg | Epochs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FAKE LeNet | 0.38 | 0.39 | 0.33 | 0.28 | 0.27 | 0.30 | 0.11 | 0.41 | 0.56 | 0.46 | 0.36 | 0.35 | 0.35 | 11 |
| FAKE VGG16 | 0.46 | 0.44 | 0.37 | 0.34 | 0.37 | 0.39 | 0.17 | 0.48 | 0.57 | 0.53 | 0.42 | 0.41 | 0.41 | 18 |

Interestingly, the FAKE VGG16 (macro avg F1 = 0.42) tested on the REAL data outperforms the REAL NN (macro avg F1 = 0.37). This is surprising, considering that the loss curves of the VGG16 showed signs of overfitting. A possible explanation is that VGG16 is pre-trained and likely already contains image embeddings close to the 10 classes, making it easier to fit a classifier on top of it. Although the FAKE models do not outperform the other REAL models (REAL LeNet and REAL VGG16), their performance is well above chance level for most classes, which looks promising for the use of artificial images as an alternative to data augmentation.
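
The point about VGG16 being pre-trained corresponds to a standard transfer-learning setup, sketched below with ImageNet weights and a frozen convolutional base. The classifier head shown here is illustrative and may not match the one used in this repository:

```python
import tensorflow as tf

# Sketch of a transfer-learning setup with a pre-trained VGG16 base
# (ImageNet weights, convolutional layers frozen); the classifier head
# shown here is illustrative and may differ from the repository's setup.
base = tf.keras.applications.VGG16(weights="imagenet",
                                   include_top=False,
                                   input_shape=(32, 32, 3))
base.trainable = False  # only the new classification head is trained

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```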

Author

This repository was created by Mina Almasi.

References

Krizhevsky, A. (2009). Learning Multiple Layers of Features from Tiny Images.

Bird, J. J., & Lotfi, A. (2023). CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images. arXiv preprint. https://arxiv.org/abs/2303.14126