
CIFAKE: Comparing Classifiers on FAKE vs. REAL Images

Repository link: https://github.com/MinaAlmasi/CIFAKE-image-classifiers

This repository forms the self-assigned assignment 4 by Mina Almasi (202005465) in the subject Visual Analytics, Cultural Data Science, F2023.

The repository aims to investigate the utility of artificially generated images as an alternative to data augmentation when training classifiers to predict real-life images. For this purpose, the CIFAKE dataset (Bird & Lotfi, 2023) is used.

Data

The CIFAKE dataset contains 60,000 synthetically generated images designed to be equivalent to the CIFAR-10 dataset, along with the 60,000 original CIFAR-10 images (Krizhevsky, 2009). The synthetic images were created with the text-to-image model Stable Diffusion v1-4. Examples of these artificial images are shown below.

Figure by Bird & Lotfi (2023)

Experimental Pipeline and Motivation

The first step when investigating any cultural project with image analysis is to acquire the data needed to answer our questions. However, for problems such as classification that require an abundance of data, limited access to data becomes problematic. This is usually addressed with data augmentation, i.e., the creation of new, slightly modified versions of existing data (e.g., by rotating, cropping, or flipping the images). With the emergence of generative image models, it is relevant to explore the utility of artificially generated images as an alternative to data augmentation.
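For context, classic data augmentation can be expressed in a few lines of TensorFlow. The snippet below is a minimal sketch; the specific layers and parameters are illustrative and not taken from this repository:

```python
import tensorflow as tf

# Minimal sketch of classic data augmentation in Keras (illustrative only):
# each layer creates slightly modified versions of the input images.
augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),   # mirror images left/right
    tf.keras.layers.RandomRotation(0.1),        # rotate by up to ~36 degrees
    tf.keras.layers.RandomZoom(0.1),            # zoom in/out (a crop-like effect)
])

# Applied on the fly during training, e.g. as the first block of a model:
# model = tf.keras.Sequential([augmentation, ...])
```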

Therefore, this project concretely aims to assess whether the CIFAKE artificial images can be used to train classifiers that would also perform well on the CIFAR-10 images.

For this purpose, two experiments are conducted:

(E1) Training Classifiers on REAL vs FAKE Data

In experiment 1, three classifiers are trained separately on each dataset (FAKE and REAL) using TensorFlow. These classifiers increase in complexity:

  1. Simple Neural Network
  2. CNN with the LeNet architecture (see also Wiki/LeNet; a minimal Keras sketch is shown after this list)
  3. Pre-trained VGG-16.
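
As a point of reference, a LeNet-style CNN for 32×32 CIFAR-type images can be written in Keras as sketched below. The layer sizes follow the classic LeNet-5 layout and are illustrative; the exact architectures trained in this repository may differ:

```python
import tensorflow as tf

# Minimal sketch of a LeNet-style CNN for 32x32x3 images and 10 classes.
# Filter sizes follow the classic LeNet-5 layout; the exact architecture
# used in this repository may differ.
def build_lenet(input_shape=(32, 32, 3), n_classes=10):
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(6, kernel_size=5, activation="relu",
                               padding="same", input_shape=input_shape),
        tf.keras.layers.AveragePooling2D(pool_size=2),
        tf.keras.layers.Conv2D(16, kernel_size=5, activation="relu"),
        tf.keras.layers.AveragePooling2D(pool_size=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(120, activation="relu"),
        tf.keras.layers.Dense(84, activation="relu"),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",  # assumes one-hot labels
                  metrics=["accuracy"])
    return model
```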

(E2) Testing FAKE Classifiers on REAL Test Data

In experiment 2, the best performing FAKE classifier will be evaluated on the REAL test dataset to see whether its performance transfers across datasets.
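
In Keras terms, this cross-dataset evaluation amounts to loading the saved FAKE-trained model and evaluating it on the REAL test split. The sketch below uses hypothetical file and directory names, not the ones used by this repository:

```python
import tensorflow as tf

# Sketch of cross-dataset evaluation (file/directory names are hypothetical):
# a model trained on FAKE images is evaluated on the REAL test split.
model = tf.keras.models.load_model("models/FAKE_LeNet.keras")

real_test = tf.keras.utils.image_dataset_from_directory(
    "images/REAL/test",          # assumed layout: one subfolder per class
    image_size=(32, 32),
    label_mode="categorical",
    shuffle=False,
)

loss, accuracy = model.evaluate(real_test)
print(f"REAL test accuracy of FAKE-trained model: {accuracy:.2f}")
```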

Reproducibility

To reproduce the results, follow the instructions in the Pipeline section.

NB! Be aware that training the models is computationally heavy. Cloud computing (e.g., UCloud) with a high amount of RAM (or a good GPU) is encouraged.

Project Structure

The repository is structured as such:

| File/Folder | Description |
| --- | --- |
| E1_results | Results from experiment 1 (E1): model histories, individual loss/accuracy curves, evaluation metrics. |
| E1_visualisations | Visualisations made from the results of experiment 1 (E1). |
| E2_results | Results from experiment 2 (E2): evaluation metrics of two FAKE classifiers on the REAL test data. |
| E2_visualisations | Visualisations made from the results of experiment 2 (E2). |
| src | Scripts for creating metadata for the dataset, running classifications, creating visualisations, and doing the final evaluation. |
| requirements.txt | Necessary packages to be installed. |
| setup.sh | Run to install requirements.txt within a newly created env. |
| run.sh | Run to reproduce the entire pipeline: creating metadata, running classifications, evaluating classifiers, and making visualisations. |
| run-X.sh | 3 separate bash scripts to run only the model training and evaluation (E1). |

Pipeline

The pipeline has been tested on Ubuntu v22.10, Python v3.10.7 (UCloud, Coder Python 1.77.3). Python's venv needs to be installed for the pipeline to work.

Setup

Prior to running the pipeline, first download the CIFAKE dataset from Kaggle. Ensure that the data follows the structure and naming conventions described in images/README.md.
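
A quick sanity check of the downloaded data can look like the sketch below. The folder layout and file extension assumed here are illustrative; defer to images/README.md for the actual conventions:

```python
from pathlib import Path

# Quick sanity check of the downloaded CIFAKE data (folder layout and the
# .jpg extension are assumed; see images/README.md for the actual conventions).
data_root = Path("images")

for subset in ["FAKE", "REAL"]:
    for split in ["train", "test"]:
        split_dir = data_root / subset / split
        n_images = len(list(split_dir.rglob("*.jpg"))) if split_dir.exists() else 0
        print(f"{subset}/{split}: {n_images} images")
```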

Secondly, create a virtual environment (env) and install necessary requirements by running:

bash setup.sh

Running Experimental Pipeline

To run the entire experimental pipeline, type the following in the terminal:

bash run.sh

Training Models Separately

If you wish to run the model training and evaluation for each model framework separately, you can run the run-X.sh scripts. For instance:

bash run-VGG16.sh

Results

The results are shown below. Please note that the model prefix FAKE or REAL refers to whether the model has been trained on the FAKE or REAL dataset.

(E1) Loss and Accuracy Curves

For the loss and accuracy curves below, note that the six models have not run for the same number of epochs due to a strict early-stopping callback, which stops training if the validation accuracy does not improve for more than 2 epochs.
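
The early-stopping behaviour described above corresponds to a Keras callback along these lines (a sketch; the exact arguments used in the training scripts may differ):

```python
import tensorflow as tf

# Early stopping as described above: halt training if validation accuracy
# does not improve for more than 2 epochs (exact arguments may differ from
# the ones used in this repository's training scripts).
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy",
    patience=2,
    restore_best_weights=True,
)

# passed to model.fit(..., callbacks=[early_stopping])
```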

Neural Network

LeNet

VGG16

In general, the LeNet and NN seem to fit the data well compared to the VGG16, which shows signs of overfitting: its training loss continuously drops while its validation loss increases slightly. Although the REAL LeNet also shows signs of this (with an upward spike in validation loss at the 8th epoch and again at the last epoch), it is less prominent.

(E1) Evaluation Metrics: F1-score

The F1 scores (and the overall accuracy) for all models are shown in the table below. For precision and recall metrics, please check the individual metrics.txt files in the E1_results folder.

| Model | Airplane | Automobile | Bird | Cat | Deer | Dog | Frog | Horse | Ship | Truck | Accuracy | Macro_Avg | Weighted_Avg | Epochs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| REAL VGG16 | 0.65 | 0.69 | 0.52 | 0.48 | 0.54 | 0.57 | 0.67 | 0.65 | 0.72 | 0.68 | 0.62 | 0.62 | 0.62 | 10 |
| FAKE VGG16 | 0.86 | 0.87 | 0.84 | 0.78 | 0.91 | 0.73 | 0.94 | 0.87 | 0.84 | 0.85 | 0.85 | 0.85 | 0.85 | 13 |
| FAKE LeNet | 0.86 | 0.89 | 0.80 | 0.77 | 0.89 | 0.70 | 0.95 | 0.84 | 0.82 | 0.87 | 0.84 | 0.84 | 0.84 | 11 |
| REAL LeNet | 0.68 | 0.75 | 0.47 | 0.48 | 0.58 | 0.48 | 0.72 | 0.71 | 0.74 | 0.69 | 0.63 | 0.63 | 0.63 | 18 |
| REAL NN | 0.36 | 0.45 | 0.29 | 0.21 | 0.32 | 0.34 | 0.36 | 0.41 | 0.46 | 0.46 | 0.37 | 0.37 | 0.37 | 20 |
| FAKE NN | 0.55 | 0.74 | 0.58 | 0.52 | 0.67 | 0.43 | 0.55 | 0.55 | 0.61 | 0.63 | 0.59 | 0.58 | 0.58 | 20 |

Overall, the macro-averaged F1 scores are higher for the models trained and tested on the FAKE dataset. A possible explanation is that the synthetic dataset is less complex and less noisy than the real images.

(E2) Evaluating FAKE Classifiers on REAL Test Data

Since the FAKE LeNet (macro avg F1 = 0.84) and FAKE VGG16 (macro avg F1 = 0.85) performed similarly, both are evaluated on the REAL CIFAR-10 test dataset. The table below shows the F1-scores:

| Model | Airplane | Automobile | Bird | Cat | Deer | Dog | Frog | Horse | Ship | Truck | Accuracy | Macro_Avg | Weighted_Avg | Epochs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FAKE LeNet | 0.38 | 0.39 | 0.33 | 0.28 | 0.27 | 0.30 | 0.11 | 0.41 | 0.56 | 0.46 | 0.36 | 0.35 | 0.35 | 11 |
| FAKE VGG16 | 0.46 | 0.44 | 0.37 | 0.34 | 0.37 | 0.39 | 0.17 | 0.48 | 0.57 | 0.53 | 0.42 | 0.41 | 0.41 | 18 |

Interestingly, the FAKE VGG16 (macro avg F1 = 0.42) tested on the REAL data outperforms the REAL NN (macro avg F1 = 0.37). This is surprising, considering that the loss curves of the VGG16 showed signs of overfitting. A possible explanation is that VGG16 is pre-trained and likely already contains image embeddings close to the 10 classes, making it easier to fit a classifier on top of it. Although the FAKE models do not outperform the other REAL models (REAL LeNet and REAL VGG16), their performance is well above chance level for most classes, which looks promising for the use of artificial images as an alternative to data augmentation.
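
The point about VGG16 being pre-trained corresponds to a standard transfer-learning setup, sketched below with ImageNet weights and a frozen convolutional base. The classifier head shown here is illustrative and may not match the one used in this repository:

```python
import tensorflow as tf

# Sketch of a transfer-learning setup with a pre-trained VGG16 base
# (ImageNet weights, convolutional layers frozen); the classifier head
# shown here is illustrative and may differ from the repository's setup.
base = tf.keras.applications.VGG16(weights="imagenet",
                                   include_top=False,
                                   input_shape=(32, 32, 3))
base.trainable = False  # only the new classification head is trained

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```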

Author

This repository was created by Mina Almasi.

References

Krizhevsky, A. (2009). Learning Multiple Layers of Features from Tiny Images.

Bird, J. J., & Lotfi, A. (2023). CIFAKE: Image Classification and Explainable Identification of AI-Generated Synthetic Images. arXiv preprint. https://arxiv.org/abs/2303.14126