FayKong / PPSN2020

Improving Imbalanced Classification by Anomaly Detection
GNU General Public License v3.0
2 stars 1 forks source link

Improving Imbalanced Classification by Anomaly Detection

This repository contains the code for introducing two additional attributes to imbalanced datasets and thus improving the imbalanced classification.

DOI

Introduction

The provided .py file is for the convenience of reproducing the main results of our paper Improving Imbalanced Classification by Anomaly Detection (Kong, Kowalczyk, Menzel & Bäck, 2020). The code is implemented in Python 3.0 and the packages required for implementation are given in the next section.

The data used in this paper contains two parts, 2D chess data and imbalanced benchmark datasets. The codes to experiment with the 2D chess dataset are given in folder 2D chess. chess_data_generate.py in this folder is for generating the 2D chess data used in the paper. The codes to experiment with benchmark datasets are given in folder Imbalanced Benchmark and the benchmark datasets are given in folder data. *_four_types.py files in folder 2D chess and Imbalanced Benchmark provide the code to identify the four types of samples, and *_LOF.py files provide the code to calculate the LOF score for the given datasets. *_added_*.py files provide the code to introduce the two additional attributes and *_added_resampling_*.py files provide the code to resample the datasets with two additional attributes. The experimental results are given in folder results. In the following, we will decribe the detailed technical requiremnets and how to run our code step by step.

Requirements

Scikit-Learn and Numpy software packages are used to obtain the two proposed additional attributes (LOF score and four types of samples). Scikit-Learn software package is also used for performing stratified cross-validation and classification. Imbalanced Learn software package provides the resampling techniques to resample the imbalanced datasets. Matplotlib is used to visualize the 2d chess dataset in different scenarios. The four required libraries can be installed by running pip install -r requirements.txt from the main directory via the command line.

Packages Description
Imbalanced Learn For implementing resampling techniques, e.g. SMOTE, ADASYN etc.
Scikit-Learn For calculating LOF score, performing stratified CV and classification.
Numpy For efficiently dealing with data.
Matplotlib For plotting.

The introductions on how to run our code step by step will be given in the following sections.

1. Calculating the Two Proposed Additional Attributes

The first step in our experimental setup is to calculate the two additional attributes for every dataset. Running the *_four_types.py file in both '2D chess' and 'Imbalanced Benchmark' folders can achieve the first introduced additional attribute 'four types of samples', while running *_LOF.py in both folders can achieve the second additional attribute 'LOF score'.

2. Considering Different Combinations

After introducing the two additional attributes, we consider different scenarios of resampling techniques and whether to add additional attributes. This can be achieved by running *_DT/SVM.py, *_added_*.py, *_added_resampling_*.py files in both '2D chess' and 'Imbalanced Benchmark' folders. The experimental results are shown in folder results.

3. Visualization on 2D Chess Dataset

The chess_LOF_plot.py file is used to achieve the figure in our paper.

How to Cite

paper Reference

Kong, J., Kowalczyk, W., Menzel, S. and Bäck, T., 2020, September. Improving imbalanced classification by anomaly detection. In International Conference on Parallel Problem Solving from Nature (pp. 512-523). Springer, Cham.

BibTex Reference

@inproceedings{kong2020improving,
  title={Improving imbalanced classification by anomaly detection},
  author={Kong, Jiawen and Kowalczyk, Wojtek and Menzel, Stefan and B{\"a}ck, Thomas},
  booktitle={International Conference on Parallel Problem Solving from Nature},
  pages={512--523},
  year={2020},
  organization={Springer}
}

Acknowledgements

This research has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement number 766186 (ECOLE).