This repository contains the code used for our paper. The code performs the labelling and benchmarking of the CICIDS 2017 dataset after it has been processed by our modified version of the CICFlowMeter tool.
Note that all of this is research code.
If you use the code in this repository, please cite our paper:
@inproceedings{engelen2021troubleshooting,
  title={Troubleshooting an Intrusion Detection Dataset: the CICIDS2017 Case Study},
  author={Engelen, Gints and Rimmer, Vera and Joosen, Wouter},
  booktitle={2021 IEEE Security and Privacy Workshops (SPW)},
  pages={7--12},
  year={2021},
  organization={IEEE}
}
Extended documentation for our paper can be found here.
First, head over to the website of the CICIDS 2017 dataset and download the raw version of the dataset (PCAP file format). There are 5 files in total, one for each day.
Then, run our modified version of the CICFlowMeter tool on the data obtained in the previous step:
This will generate 5 CSV files with the flows extracted from the raw PCAP files.
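If you prefer to script this step, a minimal driver could look like the sketch below. It assumes the modified tool keeps the upstream CICFlowMeter command-line interface (`cfm <input pcap> <output folder>`); the paths are placeholders.

```python
import subprocess
from pathlib import Path

# Placeholder paths: point these at your build of the modified
# CICFlowMeter and at the directory holding the five daily PCAPs.
CFM_BIN = Path("CICFlowMeter/bin/cfm")
PCAP_DIR = Path("pcaps")
CSV_DIR = Path("csv_out")

CSV_DIR.mkdir(exist_ok=True)

# Process each daily capture in turn; each run writes one CSV of flows
# into CSV_DIR (assuming the upstream `cfm <input> <output>` interface).
for pcap in sorted(PCAP_DIR.glob("*.pcap")):
    subprocess.run([str(CFM_BIN), str(pcap), str(CSV_DIR)], check=True)
```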
After this, verify the TIME_DIFFERENCE, INPUT_DIR, OUTPUT_DIR and PAYLOAD_FILTER_ACTIVE attributes in the labelling_CSV_flows.py script, and then run it (no need to specify any command-line options). This will label all the flows in the CSV files generated by the CICFlowMeter tool.
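For illustration, the configuration block at the top of the script might look roughly like this; the values below are placeholders, so check the script itself for the actual defaults and semantics.

```python
# Placeholder values -- verify against labelling_CSV_flows.py itself.

# Offset (in hours) between the flow timestamps in the CSV files and
# the attack schedule used for labelling.
TIME_DIFFERENCE = 0

# Directory containing the 5 CSV files produced by CICFlowMeter.
INPUT_DIR = "csv_out/"

# Directory where the labelled CSV files will be written.
OUTPUT_DIR = "csv_labelled/"

# Toggles the payload-based filtering applied during labelling
# (see the script for its exact effect).
PAYLOAD_FILTER_ACTIVE = True
```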
Then, run the MakeDataNumpyFriendly.py script, which will convert the labelled CSV files into a single NumPy array.
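Conceptually, the conversion boils down to something like the sketch below (a simplification, not the script itself; it assumes the label column is named `Label`).

```python
import glob
import numpy as np
import pandas as pd

# Read every labelled CSV and concatenate them into one table.
frames = [pd.read_csv(path) for path in sorted(glob.glob("csv_labelled/*.csv"))]
data = pd.concat(frames, ignore_index=True)

# Encode the textual labels as integers, keep only numeric columns
# (dropping e.g. IPs and timestamps), and save one NumPy array.
data["Label"] = data["Label"].astype("category").cat.codes
numeric = data.select_dtypes(include="number")
np.save("cicids2017_flows.npy", numeric.to_numpy(dtype=np.float64))
```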
Note that, in our experiments, we chose to relabel all "Attempted" flows as BENIGN. If you wish to keep them separate, make sure to change the numerical labels in the convertToNumericalLabels(flows_list_of_dict) function.
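In other words, keeping them separate means giving each "Attempted" class its own numerical ID instead of reusing BENIGN's. A hypothetical excerpt of such a mapping (the real dictionary and the exact label strings live inside convertToNumericalLabels):

```python
# Hypothetical label mapping -- the actual dictionary and label strings
# are in convertToNumericalLabels(flows_list_of_dict).
LABEL_TO_ID = {
    "BENIGN": 0,
    "DoS Hulk": 1,
    # In our experiments this was mapped to 0 (BENIGN); assign a
    # distinct ID here to keep "Attempted" flows as their own class.
    "DoS Hulk - Attempted": 0,
}
```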
Finally, run the Benchmarking_RF.py script to perform benchmarking on the dataset using a Random Forest classifier. Random seeds and various other options can be specified in the first few lines of the script.
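For reference, a self-contained benchmark with a fixed seed might look roughly like this sketch (using scikit-learn; this is not the script itself, and it assumes the label sits in the last column of the array):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

SEED = 42  # fixing the seed makes the split and the forest reproducible

data = np.load("cicids2017_flows.npy")
X, y = data[:, :-1], data[:, -1]  # assumes the label is the last column

# Stratified hold-out split so every class appears in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=SEED, n_jobs=-1)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```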