This repository contains the code used for our paper. The code performs the labelling and benchmarking of the CICIDS 2017 dataset after it has been processed by our modified version of the CICFlowMeter tool.
Note that all of this is research code.
If you use the code in this repository, please cite our paper:
@inproceedings{engelen2021troubleshooting,
  title={Troubleshooting an Intrusion Detection Dataset: the CICIDS2017 Case Study},
  author={Engelen, Gints and Rimmer, Vera and Joosen, Wouter},
  booktitle={2021 IEEE Security and Privacy Workshops (SPW)},
  pages={7--12},
  year={2021},
  organization={IEEE}
}
Extended documentation for our paper can be found here.
First, head over to the website of the CICIDS 2017 dataset and download the raw version of the dataset (PCAP file format). There are 5 files in total, one for each day.
Then, run our modified version of the CICFlowMeter tool on the data obtained in the previous step:
This will generate 5 CSV files with the flows extracted from the raw PCAP files.
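If you prefer to script this step, a minimal driver could look like the sketch below. It assumes the modified tool keeps the upstream CICFlowMeter command-line interface (`cfm <input pcap> <output folder>`); the paths are placeholders.

```python
import subprocess
from pathlib import Path

# Placeholder paths: point these at your build of the modified
# CICFlowMeter and at the directory holding the five daily PCAPs.
CFM_BIN = Path("CICFlowMeter/bin/cfm")
PCAP_DIR = Path("pcaps")
CSV_DIR = Path("csv_out")

CSV_DIR.mkdir(exist_ok=True)

# Process each daily capture in turn; each run writes one CSV of flows
# into CSV_DIR (assuming the upstream `cfm <input> <output>` interface).
for pcap in sorted(PCAP_DIR.glob("*.pcap")):
    subprocess.run([str(CFM_BIN), str(pcap), str(CSV_DIR)], check=True)
```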
After this, verify the TIME_DIFFERENCE, INPUT_DIR, OUTPUT_DIR and PAYLOAD_FILTER_ACTIVE attributes in the labelling_CSV_flows.py script, and then run it (no need to specify any command-line options). This will label all the flows in the CSV files generated by the CICFlowMeter tool.
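For illustration, the configuration block at the top of the script might look roughly like this; the values below are placeholders, so check the script itself for the actual defaults and semantics.

```python
# Placeholder values -- verify against labelling_CSV_flows.py itself.

# Offset (in hours) between the flow timestamps in the CSV files and
# the attack schedule used for labelling.
TIME_DIFFERENCE = 0

# Directory containing the 5 CSV files produced by CICFlowMeter.
INPUT_DIR = "csv_out/"

# Directory where the labelled CSV files will be written.
OUTPUT_DIR = "csv_labelled/"

# Toggles the payload-based filtering applied during labelling
# (see the script for its exact effect).
PAYLOAD_FILTER_ACTIVE = True
```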
Then, run the MakeDataNumpyFriendly.py script, which will convert the labelled CSV files into a single NumPy array.
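Conceptually, the conversion boils down to something like the sketch below (a simplification, not the script itself; it assumes the label column is named `Label`).

```python
import glob
import numpy as np
import pandas as pd

# Read every labelled CSV and concatenate them into one table.
frames = [pd.read_csv(path) for path in sorted(glob.glob("csv_labelled/*.csv"))]
data = pd.concat(frames, ignore_index=True)

# Encode the textual labels as integers, keep only numeric columns
# (dropping e.g. IPs and timestamps), and save one NumPy array.
data["Label"] = data["Label"].astype("category").cat.codes
numeric = data.select_dtypes(include="number")
np.save("cicids2017_flows.npy", numeric.to_numpy(dtype=np.float64))
```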
Note that, in our experiments, we chose to relabel all "Attempted" flows as BENIGN. If you wish to keep them separate, make sure to change the numerical labels in the convertToNumericalLabels(flows_list_of_dict) function.
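In other words, keeping them separate means giving each "Attempted" class its own numerical ID instead of reusing BENIGN's. A hypothetical excerpt of such a mapping (the real dictionary and the exact label strings live inside convertToNumericalLabels):

```python
# Hypothetical label mapping -- the actual dictionary and label strings
# are in convertToNumericalLabels(flows_list_of_dict).
LABEL_TO_ID = {
    "BENIGN": 0,
    "DoS Hulk": 1,
    # In our experiments this was mapped to 0 (BENIGN); assign a
    # distinct ID here to keep "Attempted" flows as their own class.
    "DoS Hulk - Attempted": 0,
}
```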
Finally, run the Benchmarking_RF.py script to perform benchmarking on the dataset using a Random Forest classifier. Random seeds and various other options can be specified in the first few lines of the script.
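For reference, a self-contained benchmark with a fixed seed might look roughly like this sketch (using scikit-learn; this is not the script itself, and it assumes the label sits in the last column of the array):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

SEED = 42  # fixing the seed makes the split and the forest reproducible

data = np.load("cicids2017_flows.npy")
X, y = data[:, :-1], data[:, -1]  # assumes the label is the last column

# Stratified hold-out split so every class appears in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=SEED, n_jobs=-1)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```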