Can you talk about how you trained the model and the steps of training (model source)? Thank you.

@fangyongcheng the training notebooks aren't organized well enough to publish yet, but this is the workflow we followed:

Data cleaning

We noticed a lot of errors in our data. To start with, each column contained the value “Label”, as Figure 3 shows, so we deleted the rows with this value. Some columns had the same static value in every row (variance = 0), so we deleted all of these columns. The dataset also had NaN (not a number) values in some columns, and we deleted the rows containing them.
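For readers who want something concrete, here is a minimal pandas sketch of those three cleaning steps. It is not our notebook code: the function name `clean_flows`, the toy column names, and the assumption that the stray rows can be recognized by the literal string "Label" in the label column are all illustrative.

```python
import pandas as pd


def clean_flows(df: pd.DataFrame, label_col: str = "Label") -> pd.DataFrame:
    """Illustrative cleaning pass; names and defaults are assumptions."""
    # 1. Drop rows whose label cell holds the literal string "Label".
    df = df[df[label_col] != label_col]

    # Parse the feature columns as numbers; unparseable cells become NaN.
    features = df.drop(columns=[label_col]).apply(pd.to_numeric, errors="coerce")

    # 2. Drop columns that hold a single value in every row (variance = 0).
    features = features.loc[:, features.nunique(dropna=True) > 1]

    # 3. Drop rows that still contain NaN values.
    cleaned = pd.concat([features, df[label_col]], axis=1).dropna()
    return cleaned.reset_index(drop=True)


# Toy data with hypothetical column names.
raw = pd.DataFrame(
    {
        "Flow Duration": ["10", "20", "Label", "30", "40"],
        "Bwd PSH Flags": ["0", "0", "Label", "0", "0"],
        "Flow Byts/s": ["1.5", None, "Label", "2.5", "3.5"],
        "Label": ["Benign", "Bot", "Label", "Benign", "Bot"],
    }
)
cleaned = clean_flows(raw)
print(cleaned)
```

On the toy frame, the row of "Label" values, the constant column, and the row with a missing value are all removed.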

Data processing

We undersampled the dataset to make it smaller and more balanced. The original data is highly imbalanced, which would heavily affect our training, and it is too big to fit in any of the RAM we had available for training.
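One simple way to do this is random undersampling, where every class is cut down to at most a fixed number of rows. The sketch below assumes that strategy; the helper `undersample`, its `cap` parameter, and the toy class counts are hypothetical and not taken from our notebooks.

```python
import pandas as pd


def undersample(df, label_col="Label", cap=None, seed=0):
    """Randomly keep at most `cap` rows per class (illustrative helper)."""
    if cap is None:
        # Assumed default: shrink every class to the size of the rarest one.
        cap = int(df[label_col].value_counts().min())
    parts = [
        group.sample(n=min(len(group), cap), random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so the classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy imbalanced data: 90 benign flows and 10 bot flows.
toy = pd.DataFrame({"feature": range(100), "Label": ["Benign"] * 90 + ["Bot"] * 10})
balanced = undersample(toy)
print(balanced["Label"].value_counts())
```

Passing an explicit `cap` keeps more rows of the larger classes when a perfectly balanced set is not required.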

Data Splitting

We split the dataset into three sets: a training set, a validation (development) set, and a test set.

The training set is used only to train the model; we gave it 94% of the overall dataset.
The validation set is used to choose the model and to tune its hyperparameters.
The test set is meant to imitate real-world data. It is used at the end of the machine learning workflow to evaluate the model's performance, so the results published here are evaluated on the test set.
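A split like this can be sketched with scikit-learn's `train_test_split` called twice. Only the 94% training share comes from the description above. Dividing the remaining 6% equally between validation and test, stratifying on the label, and the synthetic data are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned, undersampled flows.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# First cut: 94% of the rows go to training.
n_train = round(0.94 * len(X))
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=n_train, stratify=y, random_state=0
)

# Second cut: the remaining rows are halved (assumed ratio) into
# a validation set and a test set.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying keeps the class proportions similar across the three sets, which matters when the validation and test sets are this small.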

Choosing a Model

Our approach divides the process into two stages. The first stage detects whether there is any intrusion in a flow; the second stage classifies the intrusion into the corresponding attack. The attacks are: DoS, DDoS, botnet, and brute force.
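To make the two stages concrete, here is a small scikit-learn sketch on synthetic data. The toy clusters, the variable names, and the helper `predict_two_stage` are illustrative assumptions; the real features and notebooks differ.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic flows: one well-separated cluster per class (toy data only).
classes = ["Benign", "DoS", "DDoS", "Botnet", "Brute force"]
rng = np.random.default_rng(0)
labels = np.repeat(classes, 100)
offsets = np.repeat(np.arange(len(classes)) * 10.0, 100)
X = offsets[:, None] + rng.normal(scale=0.5, size=(500, 2))

# Stage 1: binary detector, intrusion vs. benign, trained on all flows.
is_attack = labels != "Benign"
detector = DecisionTreeClassifier(random_state=0).fit(X, is_attack)

# Stage 2: attack-type classifier, trained on the attack flows only.
attack_classifier = DecisionTreeClassifier(random_state=0).fit(
    X[is_attack], labels[is_attack]
)


def predict_two_stage(X_new):
    """Run stage 1, then stage 2 only on the flows flagged as intrusions."""
    flagged = detector.predict(X_new).astype(bool)
    out = np.full(len(X_new), "Benign", dtype=object)
    if flagged.any():
        out[flagged] = attack_classifier.predict(X_new[flagged])
    return out


example = predict_two_stage(np.array([[0.0, 0.0], [20.0, 20.0]]))
print(example)
```

A side effect of this layout is that the second model only runs on flagged flows, so benign traffic costs a single prediction.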

We tried a variety of learning algorithms for the first stage. The table below summarizes the algorithms we chose and their scores on the validation set after hyperparameter tuning.

We chose the decision tree model because it had the highest score while keeping model complexity small and, as a result, the lowest inference time.
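The comparison itself follows a simple pattern: fit each candidate on the training set, score it on the validation set, and keep the best one. The sketch below shows that pattern only. The candidate list, the hyperparameters, the accuracy metric, and the synthetic data are assumptions for illustration and do not reproduce our tables.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem standing in for stage 1 (toy data only).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical candidates with hypothetical hyperparameters.
candidates = {
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=5),
}

# Fit on the training set and score (accuracy) on the validation set.
val_scores = {
    name: model.fit(X_train, y_train).score(X_val, y_val)
    for name, model in candidates.items()
}
best_name = max(val_scores, key=val_scores.get)
print(val_scores, best_name)
```

The test set stays out of this loop entirely, so the final numbers are not biased by the model choice.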

The second table covers the second stage: it summarizes the algorithms we chose and their scores on the validation set after hyperparameter tuning. We chose the decision tree model here as well.

bibs2091 / Anomaly-detection-system