bibs2091 / Anomaly-detection-system

Machine learning based Intrusion detection system (IDS)

Can you talk about how you train, the steps of training (model source), thank you #5

Closed fangyongcheng closed 3 years ago

fangyongcheng commented 3 years ago

Hello, author. Could you describe how you trained the models and what the training steps were (model source)? Thank you.

bibs2091 commented 3 years ago

@fangyongcheng the training notebooks aren't organized well enough to publish yet, but this is the workflow we followed:

Data cleaning

We noticed that there were a lot of errors in our data. First, some rows had the value “Label” in each column, as figure 3 shows; we deleted the rows with this value. Some columns had the same static value in every row (variance = 0); we deleted all of these columns. The dataset also had NaN (not a number) values in some columns; we deleted the rows containing them.
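The three cleaning rules above can be sketched in plain Python. This is only an illustration, not the authors' notebook code: the thread does not say which tooling was used, the list-of-dicts layout is an assumption, and the column names in the example (`Flow Duration`, `Bwd PSH Flags`) are hypothetical. NaN rows are dropped before the constant-column check here so that NaN values do not confuse the distinct-value count; the order in the original workflow may have differed.

```python
import math


def clean_rows(rows, label_key="Label"):
    """Drop stray header rows, rows with NaN, and zero-variance columns.

    rows: list of dicts mapping column name -> value (assumed layout).
    """
    # 1. Drop rows whose label cell holds the literal text "Label".
    rows = [r for r in rows if r[label_key] != label_key]

    # 2. Drop rows that contain a missing or NaN value.
    def has_nan(row):
        return any(
            v is None or (isinstance(v, float) and math.isnan(v))
            for v in row.values()
        )

    rows = [r for r in rows if not has_nan(r)]
    if not rows:
        return rows

    # 3. Drop feature columns with one distinct value (variance = 0).
    constant = {
        col
        for col in rows[0]
        if col != label_key and len({r[col] for r in rows}) == 1
    }
    return [{c: v for c, v in r.items() if c not in constant} for r in rows]


# Hypothetical toy rows, for illustration only.
raw = [
    {"Flow Duration": 10.0, "Bwd PSH Flags": 0, "Label": "Benign"},
    {"Flow Duration": "Flow Duration", "Bwd PSH Flags": "Bwd PSH Flags", "Label": "Label"},
    {"Flow Duration": float("nan"), "Bwd PSH Flags": 0, "Label": "DoS"},
    {"Flow Duration": 25.0, "Bwd PSH Flags": 0, "Label": "DoS"},
]
cleaned = clean_rows(raw)
```

In the toy input, the second row is a stray header, the third contains NaN, and `Bwd PSH Flags` is constant, so only two rows with the `Flow Duration` and `Label` columns remain.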

Data processing

We undersampled the dataset to make it smaller and more balanced. The original data is highly imbalanced, which would heavily affect training, and it is too big to fit in the RAM of any machine we had available for training.
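A minimal sketch of random undersampling follows. The thread does not say which undersampling strategy or class ratio was used, so the choices here are assumptions: rows are picked uniformly at random, and by default every class is cut down to the size of the rarest class. The function name, the `per_class` parameter and the fixed seed are all hypothetical.

```python
import random
from collections import defaultdict


def undersample(rows, label_key="Label", per_class=None, seed=0):
    """Randomly keep at most per_class rows of each class."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    if not by_class:
        return []

    # Assumed default: shrink every class to the size of the rarest one.
    if per_class is None:
        per_class = min(len(group) for group in by_class.values())

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = []
    for group in by_class.values():
        sampled.extend(rng.sample(group, min(per_class, len(group))))
    rng.shuffle(sampled)  # avoid leaving the rows grouped by class
    return sampled
```

Passing an explicit `per_class` caps each class at that many rows instead, which is one way to trade balance against dataset size.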

Data Splitting

We split the dataset into three sets: a training set, a validation (development) set and a test set.
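The three-way split could look like the sketch below. The split proportions are not given in the thread, so the 70/15/15 default is an assumption, as are the function name and the shuffle-then-slice approach (a stratified split would be another reasonable choice).

```python
import random


def split_dataset(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle rows and cut them into train, validation and test lists."""
    rows = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(rows)

    n_test = round(len(rows) * test_frac)
    n_val = round(len(rows) * val_frac)

    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test
```

The validation set is what the hyperparameter tuning and model comparison use; the test set stays aside for a final check.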

Choosing a Model

Our approach divides the process into two stages: the first detects whether there is any intrusion in a flow, and the second classifies the intrusion as the corresponding attack. The attacks are DoS, DDoS, botnet and brute force.
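The two-stage flow described above can be sketched as a small wrapper around two models. Everything named here is an assumption for illustration: the class name, the callable interface, the `"Benign"` label, and especially the two toy rule functions, which only stand in for trained models and do not reflect how the project's classifiers decide.

```python
class TwoStageDetector:
    """Stage 1 flags a flow as an intrusion; stage 2 names the attack."""

    def __init__(self, detector, attack_classifier, benign_label="Benign"):
        self.detector = detector                    # callable: flow -> bool
        self.attack_classifier = attack_classifier  # callable: flow -> str
        self.benign_label = benign_label

    def predict(self, flow):
        # Stage 1: stop early when no intrusion is detected.
        if not self.detector(flow):
            return self.benign_label
        # Stage 2: only flagged flows reach the attack classifier.
        return self.attack_classifier(flow)


# Placeholder rules standing in for trained models (hypothetical features).
def toy_detector(flow):
    return flow["packets_per_second"] > 1000


def toy_attack_classifier(flow):
    return "DDoS" if flow["source_count"] > 1 else "DoS"


ids = TwoStageDetector(toy_detector, toy_attack_classifier)
quiet = ids.predict({"packets_per_second": 20, "source_count": 1})
flood = ids.predict({"packets_per_second": 5000, "source_count": 40})
```

One consequence of this design is that the second model is trained and run only on flows that are flagged as intrusions, so it never has to separate benign traffic from attacks.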

We tried a variety of learning algorithms for the first stage. The table below summarizes the algorithms we chose and their scores on the validation set after hyperparameter tuning.

[image: table of first-stage model scores]

We chose the decision tree model because it had the highest score while keeping model complexity small, and therefore the lowest inference time.
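The selection rule (best validation score, preferring the simpler model) can be written down as a short function. This is a sketch of that reasoning only: the `tolerance` parameter, the `complexity` field and the candidate entries are hypothetical, and the numbers are placeholders, not the scores from the table.

```python
def pick_model(candidates, tolerance=0.0):
    """Return the simplest candidate whose score is within tolerance of the best.

    candidates: list of dicts with 'name', 'val_score' and 'complexity'
    (for example a node or parameter count); all keys are assumed names.
    """
    best_score = max(c["val_score"] for c in candidates)
    near_best = [
        c for c in candidates if best_score - c["val_score"] <= tolerance
    ]
    # Among near-best models, the smallest one should be the cheapest to run.
    return min(near_best, key=lambda c: c["complexity"])


# Placeholder values for illustration, not results from the project.
candidates = [
    {"name": "model_a", "val_score": 0.97, "complexity": 5000},
    {"name": "model_b", "val_score": 0.965, "complexity": 40},
    {"name": "model_c", "val_score": 0.90, "complexity": 10},
]
strict = pick_model(candidates)
relaxed = pick_model(candidates, tolerance=0.01)
```

With `tolerance=0.0` the rule reduces to "highest score wins"; a small positive tolerance lets a much smaller model win when the score gap is negligible.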

The second table covers the second stage. It summarizes the algorithms we chose and their scores on the validation set after hyperparameter tuning.

[image: table of second-stage model scores]

We also chose the decision tree model for this stage.