Johnsen92 / self-supervised-ids

Self-supervised machine learning network intrusion detection system
GNU General Public License v3.0

Learning #2

Open ThornsOoO opened 2 years ago

ThornsOoO commented 2 years ago

Hi Jonas, I recently started learning about intrusion detection and found this nice project while looking for self-supervised related work. But due to my lack of background in self-supervised learning, I got confused while running the code. Do you have any relevant papers or materials? In addition, I noticed that the results in the 'result' folder reach an accuracy of more than 99% on the CIC-IDS-2017 dataset. Does this code do supervised pre-training before self-supervised training?

Best wishes chang

Johnsen92 commented 2 years ago

Hi Chang,

I am currently finishing my Master thesis on the project. You can find the current version at https://github.com/Johnsen92/self-supervised-ids-thesis/blob/main/thesis.pdf. For the CIC-IDS-2017 dataset we achieved 99.796% accuracy with a three layered LSTM model (hidden_size = 512) and 99.658% accuracy with a 10-layered transformer encoder model (3 attention heads and forward expansion = 2) with supervised training on 90% of the dataset (10% for validation) and without pre-training.

In our experiments we performed self-supervised pre-training (with 80% of the dataset) before doing supervised training with little data (10% - 0.1% of the dataset), resulting in minor improvements in accuracy in some instances compared to a model which underwent supervised training with the same amount of data but without pre-training.

Best, Jonas

ThornsOoO commented 2 years ago

Hi, Jonas:

Thank you for your generous reply. To be honest, I am also a master's student, but there is quite a gap between us. I will read your thesis carefully over the next while and look forward to discussing it with you next time.

Sincerely chang

Johnsen92 commented 2 years ago

You are welcome.

No worries, you will get there. A year ago, when I started my thesis, I also knew very little about this topic and about machine learning in general. I look forward to your input. Keep in mind that the thesis is still a work in progress, so some things might still change over time.

Best, Jonas

ThornsOoO commented 2 years ago

Hi, Jonas:

I sent you an email two weeks ago (e1226597@student.tuwien.ac.at), but it didn't seem to arrive, so I'm sorry to disturb you on GitHub.

Recently, I tried using a GRU instead of an LSTM. The results of supervised training and self-supervision are close to those of the LSTM (pre-train 80%, train 1%, epoch = 10, proxy task = COMPOSITE).

The dataset I used before was CICIDS-2017.csv extracted by CIC-flow. In your thesis, I noticed that you use go-flow to extract feature vectors and then use adaptive-recurrent-IDs to obtain the .pickle file. So are the features extracted by go-flow better than those from CIC-flow?

In addition, in section 4.2 (data representation) you mention: "flows are represented as a single feature vector in the dataset, containing aggregated statistical data of the completed flow. The downside of this is that flows can be processed only once they are completed. In a real world scenario, this approach has major downsides since the IDS can only inspect communications in retrospect and never in real time. This was one of the reasons we decided to represent our flows as a sequence of packets instead of a single aggregated feature vector." In my understanding, because the feature vector of a flow (such as transmission time) must be extracted from the captured packets through go-flow or CIC-flow, the requirement of real-time inspection cannot be met. So how should the sentence "This was one of the reasons we decided to represent our flows as a sequence of packets instead of a single aggregated feature vector" be understood? Doesn't "flow" here refer to features after extraction? Why does this representation solve the problem of real-time inspection?

Finally, I wish you a happy New Year and good health!

Sincerely Chang

Johnsen92 commented 2 years ago

Hey Chang!

Recently, I tried using a GRU instead of an LSTM. The results of supervised training and self-supervision are close to those of the LSTM (pre-train 80%, train 1%, epoch = 10, proxy task = COMPOSITE).

Nice! Did you also try training it in a purely supervised fashion, i.e. doing supervised training with e.g. 90% of the dataset and validating with 10%? It would be interesting to see how well the GRU compares to the LSTM without pre-training, to have some baseline values.
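A minimal sketch of the GRU-for-LSTM swap discussed here, assuming a PyTorch setup like the repo's; the class, layer sizes, and tensor shapes below are illustrative, not the project's actual code:

```python
import torch
import torch.nn as nn

class RecurrentClassifier(nn.Module):
    """Binary flow classifier with a swappable recurrent core (LSTM or GRU)."""
    def __init__(self, input_size, hidden_size, num_layers, cell="lstm"):
        super().__init__()
        rnn_cls = nn.LSTM if cell == "lstm" else nn.GRU
        self.rnn = rnn_cls(input_size, hidden_size, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x: (batch, seq_len, features) -> per-step attack scores in [0, 1]
        out, _ = self.rnn(x)
        return torch.sigmoid(self.head(out))

x = torch.randn(8, 100, 15)  # 8 flows, 100 packets, 15 features per packet
scores = {cell: RecurrentClassifier(15, 32, 2, cell=cell)(x)
          for cell in ("lstm", "gru")}
```

Because nn.LSTM and nn.GRU share the same constructor and input/output conventions in PyTorch, the swap is a one-line change; only the hidden-state tuple (which the sketch discards) differs.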

I sent you an email two weeks ago (e1226597@student.tuwien.ac.at), but it didn't seem to arrive, so I'm sorry to disturb you on GitHub.

No worries, GitHub is the most reliable way to reach me.

The dataset I used before was CICIDS-2017.csv extracted by CIC-flow. In your thesis, I noticed that you use go-flow to extract feature vectors and then use adaptive-recurrent-IDs to obtain the .pickle file. So are the features extracted by go-flow better than those from CIC-flow?

Hard to tell. It might be that for this specific case they are better suited.

Doesn't flow here refer to some features after extraction?

Flow refers to a grouping of packets which share certain characteristics. A common way of grouping packets into flows is over the tuple <srcIP, dstIP, srcPort, dstPort, protocol>. After grouping the packets into flows, there are still multiple possible representations of said flows: one popular representation is a single feature vector of aggregated statistical data (like transfer time, number of packets, average interarrival time, ...) per flow. But a flow can also be represented as a sequence of packets, with one feature vector per packet. Theoretically, the model can derive all the aggregated statistical information of the flow feature vector directly from the packet sequence. E.g. if it knows the interarrival times between each two packets in the sequence, it can calculate the total transmission time of the flow, if it considers this information valuable.

With the former representation, it is harder to do real-time inspection, since you can only extract a meaningful feature vector after the flow has been fully transmitted. With the latter representation, you can inspect the packets one after another in "real time". The first statement is not entirely correct, since theoretically you could calculate an aggregated feature vector for an incomplete flow and just re-calculate it every time a new packet arrives. But since a lot of calculations are usually involved in extracting all the relevant statistical data of a flow, I imagine this is not really feasible given the limited computational resources of IDS hardware.
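The 5-tuple grouping and the two flow representations described above can be sketched in plain Python; the packet records and the chosen features are made up for illustration:

```python
from collections import defaultdict

# Hypothetical packet records: (src_ip, dst_ip, src_port, dst_port, proto, ts, bytes)
packets = [
    ("10.0.0.1", "10.0.0.2", 1234, 80, "TCP", 0.00, 60),
    ("10.0.0.1", "10.0.0.2", 1234, 80, "TCP", 0.05, 1500),
    ("10.0.0.3", "10.0.0.2", 4321, 443, "TCP", 0.01, 60),
    ("10.0.0.1", "10.0.0.2", 1234, 80, "TCP", 0.30, 800),
]

# Group packets into flows over the <srcIP, dstIP, srcPort, dstPort, protocol> tuple
flows = defaultdict(list)
for p in packets:
    flows[p[:5]].append(p)

agg_features = {}  # representation 1: one aggregated vector per completed flow
seq_features = {}  # representation 2: one feature vector per packet
for key, pkts in flows.items():
    ts = [p[5] for p in pkts]
    # (packet count, total duration, total bytes) -- needs the whole flow
    agg_features[key] = (len(pkts), round(max(ts) - min(ts), 6),
                         sum(p[6] for p in pkts))
    # (interarrival time, size) per packet -- computable as each packet arrives
    seq, prev = [], ts[0]
    for p in pkts:
        seq.append((round(p[5] - prev, 6), p[6]))
        prev = p[5]
    seq_features[key] = seq
```

Note how the per-packet sequence can be built incrementally while the flow is still running, whereas the aggregated vector is only final once the flow completes.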

I hope I answered your questions.

Best, Jonas

ThornsOoO commented 2 years ago

Hi, Jonas:

Thanks for your reply.

Supervised training with 90% of the dataset is a bit time-consuming; I'll try it tonight. The current supervised training parameters are: train 10%, val 10%, epoch 50. Results: LSTM: acc 99.653%, precision 99.630%, loss 0.0252; GRU: acc 99.57%, precision 99.54%, loss 0.0247.

For data processing, my understanding is: first, software similar to Wireshark captures the traffic into a .pcap file; then the packets in the .pcap are analyzed for traffic characteristics over a period of time to obtain the feature vectors. The "real-time" detection in the thesis is actually the continuous analysis of data packets, right? In that case, it is really demanding for IDS hardware devices.

In addition, there is a small problem. The dataset is in .pickle format, which is not easy to view. I tried to convert it to CSV files, but maybe my conversion code is too direct, and the result cannot be opened. Did you encounter similar situations during the experiments? Thank you very much.

Sincerely Chang

Johnsen92 commented 2 years ago

Hi Chang,

Supervised training with 90% of the dataset is a bit time-consuming; I'll try it tonight. The current supervised training parameters are: train 10%, val 10%, epoch 50. Results: LSTM: acc 99.653%, precision 99.630%, loss 0.0252; GRU: acc 99.57%, precision 99.54%, loss 0.0247.

Nice! With the CIC-IDS-2017 dataset, I assume? That's already very high.

For data processing, my understanding is: first, software similar to Wireshark captures the traffic into a .pcap file; then the packets in the .pcap are analyzed for traffic characteristics over a period of time to obtain the feature vectors. The "real-time" detection in the thesis is actually the continuous analysis of data packets, right? In that case, it is really demanding for IDS hardware devices.

Correct, but you could of course also analyze the traffic immediately after capturing it, in "real time". Ultimately, a NIDS is of little use if it can only detect attacks in retrospect. Resources are always a constraint for an IDS, but it is not unthinkable to have a NN model running directly on some node of the network, analyzing packets as they come in. There are already some working examples available, e.g. https://arxiv.org/abs/1802.09089

In addition, there is a small problem. The dataset is in .pickle format, which is not easy to view. I tried to convert it to CSV files, but maybe my conversion code is too direct, and the result cannot be opened. Did you encounter similar situations during the experiments? Thank you very much.

Did you convert the whole dataset to CSV? That's going to be a huge file; maybe your editor has trouble opening it? You might want to export only a fraction of the dataset so you can look at it in a table editor with reasonable response times. Also: each flow in the dataset is a sequence of packets, so displaying them in a table is actually quite tricky. The dimensions of the input data are: N flows x M packets x L features.
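One way to export just a fraction, sketched with the Python standard library: flatten the first few flows to one row per packet. The synthetic nested list below stands in for the real .pickle file, whose exact layout may differ:

```python
import csv
import pickle

# Synthetic stand-in for the dataset: N flows x M packets x L features
# (the actual .pickle layout in the repo may differ -- adjust loading accordingly)
dataset = [[[float(f + 10 * p) for f in range(3)] for p in range(4)]
           for _ in range(1000)]
with open("dataset.pickle", "wb") as fh:
    pickle.dump(dataset, fh)

with open("dataset.pickle", "rb") as fh:
    flows = pickle.load(fh)

# Export only the first few flows, one row per packet, so a table
# editor can open the file with reasonable response times.
n_preview = 5
with open("preview.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["flow_id", "packet_idx"] +
                    [f"feat_{i}" for i in range(len(flows[0][0]))])
    for flow_id, flow in enumerate(flows[:n_preview]):
        for packet_idx, features in enumerate(flow):
            writer.writerow([flow_id, packet_idx] + list(features))
```

Adding explicit flow_id and packet_idx columns is one way around the "sequence of packets in a flat table" problem: the 3-D structure is recoverable from the two index columns.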

Best, Jonas

ThornsOoO commented 2 years ago

Hi, Jonas:

Thank you for your generous reply.

Yes, the last dataset I used was CIC-IDS-2017, and I wonder how to highlight the effect of self-supervised learning. So in the last two days, I tried the UNSW-NB15 dataset. With supervised training, the validated precision and F1-score are only around 0.82. I tried reducing the proportion of normal samples, but the results did not improve. So I'm going to change and add features of the dataset, but it's a little difficult at first.

While adapting UNSW-NB15, I noticed the definition of accuracy in statistics.py. During debugging, the shapes of both labels and categories are [4096, 100, 1], the input data is [4096, 100, 15], and the output is [4096, 100, 1], but the shapes of predicted and targets are [4096] (I set val_batch_size = 4096). So does the model predict the attack type of a flow?

In addition, I noticed that the accuracy is calculated by adding the correct predictions of all categories together, so will this accuracy be lower than the actual binary-classification accuracy?

Sincerely Chang

Johnsen92 commented 2 years ago

Hi Chang,

Yes, the last dataset I used was CIC-IDS-2017, and I wonder how to highlight the effect of self-supervised learning.

I am wondering that myself. I tried a lot of things, but none seem to really work in this context. Maybe there is no real effect in this setting, because the datasets are too easy to classify, or because our approach doesn't work.

While adapting UNSW-NB15, I noticed the definition of accuracy in statistics.py. During debugging, the shapes of both labels and categories are [4096, 100, 1], the input data is [4096, 100, 15], and the output is [4096, 100, 1], but the shapes of predicted and targets are [4096] (I set val_batch_size = 4096). So does the model predict the attack type of a flow?

No, the model only predicts whether a flow is an attack or not, but in the statistics (ClassStats) we still track which categories are classified with higher accuracy than others. The shape of the predicted data is [4096] because we only look at the last stage (the output of the LSTM after processing the last data token): at this stage the model has the most information and is therefore likely the most accurate. This differs for the transformer model, though, since there is no chronological processing of data tokens.
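The "last stage only" selection can be sketched in PyTorch; the tensor sizes and the per-flow lengths below are illustrative assumptions, not the repo's actual statistics.py code:

```python
import torch

# Toy per-step outputs of a recurrent model: (batch, seq_len, 1) attack scores
outputs = torch.rand(4, 5, 1)

# Number of valid packets in each flow before padding (assumed known)
lengths = torch.tensor([5, 3, 4, 1])

# Take the score at the LAST valid step of each sequence: at that point
# the model has seen the whole flow, so the prediction is most informed.
idx = (lengths - 1).view(-1, 1, 1)       # (batch, 1, 1) gather index
last = outputs.gather(1, idx).squeeze()  # (batch,) final-step scores
predicted = (last > 0.5).long()          # binary attack / no-attack decision
```

With equal-length (or fully padded-through) sequences this reduces to `outputs[:, -1, 0]`; the gather form also handles variable-length flows.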

In addition, I noticed that the accuracy is calculated by adding the correct predictions of all categories together, so will this accuracy be lower than the actual binary-classification accuracy?

We only did binary classification, therefore we add all prediction results together for the final accuracy.
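A small numeric sketch of why summing per-category correct counts gives exactly the overall binary accuracy, not a lower macro-style average (the categories and predictions below are made up):

```python
# Binary predictions vs. targets, with an attack-category label per flow
# (category names are illustrative, not the repo's actual labels)
predicted = [1, 0, 1, 1, 0, 0]
targets   = [1, 0, 0, 1, 0, 1]
category  = ["dos", "benign", "benign", "portscan", "benign", "dos"]

# Per-class bookkeeping, as ClassStats-style tracking might do it
correct_per_class = {}
total_per_class = {}
for p, t, c in zip(predicted, targets, category):
    total_per_class[c] = total_per_class.get(c, 0) + 1
    correct_per_class[c] = correct_per_class.get(c, 0) + (p == t)

# Every sample belongs to exactly one category, so summing the correct
# counts over all categories partitions the overall correct count exactly.
acc = sum(correct_per_class.values()) / len(targets)
```

The per-class totals still let you report which attack categories are classified more accurately, without changing the overall number.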

Best, Jonas

ThornsOoO commented 2 years ago

Hi, Jonas:

As for the effect of self-supervision, I recently tried using a model pre-trained on one dataset via the proxy task to train on another dataset (0.1% - 1% train), which gives an overall improvement of 3% - 10%. The result is good, but I don't know how to explain it. Is it because different datasets expand the learning breadth of pre-training? If so, would it be better to use more datasets?

Sincerely Chang