""" https://github.com/frenky-strasak/My_bachelor_thesis """
My bachelor thesis is about detecting HTTPS malware with machine learning. The main focus is on SSL communication, which poses many new challenges for detecting this behaviour. The thesis is related to the Stratosphere IPS project and to Sebastian Garcia, who leads that project (https://stratosphereips.org/).
This repository contains all the code for my thesis. At the moment it holds the first project, 'Features_evaluating', which goes through data sets and creates features for machine learning. The second project will be a machine learning algorithm (neural network, SVM, regression, ...).
The goal of this thesis is to find new features and techniques that help to detect malware.
This project evaluates features: it goes through data sets and computes features from them.
What is a dataset? I define two types of data sets: 'single dataset' and 'multi dataset'.
A 'single dataset' is a folder which has to contain the following items:
pcap file: You can get it from Wireshark, which can store your internet traffic as packets. https://www.wireshark.org/
binetflow file: Argus can convert the pcap file into a binetflow file. It contains only flows, without the sensitive payload data. http://qosient.com/argus/
Bro folder: It contains several log files, each describing one level of the internet traffic (conn.log, dns.log, ssl.log, ...). There is no sensitive data. The folder is created by Bro, which also processes the pcap file. https://www.bro.org/
So each folder that contains these three items (pcap file, binetflow file, Bro folder) is a 'single dataset'.
A 'multi dataset' is a folder containing at least one 'single dataset'.
Usually, when you capture internet traffic, you record just one type of connection (one IP connects somewhere), which is not enough. So we need many of these single data sets, both malware and normal, and we evaluate them together.
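The two dataset types can be checked with a few lines of Python. This is only a minimal sketch, not the code used in 'Features_evaluating'. It assumes that the files are recognised by the suffixes '.pcap' and '.binetflow' and that the Bro logs sit in a subfolder named 'bro'; those names, and the helper names 'is_single_dataset' and 'single_datasets_in', are hypothetical.

```python
from pathlib import Path
from typing import List


def is_single_dataset(folder: Path) -> bool:
    """Return True if the folder holds a pcap file, a binetflow file and a Bro folder.

    Assumption: suffixes '.pcap' / '.binetflow' and a subfolder called 'bro'.
    """
    if not folder.is_dir():
        return False
    suffixes = {p.suffix for p in folder.iterdir() if p.is_file()}
    has_bro_folder = (folder / "bro").is_dir()
    return ".pcap" in suffixes and ".binetflow" in suffixes and has_bro_folder


def single_datasets_in(multi_folder: Path) -> List[Path]:
    """List every 'single dataset' found directly inside a 'multi dataset' folder."""
    return sorted(p for p in multi_folder.iterdir() if is_single_dataset(p))
```

A folder would then count as a 'multi dataset' when 'single_datasets_in(folder)' returns a non-empty list.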
Inventing new features is the main goal of this project. In the end we should use the best of them.
List of current modules for features:
State of connection:
states: S0, S1, SF, REJ, S2, S3, RSTO, RSTR, RSTOS0, RSTRH, SH, SHR, OTH
module for creating plot data: 'create_plot_data_file_2()' in 'EvaluateData.py'
script for plotting: 'ShowFigureBar.py'
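To illustrate what this feature measures, here is a small sketch that counts how often each state occurs in a Bro conn.log. It is not the implementation from 'EvaluateData.py'. It relies on the standard Bro log layout, where a '#fields' header line names the tab-separated columns and the state is stored in the 'conn_state' column; the function name 'count_conn_states' is hypothetical.

```python
from collections import Counter
from typing import Iterable

# The thirteen connection states listed above.
STATES = ["S0", "S1", "SF", "REJ", "S2", "S3", "RSTO", "RSTR",
          "RSTOS0", "RSTRH", "SH", "SHR", "OTH"]


def count_conn_states(lines: Iterable[str]) -> Counter:
    """Count the 'conn_state' values found in the lines of a Bro conn.log."""
    counts = Counter({state: 0 for state in STATES})
    column = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            # Header row: '#fields<TAB>ts<TAB>uid<TAB>...'; drop the '#fields' tag.
            column = line.split("\t")[1:].index("conn_state")
        elif line and not line.startswith("#") and column is not None:
            values = line.split("\t")
            if len(values) > column:
                counts[values[column]] += 1
    return counts
```

It can be used with an open file, for example 'with open("conn.log") as log: counts = count_conn_states(log)'.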
First of all, you should set up the configuration file. There are two values:
There are 2 options: 'Main_single.py' and 'Main_multi.py'.
'Main_single.py' first takes your argument, which is the name of the result file. If there is no argument, the result data file gets the default name 'new_plot_data.txt'. Next it looks into the config file for 'path_to_single_dataset', which is the path to the 'single dataset'.
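The start-up logic described here could look roughly like the sketch below. The format of the config file is not documented here, so the 'key = value' layout parsed by 'parse_config' is purely an assumption for this example, and both function names are hypothetical.

```python
from typing import Dict, List

DEFAULT_RESULT_NAME = "new_plot_data.txt"


def result_name(argv: List[str]) -> str:
    """Return the first command-line argument, or the default file name."""
    return argv[1] if len(argv) > 1 else DEFAULT_RESULT_NAME


def parse_config(text: str) -> Dict[str, str]:
    """Parse 'key = value' lines; blank lines and '#' comments are skipped.

    Assumption: this simple layout is made up for the example.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values
```

In a script, the name would come from 'result_name(sys.argv)' and the dataset path from 'parse_config(...)["path_to_single_dataset"]'.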
Then the evaluation starts. First it reads the binetflow file and takes all flows that have the malware label ('Botnet'). Next it goes into the Bro folder for logs such as conn.log and ssl.log, which contain all the flows we use. It then evaluates them and computes the current features.
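The first of these steps, picking the malware flows out of the binetflow file, can be sketched as follows. The sketch assumes that the binetflow file is comma-separated, starts with a header row and has a 'Label' column; check this against your own files. The function name 'malware_flows' is hypothetical and this is not the code from the repository.

```python
import csv
from typing import Dict, Iterable, List


def malware_flows(lines: Iterable[str], marker: str = "Botnet") -> List[Dict[str, str]]:
    """Return the flows whose 'Label' column contains the marker text.

    Assumption: comma-separated input with a header row that names 'Label'.
    """
    reader = csv.DictReader(lines)
    return [row for row in reader if marker in (row.get("Label") or "")]
```

The returned rows keep all other columns, so they can later be matched against the flows in conn.log and ssl.log.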
The last step is creating the plot data file. This file is located in the 'PlotData' directory.
'Main_multi.py' works the same as above, but the evaluation is done for each 'single dataset'. So the resulting plot data file contains data from all 'single datasets'.
Once you have run one of these Main files and it has created the resulting 'plot data file', you can plot it with the scripts in the 'Plotdata' directory.
There are several plotting scripts, and each of them plots something different. Which one to use depends on which feature you computed and which 'plot data file' you created.
Example: The first evaluated feature is 'State of connection'. To plot this feature, call: 'python ShowFigureBar.py name_of_resulting_plot_data.txt'. This command should show a chart with the 'State of connection' data from the dataset(s).
The charts are displayed with the matplotlib library: http://matplotlib.org/