Closed diego-lopez8 closed 4 months ago
@diego-lopez8 @olive-jy-song @Zihang-Xia
Hi,
I've been considering the approach of building multiple models for each log type (dns.log, ssh.log, ssl.log, dhcp.log). After rereading the paper, I believe using KitNET could be a more efficient alternative for the following reasons:
KitNET operates as an ensemble of small (3-layer) autoencoders. It takes in multiple features, identifies groups of correlated features, and feeds each group into its own autoencoder. Each autoencoder produces an anomaly score (its reconstruction RMSE), and these scores are then combined by an output autoencoder to produce the final RMSE score.
Our current proposed method aligns with KitNET's approach of building a model for each log and generating anomaly scores. However, KitNET offers built-in feature mapping functions and the ability to generate more small autoencoders, potentially enhancing our model.
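To make the feature-mapping idea concrete, here is a rough sketch of grouping correlated features under a size cap (like the max_size_ae parameter in our training logs). This is a simplified greedy stand-in that I wrote for illustration, not KitNET's actual clustering code; the function name group_correlated_features is hypothetical.

```python
import numpy as np

def group_correlated_features(X, max_size):
    """Greedy grouping of columns by correlation distance.

    Simplified illustration only: KitNET builds its feature map with
    hierarchical clustering, this just fills each group with the
    remaining features closest to a seed feature.
    """
    corr = np.nan_to_num(np.corrcoef(X, rowvar=False))  # constant columns yield NaN
    dist = 1.0 - corr  # correlation distance: 0 means perfectly correlated
    unassigned = list(range(X.shape[1]))
    groups = []
    while unassigned:
        seed = unassigned.pop(0)
        group = [seed]
        # Order the remaining features by distance to the seed feature.
        unassigned.sort(key=lambda j: dist[seed, j])
        while unassigned and len(group) < max_size:
            group.append(unassigned.pop(0))
        groups.append(group)
        unassigned.sort()  # restore index order before picking the next seed
    return groups

# Toy data: columns 0/1 are strongly correlated, as are columns 2/3.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
X = np.column_stack([
    a,
    a + 0.01 * rng.normal(size=500),
    b,
    b + 0.01 * rng.normal(size=500),
])
groups = group_correlated_features(X, max_size=2)
```

Each resulting group would then be fed to its own small autoencoder, so merging the four data frames just gives the mapper more columns to group.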
My plan is to continue with the preprocessing part and merge all four data frames. I'd love to hear your thoughts on this approach!
The features for various Zeek logs
https://corelight.com/hubfs/resources/zeek-cheatsheets/corelight-cheatsheet-poster.pdf
Planning to finish 2 of them this week
DNS and HTTP
For the domain request field in dns.log, do a top-10 encoding with an implicit "other" column, i.e. don't create the "other" column explicitly.
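A minimal sketch of what that encoding could look like with pandas, assuming the domain lives in a column called query (the standard Zeek dns.log field name). The helper name top_k_one_hot and the query prefix are hypothetical. Any domain outside the top k becomes an all-zero row, which is the implicit "other".

```python
import pandas as pd

def top_k_one_hot(series, k=10, prefix="query"):
    """One-hot encode the k most frequent values; all other values map to an all-zero row."""
    top = series.value_counts().nlargest(k).index
    # Values outside `top` become NaN in the categorical, so get_dummies
    # emits zeros for them instead of a dedicated "other" column.
    cat = pd.Series(pd.Categorical(series, categories=top), index=series.index)
    return pd.get_dummies(cat, prefix=prefix, dtype=int)

# Toy example with k=2 so the result is easy to inspect.
queries = pd.Series(["a.com", "a.com", "b.com", "b.com", "a.com", "c.com"])
encoded = top_k_one_hot(queries, k=2)
```

At inference time the top-k list should probably be saved from training and reused, rather than recomputed from the new data, so the columns stay in the same order.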
Please help in debugging SSH.log
(tf) diego@Troys-MacBook-Pro-2 NIDS % python3 train.py --log-dir /opt/homebrew/var/logs --modules SSH
2024-04-25 15:43:18,975 - root - INFO - Using Modules ['SSH'] (train.py)
2024-04-25 15:43:18,975 - root - INFO - Using logdir: /opt/homebrew/var/logs (train.py)
2024-04-25 15:43:18,975 - root - INFO - Using Parameters - max_size_ae: 30, grace_feature_mapping: 5000, grace_anomaly_detector: 50000, learning_rate: 0.001, hidden_ratio: 0.5 (train.py)
2024-04-25 15:43:18,976 - root - INFO - Checking /opt/homebrew/var/logs/2024-02-07 (train.py)
2024-04-25 15:43:18,976 - root - INFO - Checking /opt/homebrew/var/logs/2024-02-09 (train.py)
2024-04-25 15:43:18,978 - root - INFO - Checking /opt/homebrew/var/logs/2024-02-08 (train.py)
2024-04-25 15:43:18,980 - root - INFO - Checking /opt/homebrew/var/logs/2024-03-08 (train.py)
2024-04-25 15:43:18,981 - root - INFO - Checking /opt/homebrew/var/logs/2024-03-01 (train.py)
2024-04-25 15:43:18,982 - root - INFO - Checking /opt/homebrew/var/logs/2024-03-06 (train.py)
2024-04-25 15:43:18,984 - root - INFO - Checking /opt/homebrew/var/logs/2022-09-24 (train.py)
2024-04-25 15:43:18,985 - root - INFO - Checking /opt/homebrew/var/logs/2024-03-07 (train.py)
2024-04-25 15:43:18,986 - root - INFO - Checking /opt/homebrew/var/logs/2024-02-24 (train.py)
2024-04-25 15:43:18,987 - root - INFO - Checking /opt/homebrew/var/logs/2024-02-23 (train.py)
2024-04-25 15:43:18,989 - root - INFO - Checking /opt/homebrew/var/logs/2024-02-12 (train.py)
2024-04-25 15:43:18,990 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-12/ssh.00:05:28-01:11:26.log.gz (train.py)
Traceback (most recent call last):
  File "/Users/diego/miniforge3/envs/tf/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3802, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'auth_success'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/diego/Projects/NIDS/NIDS/train.py", line 238, in <module>
    main()
  File "/Users/diego/Projects/NIDS/NIDS/train.py", line 207, in main
    np_arr = preprocess_json_ssh(json_data_file)
  File "/Users/diego/Projects/NIDS/NIDS/utils.py", line 236, in preprocess_json_ssh
    df['auth_success'] = df['auth_success'].replace({False: 0, True: 1})
  File "/Users/diego/miniforge3/envs/tf/lib/python3.9/site-packages/pandas/core/frame.py", line 3807, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/diego/miniforge3/envs/tf/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3804, in get_loc
    raise KeyError(key) from err
KeyError: 'auth_success'
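The KeyError means the data frame built from that ssh.log file has no auth_success column at all. As far as I know Zeek treats this field as optional, so a JSON log can omit it when no connection in the file reached an authentication result. Below is a minimal sketch of a defensive encoding. The helper name encode_auth_success and the -1 sentinel for "unknown" are assumptions of mine, and this is not necessarily the fix that ended up in the repo.

```python
import pandas as pd

def encode_auth_success(df, missing_value=-1):
    """Map auth_success to 1/0, using `missing_value` when the field is absent or null."""
    out = df.copy()
    if "auth_success" in out.columns:
        col = out["auth_success"]
    else:
        # Column missing from this log file: treat every row as unknown.
        col = pd.Series([None] * len(out), index=out.index, dtype=object)
    # Values not in the mapping (None/NaN) become NaN, then the sentinel.
    out["auth_success"] = (
        col.map({True: 1, False: 0}).fillna(missing_value).astype(int)
    )
    return out

# One frame without the column, one with a mix of values.
no_column = encode_auth_success(pd.DataFrame({"uid": ["C1", "C2"]}))
mixed = encode_auth_success(
    pd.DataFrame({"uid": ["C1", "C2", "C3"], "auth_success": [True, False, None]})
)
```

Whether unknown should be -1, 0, or a separate indicator column is a modeling choice; the point is only that the preprocessing should not assume the column exists.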
@diego-lopez8 Debugged.
(base) Zoe@Zoes-MBP NIDS % python train.py --log-dir /usr/local/logs --modules SSH
2024-05-22 09:13:32,968 - root - INFO - Using Modules ['SSH'] (train.py)
2024-05-22 09:13:32,968 - root - INFO - Using logdir: /usr/local/logs (train.py)
2024-05-22 09:13:32,968 - root - INFO - Using Parameters - max_size_ae: 30, grace_feature_mapping: 5000, grace_anomaly_detector: 50000, learning_rate: 0.001, hidden_ratio: 0.5 (train.py)
2024-05-22 09:13:32,968 - root - INFO - Checking /usr/local/logs/2024-02-09 (train.py)
2024-05-22 09:13:32,969 - root - INFO - Checking /usr/local/logs/2024-02-12 (train.py)
2024-05-22 09:13:32,972 - root - INFO - Checking /usr/local/logs/2024-02-11 (train.py)
2024-05-22 09:13:32,975 - root - INFO - Checking /usr/local/logs/2024-02-10 (train.py)
2024-05-22 09:13:32,978 - root - INFO - Model is saved successfully as ssh_kit.joblib. (train.py)
Awesome! Can you make a PR to main for this fix so we can close the ticket?
sure, will do it tomorrow!
@zoe70416 was a PR made?
merged
Create preprocessing functions, plus a separate model training flow and inference script, for the 4 new heterogeneous data sources.