Make preprocessing functions for DNS.log, SSH.log, HTTP.log and SSL.log

diego-lopez8 commented 8 months ago

Make preprocessing functions, plus a separate model training flow and inference script, for the 4 new heterogeneous data sources:

dns.log
ssl.log
http.log
ssh.log

zoe70416 commented 8 months ago

@diego-lopez8 @olive-jy-song @Zihang-Xia

Hi,

I've been considering the approach of building a separate model for each log type (dns.log, ssh.log, ssl.log, dhcp.log). After rereading the paper, I believe using KitNET could be a more efficient alternative for the following reasons:

KitNET operates as an ensemble of small (3-layer) autoencoders. It takes in multiple features, identifies correlated groups of features, and feeds each group into its own autoencoder. These autoencoders generate anomaly scores, which are then combined by another autoencoder to produce the final RMSE score.
Our current proposed method aligns with KitNET's approach of building a model for each log and generating anomaly scores. However, KitNET offers built-in feature mapping functions and the ability to generate more small autoencoders, potentially enhancing our model.
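
To make the data flow described above concrete, here is a minimal NumPy sketch of the idea, not KitNET's actual code. The class names `TinyAutoencoder` and `TinyEnsemble` are hypothetical, and the feature groups are hard-coded for illustration, whereas KitNET learns its feature mapping from the data. Inputs are assumed to be already scaled to [0, 1].

```python
import numpy as np


class TinyAutoencoder:
    """Hypothetical 3-layer (input, hidden, output) autoencoder trained online with SGD."""

    def __init__(self, n_in, hidden_ratio=0.5, learning_rate=0.1, seed=0):
        rng = np.random.default_rng(seed)
        n_hidden = max(1, int(np.ceil(n_in * hidden_ratio)))
        self.W1 = rng.uniform(-0.5, 0.5, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.uniform(-0.5, 0.5, (n_hidden, n_in))
        self.b2 = np.zeros(n_in)
        self.lr = learning_rate

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def _forward(self, x):
        h = self._sigmoid(x @ self.W1 + self.b1)
        y = self._sigmoid(h @ self.W2 + self.b2)
        return h, y

    def score(self, x):
        """Reconstruction RMSE for one sample; no weights are updated."""
        _, y = self._forward(x)
        return float(np.sqrt(np.mean((x - y) ** 2)))

    def train(self, x):
        """One SGD step on squared reconstruction error; returns the RMSE before the update."""
        h, y = self._forward(x)
        err = y - x
        delta_out = err * y * (1 - y)                      # gradient at output pre-activation
        delta_hid = (delta_out @ self.W2.T) * h * (1 - h)  # backpropagated to hidden layer
        self.W2 -= self.lr * np.outer(h, delta_out)
        self.b2 -= self.lr * delta_out
        self.W1 -= self.lr * np.outer(x, delta_hid)
        self.b1 -= self.lr * delta_hid
        return float(np.sqrt(np.mean(err ** 2)))


class TinyEnsemble:
    """One autoencoder per feature group, plus an output autoencoder over their RMSEs."""

    def __init__(self, feature_groups):
        # feature_groups is fixed here by assumption; KitNET derives it from the data.
        self.groups = [np.asarray(g) for g in feature_groups]
        self.members = [TinyAutoencoder(len(g), seed=i) for i, g in enumerate(self.groups)]
        self.output = TinyAutoencoder(len(self.groups), seed=len(self.groups))

    def train(self, x):
        scores = np.array([ae.train(x[g]) for ae, g in zip(self.members, self.groups)])
        return self.output.train(scores)

    def score(self, x):
        scores = np.array([ae.score(x[g]) for ae, g in zip(self.members, self.groups)])
        return self.output.score(scores)
```

With this layout, `TinyEnsemble([[0, 1], [2, 3, 4]])` would route features 0 and 1 to one autoencoder and features 2 to 4 to another, and `score(x)` returns the final RMSE from the output autoencoder. Merging the preprocessed logs into one feature vector would let a real feature mapper decide the grouping instead of fixing one model per log.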

My plan is to continue with the preprocessing part and merge all four data frames. I'd love to hear your thoughts on this approach!

diego-lopez8 commented 8 months ago

The features for various Zeek logs

https://corelight.com/hubfs/resources/zeek-cheatsheets/corelight-cheatsheet-poster.pdf

Planning to finish 2 of them this week: DNS and HTTP

diego-lopez8 commented 8 months ago

For the domain request in dns.log, do a top-10 encoding with an implicit "other" column, i.e. don't create that column explicitly.
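
A minimal pandas sketch of that encoding, under a few assumptions: the helper name `top_k_one_hot` is hypothetical, and the `query` prefix assumes the domain is held in a column of that name. Values outside the top k get no column of their own and show up as an all-zero row, which is the implicit "other".

```python
import pandas as pd


def top_k_one_hot(series: pd.Series, k: int = 10, prefix: str = "query") -> pd.DataFrame:
    """One-hot encode the k most frequent values; every other value maps to an all-zero row."""
    top_values = series.value_counts().nlargest(k).index
    # Values outside the top k become NaN. get_dummies ignores NaN by default
    # (dummy_na=False), so no explicit "other" column is created.
    kept = series.where(series.isin(top_values))
    return pd.get_dummies(kept, prefix=prefix, dtype=int)


# Example with k=2: "c.com" is outside the top 2 and becomes a row of zeros.
queries = pd.Series(["a.com", "a.com", "b.com", "c.com", "a.com", "b.com"])
encoded = top_k_one_hot(queries, k=2)
```

For inference, the set of top values would need to be saved at training time and reused, so that the columns stay the same between training and scoring.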

diego-lopez8 commented 6 months ago

Please help in debugging SSH.log

(tf) diego@Troys-MacBook-Pro-2 NIDS % python3 train.py --log-dir  /opt/homebrew/var/logs --modules SSH
2024-04-25 15:43:18,975 - root - INFO - Using Modules ['SSH'] (train.py)
2024-04-25 15:43:18,975 - root - INFO - Using logdir: /opt/homebrew/var/logs (train.py)
2024-04-25 15:43:18,975 - root - INFO - Using Parameters - max_size_ae: 30, grace_feature_mapping: 5000, grace_anomaly_detector: 50000, learning_rate: 0.001, hidden_ratio: 0.5 (train.py)
2024-04-25 15:43:18,976 - root - INFO - Checking /opt/homebrew/var/logs/2024-02-07 (train.py)
2024-04-25 15:43:18,976 - root - INFO - Checking /opt/homebrew/var/logs/2024-02-09 (train.py)
2024-04-25 15:43:18,978 - root - INFO - Checking /opt/homebrew/var/logs/2024-02-08 (train.py)
2024-04-25 15:43:18,980 - root - INFO - Checking /opt/homebrew/var/logs/2024-03-08 (train.py)
2024-04-25 15:43:18,981 - root - INFO - Checking /opt/homebrew/var/logs/2024-03-01 (train.py)
2024-04-25 15:43:18,982 - root - INFO - Checking /opt/homebrew/var/logs/2024-03-06 (train.py)
2024-04-25 15:43:18,984 - root - INFO - Checking /opt/homebrew/var/logs/2022-09-24 (train.py)
2024-04-25 15:43:18,985 - root - INFO - Checking /opt/homebrew/var/logs/2024-03-07 (train.py)
2024-04-25 15:43:18,986 - root - INFO - Checking /opt/homebrew/var/logs/2024-02-24 (train.py)
2024-04-25 15:43:18,987 - root - INFO - Checking /opt/homebrew/var/logs/2024-02-23 (train.py)
2024-04-25 15:43:18,989 - root - INFO - Checking /opt/homebrew/var/logs/2024-02-12 (train.py)
2024-04-25 15:43:18,990 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-12/ssh.00:05:28-01:11:26.log.gz (train.py)
Traceback (most recent call last):
  File "/Users/diego/miniforge3/envs/tf/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3802, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'auth_success'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/diego/Projects/NIDS/NIDS/train.py", line 238, in <module>
    main()
  File "/Users/diego/Projects/NIDS/NIDS/train.py", line 207, in main
    np_arr = preprocess_json_ssh(json_data_file)
  File "/Users/diego/Projects/NIDS/NIDS/utils.py", line 236, in preprocess_json_ssh
    df['auth_success'] = df['auth_success'].replace({False: 0, True: 1})
  File "/Users/diego/miniforge3/envs/tf/lib/python3.9/site-packages/pandas/core/frame.py", line 3807, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/diego/miniforge3/envs/tf/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3804, in get_loc
    raise KeyError(key) from err
KeyError: 'auth_success'
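
The traceback shows `preprocess_json_ssh` indexing `df['auth_success']` unconditionally, so it fails whenever the parsed file has no such column. The thread does not say how this was eventually fixed; the following is only one possible guard. The helper name `encode_auth_success` and the `-1` placeholder for "unknown" are assumptions made for illustration.

```python
import pandas as pd


def encode_auth_success(df: pd.DataFrame, unknown_value: int = -1) -> pd.DataFrame:
    """Return a copy with auth_success as 1/0, and unknown_value where it is absent."""
    out = df.copy()
    if "auth_success" not in out.columns:
        # No record in this file carried the field: fill the whole column.
        out["auth_success"] = unknown_value
        return out
    # Map booleans to integers; missing entries become NaN and are then filled.
    out["auth_success"] = (
        out["auth_success"].map({True: 1, False: 0}).fillna(unknown_value).astype(int)
    )
    return out
```

Whether an "unknown" value, a dropped row, or a separate indicator column is the better choice depends on how the downstream model treats that feature.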

zoe70416 commented 5 months ago

@diego-lopez8 Debugged.

(base) Zoe@Zoes-MBP NIDS % python train.py --log-dir /usr/local/logs --modules SSH
2024-05-22 09:13:32,968 - root - INFO - Using Modules ['SSH'] (train.py)
2024-05-22 09:13:32,968 - root - INFO - Using logdir: /usr/local/logs (train.py)
2024-05-22 09:13:32,968 - root - INFO - Using Parameters - max_size_ae: 30, grace_feature_mapping: 5000, grace_anomaly_detector: 50000, learning_rate: 0.001, hidden_ratio: 0.5 (train.py)
2024-05-22 09:13:32,968 - root - INFO - Checking /usr/local/logs/2024-02-09 (train.py)
2024-05-22 09:13:32,969 - root - INFO - Checking /usr/local/logs/2024-02-12 (train.py)
2024-05-22 09:13:32,972 - root - INFO - Checking /usr/local/logs/2024-02-11 (train.py)
2024-05-22 09:13:32,975 - root - INFO - Checking /usr/local/logs/2024-02-10 (train.py)
2024-05-22 09:13:32,978 - root - INFO - Model is saved successfully as ssh_kit.joblib. (train.py)

diego-lopez8 commented 5 months ago

Awesome! Can you make a PR to main for this fix and we can close the ticket?

zoe70416 commented 5 months ago

sure, will do it tomorrow!

diego-lopez8 commented 5 months ago

@zoe70416 was a PR made?

diego-lopez8 commented 4 months ago

merged

NYU-HSRN-Network-Data-Science-Group / AutoZeekWatch

Make preprocessing functions for DNS.log, SSH.log, HTTP.log and SSL.log #25