NYU-HSRN-Network-Data-Science-Group / AutoZeekWatch

An online, deployable machine learning network intrusion detection system for Zeek.
MIT License

catch non-json conn & skip #15

Closed: olive-jy-song closed 4 months ago

diego-lopez8 commented 4 months ago

The json_data_file returned by ungzip() is actually a sequence of JSON objects, one per line, not a single JSON document. As a result, the train function skips every file.

Here are my logs:

python train.py --log-dir /opt/homebrew/var/logs
2024-02-13 10:02:53,192 - root - INFO - Using logdir: /opt/homebrew/var/logs (train.py)
2024-02-13 10:02:53,192 - root - INFO - Checking /opt/homebrew/var/logs/2024-02-07 (train.py)
2024-02-13 10:02:53,192 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-07/conn.23:00:00-00:13:21.log.gz (train.py)
2024-02-13 10:02:53,195 - root - ERROR - File /opt/homebrew/var/logs/2024-02-07/conn.23:00:00-00:13:21.log.gz is not JSON. Skipping. (train.py)
2024-02-13 10:02:53,195 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-07/conn.22:00:00-23:00:00.log.gz (train.py)
2024-02-13 10:02:53,198 - root - ERROR - File /opt/homebrew/var/logs/2024-02-07/conn.22:00:00-23:00:00.log.gz is not JSON. Skipping. (train.py)
2024-02-13 10:02:53,198 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-07/conn.21:51:47-22:00:00.log.gz (train.py)
2024-02-13 10:02:53,198 - root - ERROR - File /opt/homebrew/var/logs/2024-02-07/conn.21:51:47-22:00:00.log.gz is not JSON. Skipping. (train.py)
2024-02-13 10:02:53,199 - root - INFO - Checking /opt/homebrew/var/logs/2024-02-09 (train.py)
2024-02-13 10:02:53,200 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-09/conn.06:06:34-07:00:00.log.gz (train.py)
2024-02-13 10:02:53,201 - root - ERROR - File /opt/homebrew/var/logs/2024-02-09/conn.06:06:34-07:00:00.log.gz is not JSON. Skipping. (train.py)
2024-02-13 10:02:53,201 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-09/conn.09:00:00-10:00:00.log.gz (train.py)
2024-02-13 10:02:53,202 - root - ERROR - File /opt/homebrew/var/logs/2024-02-09/conn.09:00:00-10:00:00.log.gz is not JSON. Skipping. (train.py)
2024-02-13 10:02:53,202 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-09/conn.23:06:59-00:07:56.log.gz (train.py)
2024-02-13 10:02:53,202 - root - ERROR - File /opt/homebrew/var/logs/2024-02-09/conn.23:06:59-00:07:56.log.gz is not JSON. Skipping. (train.py)
2024-02-13 10:02:53,202 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-09/conn.15:00:00-16:00:00.log.gz (train.py)
2024-02-13 10:02:53,204 - root - ERROR - File /opt/homebrew/var/logs/2024-02-09/conn.15:00:00-16:00:00.log.gz is not JSON. Skipping. (train.py)
2024-02-13 10:02:53,204 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-09/conn.07:00:00-08:00:00.log.gz (train.py)
2024-02-13 10:02:53,204 - root - ERROR - File /opt/homebrew/var/logs/2024-02-09/conn.07:00:00-08:00:00.log.gz is not JSON. Skipping. (train.py)
2024-02-13 10:02:53,204 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-09/conn.17:00:00-18:03:06.log.gz (train.py)
2024-02-13 10:02:53,205 - root - ERROR - File /opt/homebrew/var/logs/2024-02-09/conn.17:00:00-18:03:06.log.gz is not JSON. Skipping. (train.py)
2024-02-13 10:02:53,205 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-09/conn.19:10:10-20:05:05.log.gz (train.py)
2024-02-13 10:02:53,205 - root - ERROR - File /opt/homebrew/var/logs/2024-02-09/conn.19:10:10-20:05:05.log.gz is not JSON. Skipping. (train.py)
2024-02-13 10:02:53,205 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-09/conn.21:05:47-22:09:47.log.gz (train.py)
2024-02-13 10:02:53,206 - root - ERROR - File /opt/homebrew/var/logs/2024-02-09/conn.21:05:47-22:09:47.log.gz is not JSON. Skipping. (train.py)
2024-02-13 10:02:53,206 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-09/conn.14:00:00-15:00:00.log.gz (train.py)
2024-02-13 10:02:53,207 - root - ERROR - File /opt/homebrew/var/logs/2024-02-09/conn.14:00:00-15:00:00.log.gz is not JSON. Skipping. (train.py)
2024-02-13 10:02:53,207 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-09/conn.00:00:00-01:07:32.log.gz (train.py)
2024-02-13 10:02:53,207 - root - ERROR - File /opt/homebrew/var/logs/2024-02-09/conn.00:00:00-01:07:32.log.gz is not JSON. Skipping. (train.py)
2024-02-13 10:02:53,207 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-09/conn.13:05:43-14:00:00.log.gz (train.py)
2024-02-13 10:02:53,208 - root - ERROR - File /opt/homebrew/var/logs/2024-02-09/conn.13:05:43-14:00:00.log.gz is not JSON. Skipping. (train.py)
2024-02-13 10:02:53,208 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-09/conn.01:07:32-02:10:32.log.gz (train.py)
2024-02-13 10:02:53,208 - root - ERROR - File /opt/homebrew/var/logs/2024-02-09/conn.01:07:32-02:10:32.log.gz is not JSON. Skipping. (train.py)
2024-02-13 10:02:53,209 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-09/conn.18:03:06-19:10:10.log.gz (train.py)
2024-02-13 10:02:53,209 - root - ERROR - File /opt/homebrew/var/logs/2024-02-09/conn.18:03:06-19:10:10.log.gz is not JSON. Skipping. (train.py)
2024-02-13 10:02:53,209 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-09/conn.03:01:43-04:09:16.log.gz (train.py)
2024-02-13 10:02:53,209 - root - ERROR - File /opt/homebrew/var/logs/2024-02-09/conn.03:01:43-04:09:16.log.gz is not JSON. Skipping. (train.py)
2024-02-13 10:02:53,209 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-09/conn.12:04:12-13:05:43.log.gz (train.py)
2024-02-13 10:02:53,210 - root - ERROR - File /opt/homebrew/var/logs/2024-02-09/conn.12:04:12-13:05:43.log.gz is not JSON. Skipping. (train.py)
2024-02-13 10:02:53,210 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-09/conn.02:10:32-03:01:43.log.gz (train.py)
2024-02-13 10:02:53,210 - root - ERROR - File /opt/homebrew/var/logs/2024-02-09/conn.02:10:32-03:01:43.log.gz is not JSON. Skipping. (train.py)
2024-02-13 10:02:53,210 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-09/conn.16:00:00-17:00:00.log.gz (train.py)
2024-02-13 10:02:53,211 - root - ERROR - File /opt/homebrew/var/logs/2024-02-09/conn.16:00:00-17:00:00.log.gz is not JSON. Skipping. (train.py)
2024-02-13 10:02:53,211 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-09/conn.20:05:05-21:05:47.log.gz (train.py)
2024-02-13 10:02:53,211 - root - ERROR - File /opt/homebrew/var/logs/2024-02-09/conn.20:05:05-21:05:47.log.gz is not JSON. Skipping. (train.py)
2024-02-13 10:02:53,211 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-09/conn.11:12:49-12:04:12.log.gz (train.py)
2024-02-13 10:02:53,212 - root - ERROR - File /opt/homebrew/var/logs/2024-02-09/conn.11:12:49-12:04:12.log.gz is not JSON. Skipping. (train.py)
2024-02-13 10:02:53,212 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-09/conn.22:09:47-23:06:59.log.gz (train.py)
2024-02-13 10:02:53,212 - root - ERROR - File /opt/homebrew/var/logs/2024-02-09/conn.22:09:47-23:06:59.log.gz is not JSON. Skipping. (train.py)
2024-02-13 10:02:53,212 - root - INFO - Opening file /opt/homebrew/var/logs/2024-02-09/conn.04:09:16-05:13:17.log.gz (train.py)
Traceback (most recent call last):
  File "/Users/diego/Projects/NIDS/NIDS/train.py", line 105, in <module>
    main()
  File "/Users/diego/Projects/NIDS/NIDS/train.py", line 95, in main
    np_arr = preprocess_json(json_data_file)
  File "/Users/diego/Projects/NIDS/NIDS/utils.py", line 43, in preprocess_json
    for line in json_batch.splitlines():
AttributeError: 'dict' object has no attribute 'splitlines'

Each line of the file is a separate JSON object; preprocess_json then parses them one at a time with:

    for line in json_batch.splitlines():
        # log_entry is now a single json log from the file
        log_entry = json.loads(line.strip())
        data_list.append([log_entry[feature] for feature in features])
    np_arr = np.array(data_list)

Maybe it's enough to test just the first line of the file to see whether it is JSON and, if so, load the whole file? If you have a better idea, feel free to use it instead.
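Something like the following is roughly what I have in mind. It is only a sketch: first_line_is_json and read_json_lines are hypothetical names, not functions that exist in the repo, and I am assuming the caller wants the decompressed text back as a string, since preprocess_json calls splitlines() on it.

```python
import gzip
import json


def first_line_is_json(path):
    """Return True if the first non-empty line of a gzipped file is a JSON object.

    Hypothetical helper, not part of the current code base.
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # ignore leading blank lines
            try:
                # Each log entry is expected to be a JSON object (a dict).
                return isinstance(json.loads(line), dict)
            except json.JSONDecodeError:
                return False
    return False  # empty file


def read_json_lines(path):
    """Return the decompressed text if the file looks like JSON lines, else None.

    Hypothetical helper: the caller can skip the file when None comes back
    and otherwise pass the string on to preprocess_json.
    """
    if not first_line_is_json(path):
        return None
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return fh.read()
```

This only validates the first line, so a file that starts with valid JSON and turns into garbage later would still fail inside preprocess_json. That may be acceptable for now, or the per-line json.loads there could be wrapped in its own try/except.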

You can test this by adding a file containing random characters and gzipping it, so that it gets processed by the train script.
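If it helps, a throwaway script along these lines could generate such a file. write_garbage_gz and the default file name are made up for illustration; the only assumption is that the train script picks up any .gz file it finds in the directory it is checking.

```python
import gzip
import os
import random
import string


def write_garbage_gz(directory, name="conn.garbage.log.gz", size=256, seed=0):
    """Write a gzipped file of random letters and punctuation; return its path.

    Hypothetical test helper, not part of the repo. The seed keeps the
    contents reproducible between runs.
    """
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.punctuation
    text = "".join(rng.choice(alphabet) for _ in range(size))
    path = os.path.join(directory, name)
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        fh.write(text)
    return path
```

Pointing it at one of the dated subdirectories under the log dir (the ones train.py reports as "Checking ...") should be enough for the file to be picked up on the next run.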

diego-lopez8 commented 4 months ago

Also, please rebase so we can test the other features (like skipping top-level files) :)