The code includes the function is_include_key_word (see below, and the link), which is used when concatenating individual time windows. I could not find a description of this function in the paper. Have I overlooked it? I am a bit surprised by the function and would like to discuss its purpose.
According to the comment, it should filter out nodes that do not appear in the training/validation data (i.e. noise).
This does not match the description in the paper, and which nodes these are cannot be known in advance.
In addition, according to the comment, nodes that occur frequently in the test data but contribute little to detection should be filtered. This cannot be known in advance either.
Furthermore, keywords such as netflow, var, usr, cadet occur very frequently.
The function can also be found in the other runs for the other data sets, with adjusted keywords in each case.
Experiments have shown that without any detection (i.e. without considering the anomalousness of nodes), using only rareness and the keyword filter, and with the evaluation threshold lowered from 100 to 20, a result of tn: 169, fp: 6, fn: 0, tp: 4 is possible.
Without the keyword filter, no time window (TW) is recognized correctly.
(Log outputs are attached)
Since experiments have shown that the function has a significant influence on recognition performance and a more detailed description is not available, I wanted to ask how it should be understood in the overall context.
Without detection and with threshold 20 instead of 100 (with 100, everything is classified as negative and nothing is detected):
2023-09-13 10:18:10 - INFO - Anomalous queue: ['2018-04-06_11:03:19.756210028~2018-04-06_11:18:26.126177915.txt', '2018-04-06_11:18:26.126177915~2018-04-06_11:33:35.116170745.txt', '2018-04-06_11:33:35.116170745~2018-04-06_11:48:42.606135188.txt', '2018-04-06_11:48:42.606135188~2018-04-06_12:03:50.186115455.txt', '2018-04-06_12:03:50.186115455~2018-04-06_14:01:32.489584227.txt']
2023-09-13 10:18:10 - INFO - Anomaly score: 25.806802004279533
2023-09-13 10:18:10 - INFO - Anomalous queue: ['2018-04-07_00:00:00.008778912~2018-04-07_00:15:00.638758012.txt', '2018-04-07_00:15:00.638758012~2018-04-07_00:30:00.678739107.txt', '2018-04-07_22:48:54.756943468~2018-04-07_23:03:54.806921896.txt', '2018-04-07_23:03:54.806921896~2018-04-07_23:20:00.056902847.txt', '2018-04-07_23:35:19.036879610~2018-04-07_23:50:19.096860042.txt']
2023-09-13 10:18:10 - INFO - Anomaly score: 28.338620976295744
2023-09-13 10:18:10 - INFO - tn: 169
2023-09-13 10:18:10 - INFO - fp: 6
2023-09-13 10:18:10 - INFO - fn: 0
2023-09-13 10:18:10 - INFO - tp: 4
2023-09-13 10:18:10 - INFO - precision: 0.4
2023-09-13 10:18:10 - INFO - recall: 1.0
2023-09-13 10:18:10 - INFO - fscore: 0.5714285714285715
2023-09-13 10:18:10 - INFO - accuracy: 0.9664804469273743
2023-09-13 10:18:10 - INFO - auc_val: 0.9828571428571429
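As a sanity check, the precision, recall, F-score and accuracy in the log above follow directly from the four confusion-matrix counts. The sketch below recomputes them with the standard formulas; the helper name confusion_metrics is my own and is not part of the Kairos code base.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Recompute standard metrics from confusion-matrix counts.

    Hypothetical helper for illustration; not taken from the repository.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    fscore = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "fscore": fscore,
        "accuracy": accuracy,
    }


# Counts taken from the log output above.
metrics = confusion_metrics(tn=169, fp=6, fn=0, tp=4)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

The values agree with the logged ones up to floating-point rounding. The auc_val cannot be recomputed this way, since it depends on the per-window anomaly scores and not only on the counts.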
def is_include_key_word(s):
    # The following common nodes don't exist in the training/validation data, but
    # will have the influences to the construction of anomalous queue (i.e. noise).
    # These nodes frequently exist in the testing data but don't contribute much to
    # the detection (including temporary files or files with random name).
    # Assume the IDF can keep being updated with the new time windows, these
    # common nodes can be filtered out.
    keywords = [
        'netflow',
        '/home/george/Drafts',
        'usr',
        'proc',
        'var',
        'cadet',
        '/var/log/debug.log',
        '/var/log/cron',
        '/home/charles/Drafts',
        '/etc/ssl/cert.pem',
        '/tmp/.31.3022e',
    ]
    flag = False
    for i in keywords:
        if i in s:
            flag = True
    return flag
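To illustrate how broad this filter is: the check is a plain substring match, so short keywords such as usr, var and proc match any node name that contains them anywhere. The sketch below uses the keyword list from the function above with an equivalent any() formulation (my rewrite, not the repository's code). The sample node names are hypothetical and chosen only for illustration; they are not taken from the CADETS data.

```python
KEYWORDS = [
    'netflow',
    '/home/george/Drafts',
    'usr',
    'proc',
    'var',
    'cadet',
    '/var/log/debug.log',
    '/var/log/cron',
    '/home/charles/Drafts',
    '/etc/ssl/cert.pem',
    '/tmp/.31.3022e',
]


def is_include_key_word(s):
    # Equivalent to the loop-and-flag version: True if any keyword
    # occurs as a substring of s.
    return any(keyword in s for keyword in KEYWORDS)


# Hypothetical node names, invented for this example.
samples = [
    "/usr/bin/python",
    "/var/spool/mail/root",
    "/tmp/processing.tmp",     # matches 'proc' inside "processing"
    "/tmp/variables.json",     # matches 'var' inside "variables"
    "/home/alice/report.txt",
    "/etc/passwd",
]

matched = [s for s in samples if is_include_key_word(s)]
unmatched = [s for s in samples if not is_include_key_word(s)]

print("filtered out:", matched)
print("kept:", unmatched)
```

With these samples, everything under /usr and /var is filtered, as are unrelated names that merely contain proc or var. This is why I would like to understand how the keywords were chosen.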
I have the same confusion. It seems that the graph learning module did not play a significant role; rather, the noise filter in the rareness section played the crucial role.
Without Keyword-Filter:
Without complete Rareness Score (just Anomalousness) = same metrics (everything detected):
Function it concerns:
https://github.com/ProvenanceAnalytics/kairos/blame/0e0b633beb46a1117c0a6d63be5d2481b59ac0dc/DARPA/CADETS_E3/anomalous_queue_construction.py#L93