
FDUDSDE / MAGIC

Codes and data for USENIX Security 24 paper "MAGIC: Detecting Advanced Persistent Threats via Masked Graph Representation Learning"

MIT License


Wget Dataset #8

Closed kamelferrahi closed 6 months ago

kamelferrahi commented 6 months ago

I've encountered an issue that I'd like to raise. The compressed wget dataset that you provide appears to differ from the dataset produced from the raw wget logs by the wget_parser module. With the compressed dataset I obtain satisfactory results, whereas with the dataset preprocessed by wget_parser every test point is classified as an attack. I've also noticed that the node dimension differs between the two: the compressed wget dataset has a node dimension of 8, while the dataset preprocessed from the raw logs has a node dimension of 14.

Jimmyokok commented 6 months ago

Found another bug. Line 719 in wget_parser.py looks like this:

if True:# src_type in valid_node_type and dst_type in valid_node_type:

It should be:

if src_type in valid_node_type and dst_type in valid_node_type:

Now the preprocessed dataset should have a node dimension of 8, and its size should match the compressed dataset. If not, consider retraining the model, as the underlying data distribution will have changed.
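The effect of the corrected condition can be sketched as follows. This is a minimal illustration, not the code in wget_parser.py: the set of type names is taken from this thread, while `keep_edge` and the sample `edges` list are hypothetical.

```python
# Node types retained in the 8-type setting (names as listed in this thread).
VALID_NODE_TYPES = {
    "task", "file", "path", "link",
    "address", "socket", "process_memory", "mmaped_file",
}


def keep_edge(src_type, dst_type, valid_node_type=VALID_NODE_TYPES):
    """Return True only when both endpoints have a retained node type."""
    return src_type in valid_node_type and dst_type in valid_node_type


# Hypothetical (src_type, dst_type) pairs, for illustration only.
edges = [("task", "file"), ("task", "argv"), ("pipe", "task"), ("socket", "task")]

# With `if True:` every edge would be kept; with the check, edges touching
# argv or pipe are dropped.
kept = [edge for edge in edges if keep_edge(*edge)]
print(kept)
```

With the `if True:` version every edge passes, so all 14 node types end up in the graph, which is consistent with the node dimension of 14 reported above.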

kamelferrahi commented 6 months ago

Oh yes, thanks a lot! May I ask why we should keep only those 8 node types?

Jimmyokok commented 6 months ago

We focus on processes (task), files (file/path/link), network connections (address/socket), and memory-related node types (process_memory/mmaped_file). Memory-related nodes potentially serve as intermediate entities between processes and other processes or files, and they are very numerous, so we keep them.

The other 6 node types are argv, iattr, pipe, block, xattr and shm. Among them, argv, iattr, xattr and shm are not helpful for unsupervised behavior-based detection. We discard pipe and block because of their limited quantities. Frequencies of node types in the wget dataset are: {'file': 448800, 'process_memory': 1500303, 'argv': 52175, 'task': 1832578, 'mmaped_file': 655447, 'iattr': 33432, 'path': 298485, 'socket': 550286, 'address': 58475, 'pipe': 32802, 'link': 177507, 'xattr': 1981, 'block': 106, 'shm': 40}
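To put those frequencies in perspective, the short sketch below sums the counts quoted above and reports how many nodes survive when the six discarded types are removed. The counts are copied from this comment; the variable names are only illustrative.

```python
# Node-type frequencies in the wget dataset, as quoted in the thread.
node_type_counts = {
    "file": 448800, "process_memory": 1500303, "argv": 52175,
    "task": 1832578, "mmaped_file": 655447, "iattr": 33432,
    "path": 298485, "socket": 550286, "address": 58475,
    "pipe": 32802, "link": 177507, "xattr": 1981,
    "block": 106, "shm": 40,
}

# The six types the thread says are discarded.
DROPPED_TYPES = {"argv", "iattr", "pipe", "block", "xattr", "shm"}

kept_counts = {t: n for t, n in node_type_counts.items() if t not in DROPPED_TYPES}
total_nodes = sum(node_type_counts.values())
kept_nodes = sum(kept_counts.values())

print(f"kept types: {len(kept_counts)}")
print(f"kept nodes: {kept_nodes} of {total_nodes} ({kept_nodes / total_nodes:.1%})")
```

By these counts the six discarded types account for roughly 2% of all nodes, so the 8-type graph retains the large majority of the data.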

Besides, we never claim that MAGIC only works with this combination of node types.

kamelferrahi commented 6 months ago

Okay, thanks for your valuable help.
I have experimented with the 14 node types, but unfortunately the results weren't satisfactory.

Jimmyokok commented 6 months ago

I ran it myself and also got unsatisfactory results. The key problem is that if 14 types are used instead of 8, the index of the task type changes from 2 to 3, which breaks the following code:

if dataset != 'wget':
    out = pooler(g, out).cpu().numpy()
else:
    out = pooler(g, out, [2]).cpu().numpy()

In the 8-type case, index 2 represents task, which is what we want to focus on. In the 14-type case, index 2 actually points to argv, and detecting abnormality in argv is meaningless, which results in the unsatisfactory performance you and I have seen.
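The index shift can be reproduced with a small sketch. It assumes that type indices are assigned in order of first appearance and that this order matches the frequency dictionary quoted earlier in the thread; both are assumptions, although they are consistent with the indices reported here. `build_type_index` is a hypothetical helper, not a function from the repository.

```python
# Order of node types as they appear in the frequency dictionary quoted in
# the thread (assumed to be the order in which the parser first sees them).
ALL_TYPES_IN_ORDER = [
    "file", "process_memory", "argv", "task", "mmaped_file", "iattr", "path",
    "socket", "address", "pipe", "link", "xattr", "block", "shm",
]
DROPPED_TYPES = {"argv", "iattr", "pipe", "block", "xattr", "shm"}


def build_type_index(types_in_order):
    """Assign consecutive integer indices in order of first appearance."""
    index = {}
    for node_type in types_in_order:
        if node_type not in index:
            index[node_type] = len(index)
    return index


index_14 = build_type_index(ALL_TYPES_IN_ORDER)
index_8 = build_type_index([t for t in ALL_TYPES_IN_ORDER if t not in DROPPED_TYPES])

print(index_8["task"], index_14["task"], index_14["argv"])  # 2 3 2
```

Looking the index up by name (for example `index_14["task"]`) instead of hard-coding `2` would keep the pooling call correct regardless of which node types are retained.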

After changing the index 2 to 3, the results are as follows:

AUC: 0.9672000000000001+0.011631680875952538
F1: 0.9508781620860278+0.018887909507804636
PRECISION: 0.9426373626373626+0.03640730819941588
RECALL: 0.9600000000000002+2.220446049250313e-16
TN: 23.5+1.02469507659596
FN: 1.0+0.0
TP: 24.0+0.0
FP: 1.5+1.02469507659596
#Test_AUC: 0.9672±0.0116

which is similar to the 8-type case.

kamelferrahi commented 6 months ago

Oh, I get it: we only aggregate the embeddings of the task nodes to get the final graph embedding. Thank you very much, this solution really helped me with my graduation project!
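The idea of pooling only over task nodes can be sketched in plain Python. This is an illustration under assumptions, not the repository's `pooler`: it assumes mean pooling, and `mean_pool` together with the toy embeddings is hypothetical.

```python
def mean_pool(embeddings, node_types, keep_types=None):
    """Average node embeddings, optionally restricted to given type indices.

    embeddings: list of equal-length lists of floats, one per node.
    node_types: list of integer type indices, one per node.
    keep_types: iterable of type indices to pool over, or None for all nodes.
    """
    if keep_types is not None:
        keep = set(keep_types)
        rows = [emb for emb, t in zip(embeddings, node_types) if t in keep]
    else:
        rows = list(embeddings)
    if not rows:
        raise ValueError("no nodes of the requested types")
    dim = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]


# Toy graph: two task nodes (type index 2) and one file node (type index 0).
embeddings = [[1.0, 0.0], [3.0, 2.0], [10.0, 10.0]]
node_types = [2, 2, 0]

graph_all = mean_pool(embeddings, node_types)         # averages all three nodes
graph_task = mean_pool(embeddings, node_types, [2])   # averages task nodes only
print(graph_task)  # [2.0, 1.0]
```

Passing the wrong index (for example the index of argv) would average a different set of nodes entirely, which is the failure mode described in this thread.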