FDUDSDE / MAGIC

Codes and data for USENIX Security 24 paper "MAGIC: Detecting Advanced Persistent Threats via Masked Graph Representation Learning"
MIT License

Question about wget dataset results #13

Open jiangdie666 opened 2 months ago

jiangdie666 commented 2 months ago

I evaluated the original dataset in two ways: with the pkl I trained myself using your project's code, and with the pre-trained pkl that comes with your project. Neither result is satisfactory. Is it because I didn't set some other parameters? Here are the evaluation results with the checkpoint from my own training:

python3 eval.py --dataset wget --device 0
Loading processed wget dataset...
[n_graph, n_node_feat, n_edge_feat]: [150, 8, 4]
Loading processed wget dataset...
[n_graph, n_node_feat, n_edge_feat]: [150, 8, 4]
AUC: 0.41680000000000006
F1: 0.6666666662222221
PRECISION: 0.5
RECALL: 1.0
TN: 0
FN: 0
TP: 25
FP: 25
#Test_AUC: 0.4168±0.0000

And these are the results with the pkl that comes with your project:

python3 eval.py --dataset wget --device 0
Loading processed wget dataset...
[n_graph, n_node_feat, n_edge_feat]: [150, 8, 4]
Loading processed wget dataset...
[n_graph, n_node_feat, n_edge_feat]: [150, 8, 4]
AUC: 0.47440000000000004
F1: 0.6666666662222221
PRECISION: 0.5
RECALL: 1.0
TN: 0
FN: 0
TP: 25
FP: 25
#Test_AUC: 0.4744±0.0000
jiangdie666 commented 2 months ago

Sorry to also report a zero-dimensional tensor error while training on the DARPA dataset. Have you run into this problem during training?

jiangdie666 commented 2 months ago

I wonder if my environment differs from yours. I'm running everything in the following environment: Python 3.10.13, pytorch==2.1.0, torchvision==0.16.0, torchaudio==2.1.0, pytorch-cuda=12.1, and DGL installed via conda install -c dglteam/label/th21_cu121 dgl. When setting up your environment requirements I couldn't find a suitable build of dgl==1.0.0 to install. Could you share how you installed DGL 1.0.0?

Jimmyokok commented 2 months ago

I have tried to evaluate wget under your environment setting (pytorch==2.1.0 and dgl==2.0.0). I'm getting the same results as using dgl==1.0.0 both with and without the pre-trained pkls.

Jimmyokok commented 2 months ago

Did you obtain the graphs.pkl from parsing the raw logs or from the pkl provided by MAGIC?

jiangdie666 commented 2 months ago

This time I used your pre-trained pkl directly and still got problematic results.

Jimmyokok commented 2 months ago

What is your k (i.e. num_neighbors)? Using k == 1 on the wget dataset could be the cause.
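
For context, a minimal sketch of why k matters (illustrative names, not the exact eval.py code): roughly speaking, the evaluation scores each test graph embedding by its distance to its k nearest benign training embeddings, so changing k changes every anomaly score.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_anomaly_scores(train_emb, test_emb, k=2):
    # Fit k-NN on benign training embeddings, then score each test graph by its
    # mean distance to its k nearest benign neighbors (higher = more anomalous).
    nn = NearestNeighbors(n_neighbors=k).fit(train_emb)
    dist, _ = nn.kneighbors(test_emb)
    return dist.mean(axis=1)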

Jimmyokok commented 2 months ago

The "zero-dimension" error is simply a bug. Modifying loss, _ = model(g) to loss = model(g) fixes the bug.

jiangdie666 commented 2 months ago

Sorry for such a simple code question, I can't believe I overlooked it. Training on the DARPA data now runs successfully, thank you! I didn't change k; as far as I can tell from the code, it defaults to 2 if you don't set it.

Jimmyokok commented 2 months ago

Yes. I'm getting normal evaluation results when k == 2 but results like yours when k == 1.

jiangdie666 commented 2 months ago

I am very sorry, I tried changing the value of k, but the results are still strange; the scores look off.

Jimmyokok commented 2 months ago

If your graphs.pkl is not the provided one, make sure the node type at index 2 is 'task'.

jiangdie666 commented 2 months ago

I found the problem. Earlier I said I used your original data, but in fact I only used your checkpoint.pkl; I forgot that graphs.pkl is also provided, inside the graphs.zip archive. The results I showed were generated with the project's own checkpoint.pkl but with a graphs.pkl I built myself from scratch, following the project's processing steps. Then I unzipped graphs.zip and used the project's own graphs.pkl together with its own checkpoint, and that achieves the expected results. So I think the problem is either in my initial data processing with the wget_parser.py script, or in the call to the load_rawdata function that generates graphs.pkl. I'll look into it again myself, thanks for the reply.

Jimmyokok commented 2 months ago

Does your version of graphs.pkl match the size of the provided one? If not, what is your data source? Most importantly, make sure the node type at index 2 is 'task', which is crucial for detection performance. If it is not, find the index for 'task' and modify line 28 of ./model/eval.py to out = pooler(g, out, [index_for_task]).cpu().numpy()
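
For illustration, the change around line 28 of ./model/eval.py would look roughly like this (only the pooler call itself comes from this thread; the index value is whatever your own graphs.pkl assigns to 'task'):

index_for_task = 2  # replace with the index your parser assigned to the 'task' node type
out = pooler(g, out, [index_for_task]).cpu().numpy()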

jiangdie666 commented 2 months ago

I regenerated the dataset and added code to wget_parser to print the node-type categories, and found that 'task' is indeed at index 3. So I changed the index in the eval code as you suggested, but the results are still incorrect, and my graphs.pkl is exactly the same size as the graphs.pkl in your zip. Very strange.

Jimmyokok commented 2 months ago

Is it possible that the order of the raw logs is different, which results in incorrect labeling during loaddata and triggers the shift in node type indices as a byproduct?

jiangdie666 commented 2 months ago

I just tried indices 0-7, and it still didn't work well. I'll re-download the data this afternoon and try building it again. It's a really strange problem.

Jimmyokok commented 2 months ago

I forget whether the attack logs should be the first 25 or the last 25 logs to be parsed, but this absolutely matters.

jiangdie666 commented 2 months ago

Your comment woke me up. I was so fixated on the idea that the fault wasn't in my environment or my code changes that I forgot to check whether the dataset had been processed correctly in the first place. I found that the original code lists the 150 graph logs with ls directly, so the first 25 parsed logs may not have corresponded to the ATTACK data. I modified the code and compared runs before and after the change, which confirmed this was the problem. So the 'task' index shifted only because the data wasn't processed in the right order, and in the end the eval code doesn't need to change: the index is 2. Thank you so much for answering my questions over and over again!
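
For anyone hitting the same issue, an illustrative sketch of this kind of fix (the path and filename pattern are assumptions, not the actual wget_parser.py code): sort the raw log files numerically instead of relying on the arbitrary order returned by a plain directory listing, so the attack and benign logs are always parsed in the same, correct order.

import os
import re

raw_dir = "data/wget/raw"  # hypothetical location of the 150 raw wget logs

def numeric_key(name):
    # Sort by the number embedded in the filename; fall back to 0 if there is none.
    digits = re.sub(r"\D", "", name)
    return int(digits) if digits else 0

log_files = sorted(os.listdir(raw_dir), key=numeric_key)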

SaraDadjouy commented 1 month ago

@jiangdie666 @Jimmyokok Hello. Thank you for sharing. I had the same problem.

I have another question. If I'm not mistaken, the original paper reports the results for wget as shown in the attached screenshot of the paper's results table.

I have done the Quick Evaluation and got the following results:

[n_graph, n_node_feat, n_edge_feat]: [150, 8, 4]
Loading processed wget dataset...
[n_graph, n_node_feat, n_edge_feat]: [150, 8, 4]
AUC: 0.9359999999999999
F1: 0.9056603768600924
PRECISION: 0.8571428571428571
RECALL: 0.96
TN: 21
FN: 1
TP: 24
FP: 4
#Test_AUC: 0.9360±0.0000

I also saw that the last results @jiangdie666 shared were close to mine. What might be the reason for the different results for Precision, F1, and AUC?

Jimmyokok commented 1 month ago


I have rerun the Quick Evaluation with exactly the same data, checkpoints and code as in this repository, which gives me this:

AUC: 0.96
F1: 0.9599999994999999
PRECISION: 0.96
RECALL: 0.96
TN: 24
FN: 1
TP: 24
FP: 1
#Test_AUC: 0.9600±0.0000

Then, I modified the code to repeat the evaluation with random seeds 0 to 49 and report the average, which gives me this:

AUC: 0.952864±0.013846093456278552
F1: 0.9595209114984354±0.016390904351784117
PRECISION: 0.9663880341880342±0.031628609309857315
RECALL: 0.9536±0.018521339044464343
TN: 24.14±0.8248636250920511
FN: 1.16±0.463033476111609
TP: 23.84±0.463033476111609
FP: 0.86±0.8248636250920512
#Test_AUC: 0.9529±0.0138

This is extremely strange, since I have never seen as many as 4 FPs. Meanwhile, I'm sure I have n_neighbor == 2, which is the standard setting, and I have tried these evaluations with PyTorch 1.x and 2.x respectively, both of which yield the same result.
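
For reference, a minimal sketch of the multi-seed averaging described above (set_random_seed is written out here; evaluate_once stands in for one Quick Evaluation run and is a hypothetical callable, not the repository's actual entry point):

import random
import numpy as np
import torch

def set_random_seed(seed):
    # Seed all relevant RNGs so each run is reproducible.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

def average_over_seeds(evaluate_once, n_seeds=50):
    # evaluate_once() should return the AUC of a single evaluation run.
    aucs = []
    for seed in range(n_seeds):
        set_random_seed(seed)
        aucs.append(evaluate_once())
    return np.mean(aucs), np.std(aucs)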

Jimmyokok commented 1 month ago

With seed 2022, which aligns with the repository code, I'm getting this:

AUC: 0.9616
F1: 0.9795918362349021
PRECISION: 1.0
RECALL: 0.96
TN: 25
FN: 1
TP: 24
FP: 0
#Test_AUC: 0.9616±0.0000

jiangdie666 commented 2 weeks ago

Actually, even with the original checkpoint-wget.pt that comes with your project, my FP count is still 4, far below your results above.

Jimmyokok commented 2 weeks ago

Could you try averaging over multiple seeds? Maybe seed 2022 just happens to perform very badly on other devices?

jiangdie666 commented 1 week ago

With the graphs.pkl extracted from the project's own graphs.zip, the results are as expected, which means there must still be a small bug in the wget data preprocessing code.