DSE-MSU / DeepRobust

A PyTorch adversarial library for attack and defense methods on images and graphs
MIT License

Why do I get such a high accuracy on the perturbed graph? #92

Closed Ma-Ruinan closed 2 years ago

Ma-Ruinan commented 2 years ago

I ran test_pgd.py, test_random.py, and test_DICE.py, as well as test_nettack.py, with ptb_rate=0.2. But when I test classification accuracy on the perturbed graph (modified_adj), the accuracy for all of these scripts is about 80%. I don't know why; can you help me?

ChandlerBang commented 2 years ago

Hi,

  1. Can you provide more details on how you run the global-attack scripts test_pgd.py, test_random.py, and test_DICE.py? They work fine on my side.

    > python test_pgd.py
    Loading citeseer dataset...
    Downloading from https://raw.githubusercontent.com/danielzuegner/gnn-meta-attack/master/data/citeseer.npz to /tmp/citeseer.npz
    Done!
    === testing GCN on clean graph ===
    Test set results: loss= 1.2319 accuracy= 0.5770
    === testing GCN on Evasion attack ===
    100%|█████████████████████████████████████████████████████████████████████████| 200/200 [00:31<00:00,  6.37it/s]
    Test set results: loss= 1.2319 accuracy= 0.5770
    === testing GCN on Poisoning attack ===
    Test set results: loss= 1.2830 accuracy= 0.5430
  2. For test_nettack.py, it is a targeted attack. The evaluation process is quite different from the above ones: you should check the accuracy on the attacked nodes instead of all the test nodes (see the sketch below).
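
For concreteness, here is a rough sketch of that targeted evaluation. It is not the exact example code: gcn, target_nodes, and the predict() helper are assumptions based on this setup.

import numpy as np

# Sketch only: measure accuracy on the attacked target nodes, not the full test set.
# Assumes a trained DeepRobust-style GCN `gcn`, the perturbed adjacency `modified_adj`,
# and `target_nodes` holding the indices Nettack attacked.
output = gcn.predict(features, modified_adj)              # class scores for every node (assumed helper)
preds = output.argmax(1).cpu().numpy()
labels_np = np.asarray(labels)
acc_targets = (preds[target_nodes] == labels_np[target_nodes]).mean()
acc_all_test = (preds[idx_test] == labels_np[idx_test]).mean()
print(f"accuracy on attacked nodes: {acc_targets:.4f} | on all test nodes: {acc_all_test:.4f}")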

Ma-Ruinan commented 2 years ago

Thanks for answering. I am learning global attack methods, so let's just talk about global attacks such as test_pgd.py, test_random.py, and test_DICE.py. Taking test_pgd.py as an example, the result I got is: [screenshots of the run output]

The experimental effect is not as good as in the paper (misclassification rates (%) under 5% perturbed edges): [screenshot of the paper's table]. As we can see, under the same setting the misclassification rate I got is about 3% lower than the paper's. Can you tell me the possible reasons?

Ma-Ruinan commented 2 years ago

If the results I got above are normal (maybe caused by a different implementation), I wonder what you think about the robustness of GCN. There are 5278 edges and 2708 nodes (largest connected component) in the Cora dataset. We perturb 5% of the edges (about 264 edges) in a global attack, but the node classification accuracy only decreases by 7-10%, and even when we perturb 50% of all the edges, we still get 34.5% accuracy. (There are 7 classes in the Cora dataset; randomly choosing a class for a node gives an accuracy of about 14%.)
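
For reference, the edge budget implied by ptb_rate follows straightforward arithmetic; a trivial sketch (the 5278-edge count is the figure quoted above, not re-derived from the data loader):

# Sketch of the budget arithmetic discussed above; 5278 is the edge count quoted in this thread.
n_edges = 5278
for ptb_rate in (0.05, 0.2, 0.5):
    n_perturbations = int(ptb_rate * n_edges)
    print(f"ptb_rate={ptb_rate}: about {n_perturbations} perturbed edges")
# 5% of 5278 is roughly 264 edges; 50% is roughly 2639 edges.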

Is this due to the powerful ability of semi-supervised learning in node classification tasks?

ChandlerBang commented 2 years ago

Q1. Performance discrepancy from the paper.

A1. The data splits used in the original paper are different from the ones we used. I have now updated test_pgd.py to make them consistent.

$ python test_pgd_new.py --dataset cora --seed=0
=== testing GCN on clean graph ===
Test set results: loss= 0.7849 accuracy= 0.8130
=== setup attack model ===
100%|█████████████████████████████████████████████████████████████████████████| 100/100 [00:08<00:00, 11.28it/s]
=== testing GCN on Evasion attack ===
Test set results: loss= 1.0142 accuracy= 0.7340
=== testing GCN on Poisoning attack ===
Test set results: loss= 1.0695 accuracy= 0.7200

There is still some inconsistency between our updated performance and the performance reported in their paper. If you check the authors' repo https://github.com/KaidiXu/GCN_ADV_Train, you will find that they use a different early stopping strategy and patience, which can greatly impact the final performance. For the other parts, we followed their code to implement the attack model, so it should be safe to use our PyTorch version.
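
For illustration, patience-based early stopping typically looks like the sketch below. This is a generic sketch, not the logic of either repository; train_one_epoch and the patience value of 100 are hypothetical placeholders.

import copy
import torch.nn.functional as F

# Generic sketch of early stopping on validation loss with a patience counter.
patience, wait, best_val = 100, 0, float("inf")
best_state = None
for epoch in range(1000):
    train_one_epoch(model, optimizer)                          # hypothetical training helper
    val_loss = F.nll_loss(model(features, adj)[idx_val], labels[idx_val]).item()
    if val_loss < best_val:
        best_val, wait = val_loss, 0
        best_state = copy.deepcopy(model.state_dict())         # remember the best weights so far
    else:
        wait += 1
        if wait >= patience:                                   # no improvement for `patience` epochs
            break
model.load_state_dict(best_state)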


Q2. There are 7 classes in the Cora dataset; randomly choosing a class for a node gives about 14% accuracy.

A2. While there are 7 classes in Cora, the number of samples in each class is different.

ipdb> from collections import Counter
ipdb> Counter(labels[data.idx_test])
Counter({3: 319, 4: 149, 2: 144, 0: 130, 5: 103, 1: 91, 6: 64})
ipdb> 319/len(data.idx_test)
0.319

As you can see, if we guess all the labels to be 3, we still get a performance of 31.9%. I just tested the new test_pgd.py, and I also obtain 31.9% when using ptb_rate=0.5.

$ python test_pgd_new.py --dataset cora --ptb_rate=0.5
=== testing GCN on clean graph ===
Test set results: loss= 0.7677 accuracy= 0.8190
=== setup attack model ===
100%|█████████████████████████████████████████████████████████████████████████| 100/100 [00:08<00:00, 12.26it/s]
=== testing GCN on Evasion attack ===
Test set results: loss= 1.8434 accuracy= 0.4270
=== testing GCN on Poisoning attack ===
Test set results: loss= 1.8690 accuracy= 0.3190

Ma-Ruinan commented 2 years ago

Really thanks! There was indeed a problem in the old test_pgd.py; now it is right! And I found that different data partitions caused by different random seeds do have a great impact on the results!

Ma-Ruinan commented 2 years ago

I have another question about test_pgd.py after you updated it. Previously: [screenshot of the old code] Now: [screenshot of the new code] I know that in the new version fake_labels is predicted by target_model, so we still don't use the test set's true labels. But I wonder, when we set up a test-time attack, why don't we use the train set's true labels plus the test set's predicted labels? I also wonder whether the test-time attack or the train-time attack builds modified_adj with a surrogate model. Sorry to disturb you again.

ChandlerBang commented 2 years ago

Thank you for pointing that out. I want to note that the training accuracy is usually very high, so the resulting performance does not differ a lot.

ipdb> (labels[idx_train] == fake_labels[idx_train]).sum()
tensor(140)

But I agree that using the true training labels plus the predicted test labels would make more sense. I just changed the fake labels to the combination of the real training labels and the predicted test labels. See below. https://github.com/DSE-MSU/DeepRobust/blob/2bcde200a5969dae32cddece66206a52c87c43e8/examples/graph/test_pgd.py#L91-L99
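
The idea can be sketched roughly as follows (a minimal sketch, not the linked repository code; victim_model and its predict() helper are assumptions based on this setup):

import torch

# Sketch only: true labels on the training nodes, model predictions everywhere else.
output = victim_model.predict(features, adj)          # assumed predict() helper of a trained GCN
fake_labels = output.argmax(1)                        # predicted class for every node
labels_t = torch.as_tensor(labels).to(fake_labels.device)
fake_labels[idx_train] = labels_t[idx_train]          # keep ground-truth labels on the train set
# fake_labels is then fed to the attack in place of the true test labels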

Ma-Ruinan commented 2 years ago

Really thanks!

Ma-Ruinan commented 2 years ago

I found this might be a reason for the issue: in your test_min_max.py, when you run the min-max topology attack, in one epoch you only do one step of the inner maximization, followed by the outer minimization (PGD for the test-time attack). In the paper, the authors say this: [screenshot of the paper's description of the algorithm]

But there is still a question: the authors say the PGD attack is easier than min-max, but in their results min-max causes a higher misclassification rate! This is contradictory to the results they showed. [screenshot of the results table]

ChandlerBang commented 2 years ago

Hi,

(1) Yeah, that may be the point. We are a bit short-handed on testing the results and tuning the hyper-parameter for the inner loops. It would be great if you could help us test it.

(2) When they say "easier", I think they mean that the optimization problem is easier to solve (or easier to find an optimal solution for), as it does not involve the complex inner optimization. As a result, compared to min-max attacks, CE-PGD and CW-PGD yield better attacking performance since it is easier to attack a pre-defined GCN. However, min-max attacks work better when we need to attack a model that will be retrained, because the min-max attack accounts for the inner optimization of the GNN model trained on the attacked data.
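
To make the alternating structure concrete, here is a toy, runnable sketch. It is not the DeepRobust or GCN_ADV_Train code; theta stands in for the GCN weights and delta for the edge perturbation.

import torch

# Toy sketch of alternating min-max optimization, NOT the actual attack code.
theta = torch.zeros(2, requires_grad=True)      # stand-in for the GCN weights
delta = torch.zeros(2, requires_grad=True)      # stand-in for the graph perturbation
opt_theta = torch.optim.SGD([theta], lr=0.1)

def loss_fn(theta, delta):
    # stand-in for the training/attack loss on the perturbed graph
    return ((theta - delta - 1.0) ** 2).sum()

for epoch in range(100):
    # inner step(s): retrain/update the model on the current perturbed "graph"
    for _ in range(1):                           # one inner step per epoch, as discussed above
        opt_theta.zero_grad()
        loss_fn(theta, delta).backward()
        opt_theta.step()
    # outer step: projected gradient update of the perturbation against the retrained model
    grad_delta, = torch.autograd.grad(loss_fn(theta, delta), delta)
    with torch.no_grad():
        delta += 0.1 * grad_delta                # move the perturbation to increase the loss
        delta.clamp_(0.0, 1.0)                   # crude projection onto a box "budget"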

Ma-Ruinan commented 2 years ago

I understand, really thanks!