DSE-MSU / DeepRobust

A pytorch adversarial library for attack and defense methods on images and graphs
MIT License

A possible solution to OOM in Metattack. #128

Open Leirunlin opened 1 year ago

Leirunlin commented 1 year ago

Hi! As mentioned in issues #90 and #127, OOM occurs when running Metattack on a higher version of PyTorch. I checked the source code in mettack.py and found that the function get_adj_score() seems to be the reason:

https://github.com/DSE-MSU/DeepRobust/blob/1c0ef07088d0b90dbaad6a8863c22258b487c5c8/deeprobust/graph/global_attack/mettack.py#L125-L139

I tried substituting lines 128 and 130 with explicit subtraction, and this avoids the OOM for me, i.e. using

adj_meta_grad = adj_meta_grad - adj_meta_grad.min()
adj_meta_grad = adj_meta_grad - torch.diag(torch.diag(adj_meta_grad, 0))

In fact, I found it is enough to replace only line 128. I think something goes wrong when "-=" and ".min()" are used together. It would be really helpful if anyone could offer an explanation.
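For context, a rough sketch of the relevant part of get_adj_score() with the proposed substitution (paraphrased from the permalink above; surrounding lines are omitted, so treat this as an illustration rather than a verbatim copy):

# deeprobust/graph/global_attack/mettack.py, get_adj_score() (abridged)
adj_meta_grad = adj_grad * (-2 * modified_adj + 1)  # line 126: score for flipping each edge

# Original lines 128 and 130 (in-place updates):
#   adj_meta_grad -= adj_meta_grad.min()
#   adj_meta_grad -= torch.diag(torch.diag(adj_meta_grad, 0))

# Proposed out-of-place replacements:
adj_meta_grad = adj_meta_grad - adj_meta_grad.min()                       # shift so the minimum entry is 0
adj_meta_grad = adj_meta_grad - torch.diag(torch.diag(adj_meta_grad, 0))  # zero the diagonal (self-loops)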

pqypq commented 1 year ago

Hi! I've encountered the same CUDA out-of-memory problem when using the following environment on Ubuntu 20.04.5 LTS:

numpy==1.21.6
scipy==1.7.3
torch==1.13.1
torch_geometric==2.2.0
torch_scatter==2.1.0+pt113cu116
torch_sparse==0.6.16+pt113cu116

I have already made the changes that you suggested. Could you please help me solve this problem? Thanks

Leirunlin commented 1 year ago

Hi! I'm sorry, I don't know why the changes don't work in your environment. Here is my environment along with the detailed steps that work for me:

numpy==1.23.4
scipy==1.9.3
torch==1.13.0
torch_geometric==2.2.0
torch_scatter==2.1.0
torch_sparse==0.6.15

I'm running the code on a GPU with 98 GB of memory. If no changes are made, I encounter CUDA out of memory just like you, after generating two or three graphs against Metattack. (I guess there are gradients or something that are not freed from the GPU.) After making the changes I mentioned, the memory cost on the Cora dataset is about 3000-4000 MB, which is acceptable to me. Could you provide more information about your problem, such as how you made the changes and what the memory cost is in your case? Thanks.

pqypq commented 1 year ago

Hi, thanks for your reply! I'm now working on Metattack on graph data, running the Cora dataset on a GPU with 10.76 GB of memory. I tried replacing

adj_meta_grad -= adj_meta_grad.min()
adj_meta_grad -= torch.diag(torch.diag(adj_meta_grad, 0))

with

adj_meta_grad = adj_meta_grad - adj_meta_grad.min()
adj_meta_grad = adj_meta_grad - torch.diag(torch.diag(adj_meta_grad, 0))

When the process reaches around 45%, I encounter the OOM error; the message is shown below:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 24.00 MiB 
(GPU 1; 10.76 GiB total capacity; 9.90 GiB already allocated; 11.56 MiB free; 9.91 GiB reserved in total by PyTorch) 
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
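(For reference, the fragmentation hint at the end of the message refers to PyTorch's caching-allocator configuration. A minimal way to try it, with 128 MB as a purely illustrative split size, is to set the environment variable before the first CUDA allocation, e.g. at the very top of the attack script:)

import os
# Must be set before PyTorch initializes its CUDA allocator; 128 MB is only an example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
import torch  # imported after setting the allocator config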

Do you have any suggestions for me?

Thanks!

Leirunlin commented 1 year ago

Hi! That is quite similar to what I observed before making any changes: the OOM occurs in the middle of training. To debug, I suggest you first check which part of Metattack leads to the problem in your environment. In issue #90, someone mentioned that the inner_train() function could also be problematic; while that function works fine for me, I suggest you check it as well (see the logging sketch below for one way to narrow this down). Anyway, I will try to reproduce the bug and the solution on other devices, but I'm afraid it may not be fast. Maybe we need someone else to provide more samples of the problem.
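A minimal, hypothetical way to see which step is holding on to memory (this helper is not part of DeepRobust) is to log the CUDA allocator statistics around the suspect calls:

import torch

def log_cuda(tag):
    # Current and peak memory occupied by tensors on the default CUDA device, in MiB.
    alloc = torch.cuda.memory_allocated() / 2**20
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"[{tag}] allocated: {alloc:.0f} MiB, peak: {peak:.0f} MiB")

# Hypothetical usage inside Metattack's loop, e.g. around inner_train() and
# get_adj_score(), to see which call makes the numbers grow every iteration:
#   log_cuda("before inner_train")
#   self.inner_train(...)
#   log_cuda("after inner_train")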

pqypq commented 1 year ago

Hi, thanks for your suggestions!

I tried to create an environment with the details below:

numpy==1.18.1
scipy==1.4.1
pytorch==1.8.0
torch_scatter==2.0.8
torch_sparse==0.6.12

Under these settings, I can successfully run the datasets 'cora', 'cora_ml', 'citeseer', and 'polblogs'. But the 'pubmed' dataset still hits the OOM problem. Have you ever encountered this?

Thanks!

Leirunlin commented 1 year ago

Hi! You can try MetaApprox from mettack.py; it is an approximate version of Metattack. In ProGNN, PubMed is attacked using MetaApprox (a rough usage sketch is included below). If it still does not work for you, maybe you should try more scalable attacks.
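A rough sketch of how MetaApprox can be invoked, following the pattern of the Metattack examples in this repository; the surrogate model, data variables, perturbation budget, and lambda_ value below are assumptions to adapt to your own setup:

from deeprobust.graph.global_attack import MetaApprox

# Assumes `surrogate` is a trained GCN surrogate and that `adj`, `features`, `labels`,
# `idx_train`, and `idx_unlabeled` come from your usual dataset-loading code.
attacker = MetaApprox(model=surrogate, nnodes=adj.shape[0],
                      feature_shape=features.shape,
                      attack_structure=True, attack_features=False,
                      lambda_=0, device=device).to(device)

n_perturbations = int(0.05 * (adj.sum() // 2))  # e.g. a 5% edge-perturbation budget
attacker.attack(features, adj, labels, idx_train, idx_unlabeled,
                n_perturbations, ll_constraint=False)
modified_adj = attacker.modified_adj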

nowyouseemejoe commented 1 year ago

I found an efficient implementation in GreatX, hope it helps you.

pqypq commented 1 year ago

Hi @Leirunlin @nowyouseemejoe , thanks for your advice, I will try that!

ChandlerBang commented 1 year ago

Thank you all for the great discussion and suggestions! We are a bit short-handed right now, so you may want to make a pull request directly if you find any bugs.

For the OOM issue, Metattack is very memory-consuming, and we need a ~30 GB GPU to run it on PubMed with MetaApprox. I have just added a scalable global attack, PRBCD.

pip install deeprobust==0.2.7

You may want to try python examples/graph/test_prbcd.py or take a look at test_prbcd.py.

EnyanDai commented 1 year ago

Hi, this problem can be solved by revising line 126 as: adj_meta_grad = adj_grad.detach() * (-2 * modified_adj.detach() + 1)
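A sketch of this change in context (the "original" line is paraphrased from the permalink earlier in this thread; line numbering follows that permalink):

# deeprobust/graph/global_attack/mettack.py, get_adj_score(), line 126
# Original:
#   adj_meta_grad = adj_grad * (-2 * modified_adj + 1)
# Suggested revision: detach both inputs, so the resulting score tensor carries no
# autograd history and does not keep the large meta-gradient graph alive:
adj_meta_grad = adj_grad.detach() * (-2 * modified_adj.detach() + 1)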


pqypq commented 1 year ago


Hi Enyan,

Thank you for your suggestion! I tried modifying the code as you described, but it still doesn't work on my device. May I ask how large your GPU memory is?

Thanks