[Closed by pccai 4 years ago]
Thank you very much for your interest in our research. Perhaps you can try a smaller batch size by running python train.py --batch_size 50. If that still runs out of memory, you can decrease it further.
@EagleW Could you please tell us which GPU you use? Thanks.
I use P100
Thanks.
I am trying to replicate your work and am getting OOM errors on a P100 GPU with 16 GB when using the default batch size of 200. How much memory did your P100 GPU have?
@davidrpugh Which part of the code are you running?
I think you can decrease the batch size to 50
Specifically, I ran the ./Existing paper reading/train.py script with default parameters on one 16 GB P100 GPU and got a GPU OOM error. I can decrease the batch size, but I wanted to start with the default parameters used in your paper. Was 200 the batch size used in the paper, or something smaller?
I think I used a much smaller batch size for it. In our experience, the batch size didn't influence model performance much.
I have tried batch sizes of 128, 64, 32, and 16. With --batch_size=16 I was able to train for one full epoch before getting a GPU OOM error. I am going to try again with 8, but I feel there must be something else I am missing.
@davidrpugh I think you can add torch.cuda.empty_cache() after line 263 in main.py.
The problem is that get_subgraph returns a very large adj_ingraph matrix. If the batch size is too large, the adjacency matrix becomes too large to fit in memory.
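As a rough back-of-envelope check on why this happens (adj_matrix_bytes is a hypothetical helper for illustration, not part of the repository), a dense adjacency matrix over N nodes needs N*N entries, so its memory footprint grows quadratically with the number of nodes the batch's merged subgraph pulls in:

```python
def adj_matrix_bytes(num_nodes: int, dtype_bytes: int = 4) -> int:
    """Memory for a dense num_nodes x num_nodes adjacency matrix
    (dtype_bytes = 4 assumes float32 entries)."""
    return num_nodes * num_nodes * dtype_bytes

# Doubling the node count of the subgraph roughly quadruples
# the adjacency matrix's memory footprint.
for n in (5_000, 10_000, 20_000):
    print(f"{n} nodes -> {adj_matrix_bytes(n) / 1e9:.1f} GB")
```

This is why halving the batch size can more than halve peak memory: fewer entities per batch means a smaller merged subgraph and a much smaller dense adjacency matrix.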
OS: Ubuntu 16.04 x64, 8-core, 16 GB RAM
CMD: python train.py --gpu=0

......
Finish loading valid
Finish loading test
Epoch 0
/usr/local/lib/python3.6/site-packages/torch/nn/functional.py:1386: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
  warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
Traceback (most recent call last):
  File "train.py", line 263, in <module>
    train(start_epoch+epoch)
  File "train.py", line 184, in train
    ntt[0], ntt[1], ntt[2])
  File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/PaperRobot/Existing paper reading/model/GATA.py", line 19, in forward
    graph = self.graph(node_features, adj)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/PaperRobot/Existing paper reading/model/GAT.py", line 18, in forward
    x = torch.cat([att(x, adj) for att in self.attentions], dim=1)
  File "/usr/PaperRobot/Existing paper reading/model/GAT.py", line 18, in <listcomp>
    x = torch.cat([att(x, adj) for att in self.attentions], dim=1)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/PaperRobot/Existing paper reading/model/graph_attention.py", line 37, in forward
    attention = F.dropout(attention, self.dropout, training=self.training)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/functional.py", line 830, in dropout
    else _VF.dropout(input, p, training))
RuntimeError: [enforce fail at CPUAllocator.cpp:56] posix_memalign(&data, gAlignment, nbytes) == 0. 12 vs 0