EagleW / PaperRobot

Code for PaperRobot: Incremental Draft Generation of Scientific Ideas
https://aclanthology.org/P19-1191
MIT License

It says that I need more RAM? Thanks. #13

Closed pccai closed 4 years ago

pccai commented 4 years ago

OS: Ubuntu 16.04 x64, 8 cores, 16 GB RAM
CMD: python train.py --gpu=0 ......

Finish loading valid
Finish loading test
Epoch 0
/usr/local/lib/python3.6/site-packages/torch/nn/functional.py:1386: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
  warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
Traceback (most recent call last):
  File "train.py", line 263, in <module>
    train(start_epoch+epoch)
  File "train.py", line 184, in train
    ntt[0], ntt[1], ntt[2])
  File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/PaperRobot/Existing paper reading/model/GATA.py", line 19, in forward
    graph = self.graph(node_features, adj)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/PaperRobot/Existing paper reading/model/GAT.py", line 18, in forward
    x = torch.cat([att(x, adj) for att in self.attentions], dim=1)
  File "/usr/PaperRobot/Existing paper reading/model/GAT.py", line 18, in <listcomp>
    x = torch.cat([att(x, adj) for att in self.attentions], dim=1)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/PaperRobot/Existing paper reading/model/graph_attention.py", line 37, in forward
    attention = F.dropout(attention, self.dropout, training=self.training)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/functional.py", line 830, in dropout
    else _VF.dropout(input, p, training))
RuntimeError: [enforce fail at CPUAllocator.cpp:56] posix_memalign(&data, gAlignment, nbytes) == 0. 12 vs 0
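A side note on the error itself: the failure is raised by CPUAllocator, so it is host RAM that ran out, not GPU memory. The trailing `12 vs 0` is the return value of posix_memalign, and error code 12 is ENOMEM, which can be checked from Python:

```python
import errno
import os

# posix_memalign returned 12 instead of 0; on Linux, 12 is ENOMEM,
# i.e. the process could not allocate the requested host memory.
assert errno.ENOMEM == 12
print(os.strerror(errno.ENOMEM))  # "Cannot allocate memory"
```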

EagleW commented 4 years ago

Thank you very much for your interest in our research. Perhaps you can try a smaller batch size, e.g. python train.py --batch_size 50. If that still fails, decrease it further.

GabrielLin commented 4 years ago

@EagleW Could you please tell us which GPU you use? Thanks.

EagleW commented 4 years ago

I use P100

GabrielLin commented 4 years ago

I use P100

Thanks.

davidrpugh commented 4 years ago

I am trying to replicate your work and am getting OOM errors on a P100 GPU with 16 GB when using the default batch size of 200. How much memory did your P100 GPU have?

EagleW commented 4 years ago

@davidrpugh Which part of the code are you running?

EagleW commented 4 years ago

I think you can decrease the batch size to 50.

davidrpugh commented 4 years ago

Specifically, I ran the ./Existing paper reading/train.py script with default parameters on one 16 GB P100 GPU and got an OOM error from the GPU. I can decrease the batch size, but I wanted to start with the default parameters used in your paper. Was 200 the batch size used in the paper, or something smaller?

EagleW commented 4 years ago

I think I used a much smaller batch size for it. Actually, the batch size didn't influence the model's performance much.

davidrpugh commented 4 years ago

I have tried batch sizes of 128, 64, 32, and 16. With --batch_size=16 I was able to train for one full epoch before getting a GPU OOM error. I am going to try again with 8, but I feel there must be something else I am missing.

EagleW commented 4 years ago

@davidrpugh I think you can add torch.cuda.empty_cache() after line 263 in main.py
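As a sketch of where that call would go (the wrapper function and its arguments below are hypothetical; only torch.cuda.empty_cache() comes from the suggestion above):

```python
import torch

def run_epochs(train, start_epoch, num_epochs):
    """Hypothetical stand-in for the epoch loop around line 263 of the script."""
    for epoch in range(num_epochs):
        train(start_epoch + epoch)  # the train(...) call from the traceback
        if torch.cuda.is_available():
            # Release cached, currently-unused blocks back to the CUDA driver
            # between epochs, so the cache does not keep growing across epochs.
            torch.cuda.empty_cache()
```

Note this only frees PyTorch's cached-but-unused GPU memory; it does not shrink live tensors, so a batch whose activations genuinely exceed GPU memory will still OOM.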

EagleW commented 4 years ago

The problem is that get_subgraph returns a very large adj_ingraph matrix. If the batch size is too large, the adjacency matrix becomes too large to fit in memory.
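A rough back-of-envelope check illustrates why batch size matters here (the per-example node count below is a made-up placeholder, not a number from get_subgraph): a dense float32 adjacency matrix over all nodes in the batch's subgraph grows quadratically with the node count.

```python
def dense_adj_bytes(num_nodes: int, itemsize: int = 4) -> int:
    """Memory for a dense num_nodes x num_nodes float32 adjacency matrix."""
    return num_nodes * num_nodes * itemsize

# Hypothetical illustration: if each example pulled in ~500 subgraph nodes,
# batch 200 -> 100,000 nodes -> ~37 GiB for one dense adjacency matrix,
# while batch 50 -> 25,000 nodes -> ~2.3 GiB.
for batch in (200, 50):
    nodes = batch * 500
    gib = dense_adj_bytes(nodes) / 1024**3
    print(f"batch {batch}: {nodes} nodes, {gib:.1f} GiB")
```

Quadratic growth is why halving the batch size cuts the adjacency matrix to a quarter of its size, which matches the large drops observed when going from 200 down to 50.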