INK-USC / RE-Net

Recurrent Event Network: Autoregressive Structure Inference over Temporal Knowledge Graphs (EMNLP 2020)
http://inklab.usc.edu/renet/

Out of memory when pretraining #43

Closed xxxiaol closed 3 years ago

xxxiaol commented 3 years ago

Hello Woojeong,

When I run pretrain.py as described in the instructions:

python pretrain.py -d ICEWS18 --gpu 0 --dropout 0.5 --n-hidden 200 --lr 1e-3 --max-epochs 20 --batch-size 1024

It gets an OOM error on the first batch:

Traceback (most recent call last):
  File "pretrain.py", line 139, in <module>
    train(args)
  File "pretrain.py", line 83, in train
    loss = model(batch_data, true_s, true_o, graph_dict)
  File "/home/liux/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data1/liux/RE-Net/global_model.py", line 47, in forward
    packed_input = self.aggregator(sorted_t, self.ent_embeds, graph_dict, reverse=reverse)
  File "/home/liux/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data1/liux/RE-Net/Aggregator.py", line 57, in forward
    self.rgcn1(batched_graph, reverse)
  File "/home/liux/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data1/liux/RE-Net/RGCN.py", line 39, in forward
    self.propagate(g, reverse)
  File "/data1/liux/RE-Net/RGCN.py", line 91, in propagate
    g.update_all(lambda x: self.msg_func(x, reverse), fn.sum(msg='msg', out='h'), self.apply_func)
  File "/home/liux/.local/lib/python3.6/site-packages/dgl/heterograph.py", line 4501, in update_all
    ndata = core.message_passing(g, message_func, reduce_func, apply_node_func)
  File "/home/liux/.local/lib/python3.6/site-packages/dgl/core.py", line 291, in message_passing
    msgdata = invoke_edge_udf(g, ALL, g.canonical_etypes[0], mfunc, orig_eid=orig_eid)
  File "/home/liux/.local/lib/python3.6/site-packages/dgl/core.py", line 82, in invoke_edge_udf
    return func(ebatch)
  File "/data1/liux/RE-Net/RGCN.py", line 91, in <lambda>
    g.update_all(lambda x: self.msg_func(x, reverse), fn.sum(msg='msg', out='h'), self.apply_func)
  File "/data1/liux/RE-Net/RGCN.py", line 84, in msg_func
    weight = self.weight.index_select(0, edges.data['type_s']).view(
RuntimeError: CUDA out of memory. Tried to allocate 11.06 GiB (GPU 1; 10.92 GiB total capacity; 578.54 MiB already allocated; 8.13 GiB free; 756.00 MiB reserved in total by PyTorch)

I wonder why the update step needs so much memory. Could you please help me? Thanks a lot! By the way, my DGL version is dgl-cu102 (I don't know whether this version difference causes the error).
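A rough sanity check on the allocation size, assuming the index_select in msg_func gathers one n_hidden x n_hidden float32 weight matrix per edge (that is what the traceback suggests, but it is an assumption, not code from the repository):

# Back-of-the-envelope arithmetic only, not repository code.
# Assumption: the per-edge temporary has shape (num_edges, n_hidden, n_hidden).
n_hidden = 200          # --n-hidden from the command above
bytes_per_elem = 4      # float32
alloc_gib = 11.06       # size reported in the RuntimeError

num_edges = alloc_gib * 2**30 / bytes_per_elem / (n_hidden * n_hidden)
print(f"~{num_edges:,.0f} edges in the batched graph")  # roughly 74,000

Under that assumption, the batched graph holds on the order of 74,000 edges, and giving each edge its own 200x200 weight matrix already exceeds an 11 GiB card.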

woojeongjin commented 3 years ago

Could you try using a smaller batch size? Also, how much GPU memory does your system have?
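If it helps, the total GPU memory can be read directly from the standard torch.cuda API (this is generic PyTorch, not RE-Net code):

import torch

props = torch.cuda.get_device_properties(0)  # device index 0; adjust to the GPU in use
print(props.name, f"{props.total_memory / 2**30:.1f} GiB")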

xxxiaol commented 3 years ago

Thanks for your reply! But even small batch sizes like 16 don't work. This may be due to the DGL and CUDA versions, as the code runs smoothly in another environment with CUDA 11.2.

hhdo commented 2 years ago


Hello! Do you remember which DGL and torch versions you used with CUDA 11.2? When I ran with CUDA 11.1 and torch 1.6-1.8 on a Tesla A100, some of the code errored out (in both pretrain and train). Did you have to fix any code to run with CUDA 11.*? Thank you a lot~
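For reference, the exact versions in question can be reported with standard attributes (generic torch/DGL introspection, not RE-Net code):

import torch
import dgl

print("torch:", torch.__version__)
print("dgl:", dgl.__version__)
print("CUDA used by torch:", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0))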

woojeongjin commented 2 years ago

Hi! No, I haven't tested with CUDA 11. Please run the code with CUDA 10.1; that is the version that was tested.
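For anyone setting up the tested environment: DGL publishes CUDA-specific wheels, so a CUDA 10.1 build can be installed with a command along these lines (dgl-cu101 is the CUDA 10.1 counterpart of the dgl-cu102 package mentioned above; exact version pins should follow the repository's README):

pip install dgl-cu101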