INK-USC / RE-Net

Recurrent Event Network: Autoregressive Structure Inference over Temporal Knowledge Graphs (EMNLP 2020)
http://inklab.usc.edu/renet/

Problem training with multiple GPUs: "CUDA error: device-side assert triggered" #42

Closed: mumu0419 closed this issue 2 years ago

mumu0419 commented 3 years ago

When I tried to train the model on multiple GPUs, the program failed with the error below, and I don't know how to solve it:

Traceback (most recent call last):
  File "train2.py", line 265, in <module>
    train(args)
  File "train2.py", line 157, in train
    loss_s = model(batch_data, (s_hist, s_hist_t), (o_hist, o_hist_t), graph_dict, subject=True)
  File "/home/ws/anaconda3/envs/renet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ws/anaconda3/envs/renet/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ws/anaconda3/envs/renet/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ws/anaconda3/envs/renet/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/ws/anaconda3/envs/renet/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/ws/anaconda3/envs/renet/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/ws/anaconda3/envs/renet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ws/proj_ws/RE-Net/model.py", line 84, in forward
    reverse=reverse)
  File "/home/ws/anaconda3/envs/renet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ws/proj_ws/RE-Net/Aggregator.py", line 131, in forward
    s_len_non_zero, s_tem, r_tem, g, node_ids_graph, global_emb_list = get_sorted_s_r_embed_rgcn(s_hist, s, r, ent_embeds, graph_dict, global_emb)
  File "/home/ws/proj_ws/RE-Net/utils.py", line 222, in get_sorted_s_r_embed_rgcn
    s_hist_sorted.append(s_hist[idx])
RuntimeError: CUDA error: device-side assert triggered

woojeongjin commented 2 years ago

Hi! We haven't tested multi-GPU training. You could train on a single GPU instead!
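
For anyone landing here, a minimal sketch of one way to follow the single-GPU suggestion on a multi-GPU machine: restrict the process to one device through the `CUDA_VISIBLE_DEVICES` environment variable before CUDA is initialized. The helper name `restrict_to_single_gpu` is hypothetical and not part of RE-Net. Setting `CUDA_LAUNCH_BLOCKING=1` is optional; it makes kernel launches synchronous, which usually makes a device-side assert point at the operation that actually triggered it.

```python
import os


def restrict_to_single_gpu(gpu_index: int = 0, blocking: bool = True) -> dict:
    """Expose a single GPU to this process (hypothetical helper, not in RE-Net).

    Call this before the first CUDA call, ideally before importing torch.
    """
    # Only the listed physical device stays visible to the CUDA runtime;
    # inside the process it is then addressed as device 0.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    if blocking:
        # Synchronous kernel launches: slower, but errors surface at the
        # call that caused them instead of at a later, unrelated line.
        os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    keys = ("CUDA_VISIBLE_DEVICES", "CUDA_LAUNCH_BLOCKING")
    return {k: os.environ[k] for k in keys if k in os.environ}


settings = restrict_to_single_gpu(0)
print(settings)
```

The same effect can be had by exporting the variables in the shell before launching the training script. With only one device visible, wrapping the model in `torch.nn.DataParallel` no longer splits a batch across replicas, so the code path in the traceback above is not exercised.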