When I tried to train the model on multiple GPUs, the program raised the error below, and I don't know how to solve it.
Traceback (most recent call last):
  File "train2.py", line 265, in <module>
    train(args)
  File "train2.py", line 157, in train
    loss_s = model(batch_data, (s_hist, s_hist_t), (o_hist, o_hist_t), graph_dict, subject=True)
  File "/home/ws/anaconda3/envs/renet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ws/anaconda3/envs/renet/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ws/anaconda3/envs/renet/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ws/anaconda3/envs/renet/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/ws/anaconda3/envs/renet/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/ws/anaconda3/envs/renet/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/ws/anaconda3/envs/renet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ws/proj_ws/RE-Net/model.py", line 84, in forward
    reverse=reverse)
  File "/home/ws/anaconda3/envs/renet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ws/proj_ws/RE-Net/Aggregator.py", line 131, in forward
    s_len_non_zero, s_tem, r_tem, g, node_ids_graph, global_emb_list = get_sorted_s_r_embed_rgcn(s_hist, s, r, ent_embeds, graph_dict, global_emb)
  File "/home/ws/proj_ws/RE-Net/utils.py", line 222, in get_sorted_s_r_embed_rgcn
    s_hist_sorted.append(s_hist[idx])
RuntimeError: CUDA error: device-side assert triggered
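
A possible way to narrow this down, offered as a sketch rather than a confirmed fix: a device-side assert is often caused by an out-of-range index, and because CUDA kernels run asynchronously, the Python line shown in the traceback may not be the operation that actually failed. Setting the `CUDA_LAUNCH_BLOCKING` environment variable makes kernel launches synchronous so the traceback is more reliable. The helper `find_out_of_range` and the sample values below are hypothetical and not part of RE-Net; they only illustrate the kind of bounds check that could be run on the indices before they are used.

```python
import os

# Set before CUDA is initialised (safest: before importing torch) so that
# kernel launches are synchronous and the traceback points at the failing op.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def find_out_of_range(indices, size):
    """Return (position, value) pairs for indices outside [0, size).

    Hypothetical helper. Negative values are treated as invalid,
    as an embedding lookup would treat them.
    """
    return [(pos, int(i)) for pos, i in enumerate(indices) if not 0 <= int(i) < size]


# Made-up example: 5 entries are available, but one index points past the end.
bad = find_out_of_range([0, 3, 7, 4], size=5)
print(bad)
```

Running the same batch on the CPU is another option, since the CPU usually reports an out-of-range index as an ordinary Python exception that names the failing operation.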