Hi Thuy,
From your error message "Tried to allocate 1000.00 MiB (GPU 0; 10.92 GiB total capacity; 7.00 GiB already allocated; 22.62 MiB free; 7.02 GiB reserved in total by PyTorch)", it seems that other programs are using your GPU memory. PyTorch failed to allocate 1000 MiB more even though it had reserved only about 7 GiB of the GPU's 10.92 GiB total capacity, and only 22.62 MiB was free. That leaves roughly 3.9 GiB held by something other than this PyTorch process.
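The arithmetic behind that reading can be checked with a short sketch. It only parses the numbers quoted in the error message; the helper name `field` is made up for illustration, and the leftover figure also includes any CUDA context overhead, so treat it as an estimate rather than an exact measurement of other programs' usage.

```python
import re

# Error text quoted verbatim from the message above.
msg = ("Tried to allocate 1000.00 MiB (GPU 0; 10.92 GiB total capacity; "
       "7.00 GiB already allocated; 22.62 MiB free; "
       "7.02 GiB reserved in total by PyTorch)")

UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}

def field(label):
    """Return the number that precedes `label` in msg, converted to GiB."""
    value, unit = re.search(r"([\d.]+) (MiB|GiB) " + label, msg).groups()
    return float(value) * UNIT_TO_GIB[unit]

total = field("total capacity")
free = field("free")
reserved = field("reserved in total")

# Memory that is neither free nor reserved by this PyTorch process.
held_elsewhere = total - reserved - free
print(f"Not free and not reserved by this process: {held_elsewhere:.2f} GiB")
```

If that number is large, running `nvidia-smi` on the server should show which other processes occupy the GPU.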
Thank you very much for your reply. I switched to another GPU (by changing the value of CUDA_VISIBLE_DEVICES), and now your program runs successfully on my multi-GPU server, which has three GTX 1080 Ti GPUs. Which GPU did you use? It took about 9 hours to run your program on the FB15k dataset on my GPU server. Is there any way to make it run faster, e.g., by running the 3 GPUs in parallel?
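For anyone hitting the same problem, here is a minimal sketch of choosing the value for CUDA_VISIBLE_DEVICES automatically by picking the GPU with the most free memory. It assumes `nvidia-smi` is on the PATH. The function names and the sample output are hypothetical and only illustrate the parsing; this does not make training use several GPUs in parallel.

```python
import subprocess

# nvidia-smi query that prints one "index, free MiB" line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=index,memory.free",
         "--format=csv,noheader,nounits"]

def query_gpus():
    """Run nvidia-smi and return its CSV output (needs an NVIDIA driver)."""
    return subprocess.check_output(QUERY, universal_newlines=True)

def parse_free_mib(csv_text):
    """Map GPU index -> free memory in MiB from the CSV text."""
    free = {}
    for line in csv_text.strip().splitlines():
        index, free_mib = (int(x) for x in line.split(","))
        free[index] = free_mib
    return free

def pick_gpu(csv_text):
    """Return the index of the GPU with the most free memory."""
    free = parse_free_mib(csv_text)
    return max(free, key=free.get)

# Hypothetical output for a 3-GPU server; the values are invented.
sample = "0, 22\n1, 11166\n2, 6878\n"
best = pick_gpu(sample)
print(f"CUDA_VISIBLE_DEVICES={best}")
```

On a real server, `pick_gpu(query_gpus())` would replace the sample text, and the resulting index can be placed in front of the training command in the same way as `CUDA_VISIBLE_DEVICES=0` in the commands in this thread.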
Thank you for developing RotatE; it is great work. I'm really interested in your research.
dl-box@DL-Box:~/Downloads/RotatE$ CUDA_VISIBLE_DEVICES=0 python -u codes/run.py --do_train \
dl-box@DL-Box:~/Downloads/RotatE$ bash run.sh train ComplEx FB15k 0 0 1024 256 1000 500.0 1.0 0.001 150000 16 -de -dr -r 0.000002
1.10.0+cu102
Start Training......
2021-11-08 04:46:49,552 INFO Model: ComplEx
2021-11-08 04:46:49,552 INFO Data Path: data/FB15k
2021-11-08 04:46:49,552 INFO #entity: 14951
2021-11-08 04:46:49,552 INFO #relation: 1345
2021-11-08 04:46:50,009 INFO #train: 483142
2021-11-08 04:46:50,058 INFO #valid: 50000
2021-11-08 04:46:50,120 INFO #test: 59071
2021-11-08 04:46:50,336 INFO Model Parameter Configuration:
2021-11-08 04:46:50,336 INFO Parameter gamma: torch.Size([1]), require_grad = False
2021-11-08 04:46:50,336 INFO Parameter embedding_range: torch.Size([1]), require_grad = False
2021-11-08 04:46:50,336 INFO Parameter entity_embedding: torch.Size([14951, 2000]), require_grad = True
2021-11-08 04:46:50,336 INFO Parameter relation_embedding: torch.Size([1345, 2000]), require_grad = True
2021-11-08 04:46:56,318 INFO Ramdomly Initializing ComplEx Model...
2021-11-08 04:46:56,318 INFO Start Training...
2021-11-08 04:46:56,318 INFO init_step = 0
2021-11-08 04:46:56,318 INFO batch_size = 1024
2021-11-08 04:46:56,318 INFO negative_adversarial_sampling = 1
2021-11-08 04:46:56,318 INFO hidden_dim = 1000
2021-11-08 04:46:56,318 INFO gamma = 500.000000
2021-11-08 04:46:56,318 INFO negative_adversarial_sampling = True
2021-11-08 04:46:56,318 INFO adversarial_temperature = 1.000000
2021-11-08 04:46:56,318 INFO learning_rate = 0
2021-11-08 04:46:57,568 INFO Training average regularization at step 0: 2.061783
2021-11-08 04:46:57,568 INFO Training average positive_sample_loss at step 0: 0.959978
2021-11-08 04:46:57,568 INFO Training average negative_sample_loss at step 0: 2.498887
2021-11-08 04:46:57,569 INFO Training average loss at step 0: 3.791215
2021-11-08 04:46:57,569 INFO Evaluating on Valid Dataset...
2021-11-08 04:46:58,255 INFO Evaluating the model... (0/6250)
2021-11-08 04:47:40,912 INFO Evaluating the model... (1000/6250)
2021-11-08 04:48:24,477 INFO Evaluating the model... (2000/6250)
2021-11-08 04:49:08,110 INFO Evaluating the model... (3000/6250)
2021-11-08 04:49:51,826 INFO Evaluating the model... (4000/6250)
2021-11-08 04:50:35,372 INFO Evaluating the model... (5000/6250)
2021-11-08 04:51:18,479 INFO Evaluating the model... (6000/6250)
2021-11-08 04:51:29,527 INFO Valid MRR at step 0: 0.000718
2021-11-08 04:51:29,527 INFO Valid MR at step 0: 7412.979920
2021-11-08 04:51:29,527 INFO Valid HITS@1 at step 0: 0.000050
2021-11-08 04:51:29,527 INFO Valid HITS@3 at step 0: 0.000190
2021-11-08 04:51:29,527 INFO Valid HITS@10 at step 0: 0.000820
2021-11-08 04:51:44,653 INFO Training average regularization at step 100: 1.869630
2021-11-08 04:51:44,653 INFO Training average positive_sample_loss at step 100: 0.878554
2021-11-08 04:51:44,654 INFO Training average negative_sample_loss at step 100: 2.214018
2021-11-08 04:51:44,654 INFO Training average loss at step 100: 3.415917
2021-11-08 04:51:59,475 INFO Training average regularization at step 200: 1.649423
2021-11-08 04:51:59,475 INFO Training average positive_sample_loss at step 200: 0.795739
2021-11-08 04:51:59,475 INFO Training average negative_sample_loss at step 200: 1.878687
2021-11-08 04:51:59,475 INFO Training average loss at step 200: 2.986636
2021-11-08 04:52:14,330 INFO Training average regularization at step 300: 1.493370
2021-11-08 04:52:14,330 INFO Training average positive_sample_loss at step 300: 0.723991
2021-11-08 04:52:14,330 INFO Training average negative_sample_loss at step 300: 1.647611
2021-11-08 04:52:14,330 INFO Training average loss at step 300: 2.679172
2021-11-08 04:52:29,411 INFO Training average regularization at step 400: 1.364369
2021-11-08 04:52:29,411 INFO Training average positive_sample_loss at step 400: 0.668379
2021-11-08 04:52:29,411 INFO Training average negative_sample_loss at step 400: 1.480148
2021-11-08 04:52:29,411 INFO Training average loss at step 400: 2.438632
2021-11-08 04:52:44,290 INFO Training average regularization at step 500: 1.252640
2021-11-08 04:52:44,290 INFO Training average positive_sample_loss at step 500: 0.615634
2021-11-08 04:52:44,290 INFO Training average negative_sample_loss at step 500: 1.347466
2021-11-08 04:52:44,290 INFO Training average loss at step 500: 2.234190
2021-11-08 04:52:59,189 INFO Training average regularization at step 600: 1.153765
2021-11-08 04:52:59,189 INFO Training average positive_sample_loss at step 600: 0.570805
2021-11-08 04:52:59,189 INFO Training average negative_sample_loss at step 600: 1.245437
2021-11-08 04:52:59,189 INFO Training average loss at step 600: 2.061886
2021-11-08 04:53:14,166 INFO Training average regularization at step 700: 1.065076
2021-11-08 04:53:14,166 INFO Training average positive_sample_loss at step 700: 0.524925
2021-11-08 04:53:14,166 INFO Training average negative_sample_loss at step 700: 1.163066
2021-11-08 04:53:14,166 INFO Training average loss at step 700: 1.909072
2021-11-08 04:53:29,006 INFO Training average regularization at step 800: 0.984837
2021-11-08 04:53:29,006 INFO Training average positive_sample_loss at step 800: 0.489442
2021-11-08 04:53:29,006 INFO Training average negative_sample_loss at step 800: 1.097700
2021-11-08 04:53:29,006 INFO Training average loss at step 800: 1.778408
2021-11-08 04:53:43,852 INFO Training average regularization at step 900: 0.911781
2021-11-08 04:53:43,852 INFO Training average positive_sample_loss at step 900: 0.451165
2021-11-08 04:53:43,852 INFO Training average negative_sample_loss at step 900: 1.044625
2021-11-08 04:53:43,852 INFO Training average loss at step 900: 1.659676
2021-11-08 04:53:59,565 INFO Training average regularization at step 1000: 0.845027
2021-11-08 04:53:59,565 INFO Training average positive_sample_loss at step 1000: 0.363237
2021-11-08 04:53:59,565 INFO Training average negative_sample_loss at step 1000: 1.000880
2021-11-08 04:53:59,565 INFO Training average loss at step 1000: 1.527086
2021-11-08 04:54:14,571 INFO Training average regularization at step 1100: 0.783731
2021-11-08 04:54:14,571 INFO Training average positive_sample_loss at step 1100: 0.312674
2021-11-08 04:54:14,571 INFO Training average negative_sample_loss at step 1100: 0.966706
2021-11-08 04:54:14,571 INFO Training average loss at step 1100: 1.423422
2021-11-08 04:54:29,543 INFO Training average regularization at step 1200: 0.726847
2021-11-08 04:54:29,543 INFO Training average positive_sample_loss at step 1200: 0.310942
...................................................................................................
dl-box@DL-Box:~/Downloads/RotatE$ bash run.sh train RotatE wn18 0 0 512 1024 500 12.0 0.5 0.0001 80000 8 -de
1.10.0+cu102
Start Training......
2021-11-08 04:46:15,756 INFO Model: RotatE
2021-11-08 04:46:15,756 INFO Data Path: data/wn18
2021-11-08 04:46:15,757 INFO #entity: 40943
2021-11-08 04:46:15,757 INFO #relation: 18
2021-11-08 04:46:15,886 INFO #train: 141442
2021-11-08 04:46:15,890 INFO #valid: 5000
2021-11-08 04:46:15,894 INFO #test: 5000
2021-11-08 04:46:16,147 INFO Model Parameter Configuration:
2021-11-08 04:46:16,147 INFO Parameter gamma: torch.Size([1]), require_grad = False
2021-11-08 04:46:16,147 INFO Parameter embedding_range: torch.Size([1]), require_grad = False
2021-11-08 04:46:16,147 INFO Parameter entity_embedding: torch.Size([40943, 1000]), require_grad = True
2021-11-08 04:46:16,147 INFO Parameter relation_embedding: torch.Size([18, 500]), require_grad = True
2021-11-08 04:46:19,692 INFO Ramdomly Initializing RotatE Model...
2021-11-08 04:46:19,692 INFO Start Training...
2021-11-08 04:46:19,692 INFO init_step = 0
2021-11-08 04:46:19,692 INFO batch_size = 512
2021-11-08 04:46:19,692 INFO negative_adversarial_sampling = 1
2021-11-08 04:46:19,692 INFO hidden_dim = 500
2021-11-08 04:46:19,692 INFO gamma = 12.000000
2021-11-08 04:46:19,692 INFO negative_adversarial_sampling = True
2021-11-08 04:46:19,692 INFO adversarial_temperature = 0.500000
2021-11-08 04:46:19,692 INFO learning_rate = 0
Traceback (most recent call last):
  File "codes/run.py", line 361, in <module>
    main(parse_args())
  File "codes/run.py", line 305, in main
    log = kge_model.train_step(kge_model, optimizer, train_iterator, args)
  File "/home/dl-box/Downloads/RotatE/codes/model.py", line 267, in train_step
    negative_score = model((positive_sample, negative_sample), mode=mode)
  File "/home/dl-box/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dl-box/Downloads/RotatE/codes/model.py", line 159, in forward
    score = model_func[self.model_name](head, relation, tail, mode)
  File "/home/dl-box/Downloads/RotatE/codes/model.py", line 225, in RotatE
    score = score.norm(dim = 0)
  File "/home/dl-box/.local/lib/python3.6/site-packages/torch/_tensor.py", line 442, in norm
    return torch.norm(self, p, dim, keepdim, dtype=dtype)
  File "/home/dl-box/.local/lib/python3.6/site-packages/torch/functional.py", line 1442, in norm
    return _VF.frobenius_norm(input, _dim, keepdim=keepdim)
RuntimeError: CUDA out of memory. Tried to allocate 1000.00 MiB (GPU 0; 10.92 GiB total capacity; 7.00 GiB already allocated; 22.62 MiB free; 7.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
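
As a rough sanity check on the 1000.00 MiB figure, the sketch below estimates the size of a single score tensor in this run. It assumes the tensor returned by `score.norm(dim = 0)` has shape (batch_size, negative_sample_size, hidden_dim) and holds float32 values, and that the 1024 in the command line is the negative sample size; neither assumption is confirmed by the log itself, which only reports batch_size and hidden_dim.

```python
batch_size = 512             # from the log: batch_size = 512
negative_sample_size = 1024  # assumption: the 1024 in the run.sh command
hidden_dim = 500             # from the log: hidden_dim = 500
bytes_per_element = 4        # assumption: float32

# Size of one (batch, negatives, hidden_dim) tensor in MiB.
elements = batch_size * negative_sample_size * hidden_dim
tensor_mib = elements * bytes_per_element / 2**20
print(tensor_mib)  # 1000.0
```

Under these assumptions one such tensor is exactly 1000 MiB, which matches the failed allocation. Halving either the batch size or the negative sample size would halve it, at the cost of changing the training hyperparameters.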