DeepGraphLearning / KnowledgeGraphEmbedding


CUDA out of memory (resolved) and method to make RotatE run faster #51

Closed · Thuy-g closed this 2 years ago

Thuy-g commented 2 years ago

Thank you for developing this great work, RotatE. I'm really interested in your research.

  1. I ran your program as shown below, but it failed with "RuntimeError: CUDA out of memory". How did you debug this? When I reduced the batch size from 1024 to 256, the program ran successfully, but I don't really want to change the batch size. (A workaround that keeps the effective batch size is sketched after the log below.)

dl-box@DL-Box:~/Downloads/RotatE$ CUDA_VISIBLE_DEVICES=0 python -u codes/run.py --do_train \
    --cuda \
    --do_valid \
    --do_test \
    --data_path data/FB15k \
    --model RotatE \
    -n 256 -b 1024 -d 1000 \
    -g 24.0 -a 1.0 -adv \
    -lr 0.0001 --max_steps 150000 \
    -save models/RotatE_FB15k_0 --test_batch_size 16 -de

2021-11-07 17:21:05,436 INFO Model: RotatE
2021-11-07 17:21:05,437 INFO Data Path: data/FB15k
2021-11-07 17:21:05,437 INFO #entity: 14951
2021-11-07 17:21:05,437 INFO #relation: 1345
2021-11-07 17:21:05,892 INFO #train: 483142
2021-11-07 17:21:05,941 INFO #valid: 50000
2021-11-07 17:21:06,000 INFO #test: 59071
2021-11-07 17:21:06,202 INFO Model Parameter Configuration:
2021-11-07 17:21:06,202 INFO Parameter gamma: torch.Size([1]), require_grad = False
2021-11-07 17:21:06,202 INFO Parameter embedding_range: torch.Size([1]), require_grad = False
2021-11-07 17:21:06,202 INFO Parameter entity_embedding: torch.Size([14951, 2000]), require_grad = True
2021-11-07 17:21:06,202 INFO Parameter relation_embedding: torch.Size([1345, 1000]), require_grad = True
2021-11-07 17:21:12,102 INFO Ramdomly Initializing RotatE Model...
2021-11-07 17:21:12,102 INFO Start Training...
2021-11-07 17:21:12,102 INFO init_step = 0
2021-11-07 17:21:12,102 INFO batch_size = 1024
2021-11-07 17:21:12,102 INFO negative_adversarial_sampling = 1
2021-11-07 17:21:12,102 INFO hidden_dim = 1000
2021-11-07 17:21:12,102 INFO gamma = 24.000000
2021-11-07 17:21:12,102 INFO negative_adversarial_sampling = True
2021-11-07 17:21:12,102 INFO adversarial_temperature = 1.000000
2021-11-07 17:21:12,102 INFO learning_rate = 0
Traceback (most recent call last):
  File "codes/run.py", line 361, in <module>
    main(parse_args())
  File "codes/run.py", line 305, in main
    log = kge_model.train_step(kge_model, optimizer, train_iterator, args)
  File "/home/dl-box/Downloads/RotatE/codes/model.py", line 300, in train_step
    loss.backward()
  File "/home/dl-box/.local/lib/python3.6/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/dl-box/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 1.95 GiB (GPU 0; 10.92 GiB total capacity; 6.11 GiB already allocated; 866.06 MiB free; 7.97 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
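One generic PyTorch workaround that keeps the effective batch size at 1024 is gradient accumulation: split each batch into smaller chunks, call backward on each chunk, and only then take the optimizer step, so peak activation memory matches the chunk size. A minimal sketch of the idea (not the repository's actual train_step; model, optimizer, and batch here are placeholders, and the loss is a stand-in for the real positive/negative loss) is below.

# Gradient-accumulation sketch (generic PyTorch, not the repo's train_step).
# Peak activation memory matches the chunk size, while the optimizer update
# still uses gradients averaged over the full batch.
import torch

def train_step_accumulated(model, optimizer, batch, accum_steps=4):
    optimizer.zero_grad()
    for chunk in torch.chunk(batch, accum_steps, dim=0):
        loss = model(chunk).mean()       # placeholder loss; the real step combines positive/negative scores
        (loss / accum_steps).backward()  # scale so accumulated gradients equal the full-batch average
    optimizer.step()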


  2. I ran the command "bash run.sh train ComplEx FB15k 0 0 1024 256 1000 500.0 1.0 0.001 150000 16 -de -dr -r 0.000002" (with the FB15k dataset), and your program ran successfully on my Ubuntu server:

dl-box@DL-Box:~/Downloads/RotatE$ bash run.sh train ComplEx FB15k 0 0 1024 256 1000 500.0 1.0 0.001 150000 16 -de -dr -r 0.000002
1.10.0+cu102
Start Training......
2021-11-08 04:46:49,552 INFO Model: ComplEx
2021-11-08 04:46:49,552 INFO Data Path: data/FB15k
2021-11-08 04:46:49,552 INFO #entity: 14951
2021-11-08 04:46:49,552 INFO #relation: 1345
2021-11-08 04:46:50,009 INFO #train: 483142
2021-11-08 04:46:50,058 INFO #valid: 50000
2021-11-08 04:46:50,120 INFO #test: 59071
2021-11-08 04:46:50,336 INFO Model Parameter Configuration:
2021-11-08 04:46:50,336 INFO Parameter gamma: torch.Size([1]), require_grad = False
2021-11-08 04:46:50,336 INFO Parameter embedding_range: torch.Size([1]), require_grad = False
2021-11-08 04:46:50,336 INFO Parameter entity_embedding: torch.Size([14951, 2000]), require_grad = True
2021-11-08 04:46:50,336 INFO Parameter relation_embedding: torch.Size([1345, 2000]), require_grad = True
2021-11-08 04:46:56,318 INFO Ramdomly Initializing ComplEx Model...
2021-11-08 04:46:56,318 INFO Start Training...
2021-11-08 04:46:56,318 INFO init_step = 0
2021-11-08 04:46:56,318 INFO batch_size = 1024
2021-11-08 04:46:56,318 INFO negative_adversarial_sampling = 1
2021-11-08 04:46:56,318 INFO hidden_dim = 1000
2021-11-08 04:46:56,318 INFO gamma = 500.000000
2021-11-08 04:46:56,318 INFO negative_adversarial_sampling = True
2021-11-08 04:46:56,318 INFO adversarial_temperature = 1.000000
2021-11-08 04:46:56,318 INFO learning_rate = 0
2021-11-08 04:46:57,568 INFO Training average regularization at step 0: 2.061783
2021-11-08 04:46:57,568 INFO Training average positive_sample_loss at step 0: 0.959978
2021-11-08 04:46:57,568 INFO Training average negative_sample_loss at step 0: 2.498887
2021-11-08 04:46:57,569 INFO Training average loss at step 0: 3.791215
2021-11-08 04:46:57,569 INFO Evaluating on Valid Dataset...
2021-11-08 04:46:58,255 INFO Evaluating the model... (0/6250)
2021-11-08 04:47:40,912 INFO Evaluating the model... (1000/6250)
2021-11-08 04:48:24,477 INFO Evaluating the model... (2000/6250)
2021-11-08 04:49:08,110 INFO Evaluating the model... (3000/6250)
2021-11-08 04:49:51,826 INFO Evaluating the model... (4000/6250)
2021-11-08 04:50:35,372 INFO Evaluating the model... (5000/6250)
2021-11-08 04:51:18,479 INFO Evaluating the model... (6000/6250)
2021-11-08 04:51:29,527 INFO Valid MRR at step 0: 0.000718
2021-11-08 04:51:29,527 INFO Valid MR at step 0: 7412.979920
2021-11-08 04:51:29,527 INFO Valid HITS@1 at step 0: 0.000050
2021-11-08 04:51:29,527 INFO Valid HITS@3 at step 0: 0.000190
2021-11-08 04:51:29,527 INFO Valid HITS@10 at step 0: 0.000820
2021-11-08 04:51:44,653 INFO Training average regularization at step 100: 1.869630
2021-11-08 04:51:44,653 INFO Training average positive_sample_loss at step 100: 0.878554
2021-11-08 04:51:44,654 INFO Training average negative_sample_loss at step 100: 2.214018
2021-11-08 04:51:44,654 INFO Training average loss at step 100: 3.415917
2021-11-08 04:51:59,475 INFO Training average regularization at step 200: 1.649423
2021-11-08 04:51:59,475 INFO Training average positive_sample_loss at step 200: 0.795739
2021-11-08 04:51:59,475 INFO Training average negative_sample_loss at step 200: 1.878687
2021-11-08 04:51:59,475 INFO Training average loss at step 200: 2.986636
2021-11-08 04:52:14,330 INFO Training average regularization at step 300: 1.493370
2021-11-08 04:52:14,330 INFO Training average positive_sample_loss at step 300: 0.723991
2021-11-08 04:52:14,330 INFO Training average negative_sample_loss at step 300: 1.647611
2021-11-08 04:52:14,330 INFO Training average loss at step 300: 2.679172
2021-11-08 04:52:29,411 INFO Training average regularization at step 400: 1.364369
2021-11-08 04:52:29,411 INFO Training average positive_sample_loss at step 400: 0.668379
2021-11-08 04:52:29,411 INFO Training average negative_sample_loss at step 400: 1.480148
2021-11-08 04:52:29,411 INFO Training average loss at step 400: 2.438632
2021-11-08 04:52:44,290 INFO Training average regularization at step 500: 1.252640
2021-11-08 04:52:44,290 INFO Training average positive_sample_loss at step 500: 0.615634
2021-11-08 04:52:44,290 INFO Training average negative_sample_loss at step 500: 1.347466
2021-11-08 04:52:44,290 INFO Training average loss at step 500: 2.234190
2021-11-08 04:52:59,189 INFO Training average regularization at step 600: 1.153765
2021-11-08 04:52:59,189 INFO Training average positive_sample_loss at step 600: 0.570805
2021-11-08 04:52:59,189 INFO Training average negative_sample_loss at step 600: 1.245437
2021-11-08 04:52:59,189 INFO Training average loss at step 600: 2.061886
2021-11-08 04:53:14,166 INFO Training average regularization at step 700: 1.065076
2021-11-08 04:53:14,166 INFO Training average positive_sample_loss at step 700: 0.524925
2021-11-08 04:53:14,166 INFO Training average negative_sample_loss at step 700: 1.163066
2021-11-08 04:53:14,166 INFO Training average loss at step 700: 1.909072
2021-11-08 04:53:29,006 INFO Training average regularization at step 800: 0.984837
2021-11-08 04:53:29,006 INFO Training average positive_sample_loss at step 800: 0.489442
2021-11-08 04:53:29,006 INFO Training average negative_sample_loss at step 800: 1.097700
2021-11-08 04:53:29,006 INFO Training average loss at step 800: 1.778408
2021-11-08 04:53:43,852 INFO Training average regularization at step 900: 0.911781
2021-11-08 04:53:43,852 INFO Training average positive_sample_loss at step 900: 0.451165
2021-11-08 04:53:43,852 INFO Training average negative_sample_loss at step 900: 1.044625
2021-11-08 04:53:43,852 INFO Training average loss at step 900: 1.659676
2021-11-08 04:53:59,565 INFO Training average regularization at step 1000: 0.845027
2021-11-08 04:53:59,565 INFO Training average positive_sample_loss at step 1000: 0.363237
2021-11-08 04:53:59,565 INFO Training average negative_sample_loss at step 1000: 1.000880
2021-11-08 04:53:59,565 INFO Training average loss at step 1000: 1.527086
2021-11-08 04:54:14,571 INFO Training average regularization at step 1100: 0.783731
2021-11-08 04:54:14,571 INFO Training average positive_sample_loss at step 1100: 0.312674
2021-11-08 04:54:14,571 INFO Training average negative_sample_loss at step 1100: 0.966706
2021-11-08 04:54:14,571 INFO Training average loss at step 1100: 1.423422
2021-11-08 04:54:29,543 INFO Training average regularization at step 1200: 0.726847
2021-11-08 04:54:29,543 INFO Training average positive_sample_loss at step 1200: 0.310942
...

  3. However, before that, I ran the command "bash run.sh train RotatE wn18 0 0 512 1024 500 12.0 0.5 0.0001 80000 8 -de" (with the wn18 dataset), and it also failed with "RuntimeError: CUDA out of memory". Would you please explain why your program sometimes runs out of CUDA memory but sometimes runs successfully when I change the dataset? How did you debug this problem? (A small helper for inspecting GPU memory is sketched after the log below.)

dl-box@DL-Box:~/Downloads/RotatE$ bash run.sh train RotatE wn18 0 0 512 1024 500 12.0 0.5 0.0001 80000 8 -de
1.10.0+cu102
Start Training......
2021-11-08 04:46:15,756 INFO Model: RotatE
2021-11-08 04:46:15,756 INFO Data Path: data/wn18
2021-11-08 04:46:15,757 INFO #entity: 40943
2021-11-08 04:46:15,757 INFO #relation: 18
2021-11-08 04:46:15,886 INFO #train: 141442
2021-11-08 04:46:15,890 INFO #valid: 5000
2021-11-08 04:46:15,894 INFO #test: 5000
2021-11-08 04:46:16,147 INFO Model Parameter Configuration:
2021-11-08 04:46:16,147 INFO Parameter gamma: torch.Size([1]), require_grad = False
2021-11-08 04:46:16,147 INFO Parameter embedding_range: torch.Size([1]), require_grad = False
2021-11-08 04:46:16,147 INFO Parameter entity_embedding: torch.Size([40943, 1000]), require_grad = True
2021-11-08 04:46:16,147 INFO Parameter relation_embedding: torch.Size([18, 500]), require_grad = True
2021-11-08 04:46:19,692 INFO Ramdomly Initializing RotatE Model...
2021-11-08 04:46:19,692 INFO Start Training...
2021-11-08 04:46:19,692 INFO init_step = 0
2021-11-08 04:46:19,692 INFO batch_size = 512
2021-11-08 04:46:19,692 INFO negative_adversarial_sampling = 1
2021-11-08 04:46:19,692 INFO hidden_dim = 500
2021-11-08 04:46:19,692 INFO gamma = 12.000000
2021-11-08 04:46:19,692 INFO negative_adversarial_sampling = True
2021-11-08 04:46:19,692 INFO adversarial_temperature = 0.500000
2021-11-08 04:46:19,692 INFO learning_rate = 0
Traceback (most recent call last):
  File "codes/run.py", line 361, in <module>
    main(parse_args())
  File "codes/run.py", line 305, in main
    log = kge_model.train_step(kge_model, optimizer, train_iterator, args)
  File "/home/dl-box/Downloads/RotatE/codes/model.py", line 267, in train_step
    negative_score = model((positive_sample, negative_sample), mode=mode)
  File "/home/dl-box/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dl-box/Downloads/RotatE/codes/model.py", line 159, in forward
    score = model_func[self.model_name](head, relation, tail, mode)
  File "/home/dl-box/Downloads/RotatE/codes/model.py", line 225, in RotatE
    score = score.norm(dim = 0)
  File "/home/dl-box/.local/lib/python3.6/site-packages/torch/_tensor.py", line 442, in norm
    return torch.norm(self, p, dim, keepdim, dtype=dtype)
  File "/home/dl-box/.local/lib/python3.6/site-packages/torch/functional.py", line 1442, in norm
    return _VF.frobenius_norm(input, _dim, keepdim=keepdim)
RuntimeError: CUDA out of memory. Tried to allocate 1000.00 MiB (GPU 0; 10.92 GiB total capacity; 7.00 GiB already allocated; 22.62 MiB free; 7.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
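For debugging failures like this, one option (a sketch of my own, not part of the repository, and assuming nvidia-smi is installed) is to check how much GPU memory is free and which processes are holding it before and during a run, since the traceback only reports PyTorch's own view:

# Quick check of GPU memory before launching a run (my own helper, not part of the repo).
# Uses nvidia-smi for the whole-GPU view and torch.cuda counters for this process.
import subprocess
import torch

def report_gpu_memory():
    # Per-GPU used/total memory.
    print(subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.used,memory.total", "--format=csv"]).decode())
    # Which processes currently hold GPU memory.
    print(subprocess.check_output(
        ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory", "--format=csv"]).decode())
    # What PyTorch's caching allocator holds in this process.
    if torch.cuda.is_available():
        print("allocated: %.2f GiB, reserved: %.2f GiB" % (
            torch.cuda.memory_allocated() / 1024**3,
            torch.cuda.memory_reserved() / 1024**3))

report_gpu_memory()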

Edward-Sun commented 2 years ago

Hi Thuy,

From your error message "Tried to allocate 1000.00 MiB (GPU 0; 10.92 GiB total capacity; 7.00 GiB already allocated; 22.62 MiB free; 7.02 GiB reserved in total by PyTorch)", it seems that other programs are using your GPU memory: PyTorch failed to allocate 1 GB more even though it had only allocated about 7 GB itself, and your GPU has about 11 GB of total capacity.
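As a rough cross-check of this reasoning (my own back-of-the-envelope sketch, not the repository's code): the dominant intermediate in the negative-score computation is roughly a float32 tensor of shape (batch_size, negative_sample_size, hidden_dim), and it is about the same size in both failing runs, so whether a run fits depends mostly on what else already occupies the GPU.

# Rough size of the main (batch, neg_size, dim) float32 intermediate for the two
# failing runs (an estimate only; the exact intermediates depend on the implementation).

def intermediate_gib(batch, neg, dim, bytes_per_elem=4):
    return batch * neg * dim * bytes_per_elem / 1024**3

print("FB15k RotatE (-b 1024 -n 256 -d 1000): %.2f GiB" % intermediate_gib(1024, 256, 1000))
print("wn18  RotatE (-b 512  -n 1024 -d 500): %.2f GiB" % intermediate_gib(512, 1024, 500))
# Both come out to ~0.98 GiB (cf. "Tried to allocate 1000.00 MiB" in the wn18
# traceback), which is why the same code can fail or succeed depending on how
# much memory other processes already hold on the GPU.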

Thuy-g commented 2 years ago

Thank you very much for your reply. I switched to another GPU (by changing the value of CUDA_VISIBLE_DEVICES), and now your program runs successfully on my multi-GPU server, which has three GTX 1080 Ti GPUs. Which GPU did you use? It took about 9 hours to finish training with the FB15k dataset on my server. Is there any method to make your program run faster, e.g., letting the 3 GPUs run in parallel?
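(For reference, the standard PyTorch pattern for splitting a batch across several GPUs is torch.nn.DataParallel; a minimal, self-contained sketch is below. It is not something run.py supports out of the box, the TinyScorer model here is just a hypothetical stand-in, and adapting the idea to this codebase would still require changes in train_step/test_step because the real model's forward also takes a mode string.)

# Generic multi-GPU sketch with nn.DataParallel (a general PyTorch pattern,
# not the repository's method). DataParallel splits the batch dimension of the
# input across all visible GPUs and gathers the outputs back on GPU 0.
import torch
import torch.nn as nn

class TinyScorer(nn.Module):
    """Hypothetical stand-in for a KGE model; scores (head, relation, tail) index triples."""
    def __init__(self, n_entities=100, n_relations=10, dim=32):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)

    def forward(self, triples):  # triples: (batch, 3) LongTensor
        h, r, t = triples[:, 0], triples[:, 1], triples[:, 2]
        return -(self.ent(h) + self.rel(r) - self.ent(t)).norm(p=1, dim=-1)

model = TinyScorer()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicate the model and split each batch across GPUs
if torch.cuda.is_available():
    model = model.cuda()

triples = torch.randint(0, 10, (8, 3))
if torch.cuda.is_available():
    triples = triples.cuda()
scores = model(triples)
print(scores.shape)  # torch.Size([8])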