Graph4KG ComplEx runs out of memory on the validation set

xiehuanyi commented 1 year ago

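I am running the RotatE model on the OpenBG500 dataset with a 32 GB V100 on AI Studio. Training works fine, but evaluation runs out of memory no matter what I set the batch size to.

My command: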

!python -u train.py --model_name RotatE \
                    --data_name  OpenBG500\
                    --data_path  /home/aistudio/data/\
                    --save_path /home/aistudio/result/Rotate --max_steps 1\
                    --batch_size 1 --log_interval 1000 --eval_interval 20000 --reg_coef 1e-7 --reg_norm 3 \
                    --neg_sample_size 256 --neg_sample_type 'chunk' --embed_dim 200 --gamma 12.0 --lr 0.018 --optimizer adagrad -adv \
                    --num_workers 2 --num_epoch 30 --print_on_screen --filter_eval --neg_deg_sample --valid

Because it kept failing, I set max_steps to 1. Training itself runs normally, but evaluation hits OOM:

----------------------------------------
        Device Setting        
----------------------------------------
 Entity   embedding place: gpu
 Relation embedding place: gpu
----------------------------------------
----------------------------------------
       Embedding Setting      
----------------------------------------
 Entity   embedding dimension: 400
 Relation embedding dimension: 200
----------------------------------------
2022-12-03 20:54:31,717 INFO     seed                :0
2022-12-03 20:54:31,718 INFO     data_path           :/home/aistudio/data/
2022-12-03 20:54:31,718 INFO     save_path           :/home/aistudio/result/Rotate/rotate_OpenBG500_d_200_g_12.0_e_gpu_r_gpu_l_Logsigmoid_lr_0.018_0.1_KGE
2022-12-03 20:54:31,718 INFO     init_from_ckpt      :None
2022-12-03 20:54:31,718 INFO     data_name           :OpenBG500
2022-12-03 20:54:31,718 INFO     use_dict            :False
2022-12-03 20:54:31,718 INFO     kv_mode             :False
2022-12-03 20:54:31,718 INFO     batch_size          :1
2022-12-03 20:54:31,718 INFO     test_batch_size     :16
2022-12-03 20:54:31,718 INFO     neg_sample_size     :256
2022-12-03 20:54:31,718 INFO     filter_eval         :True
2022-12-03 20:54:31,718 INFO     model_name          :rotate
2022-12-03 20:54:31,718 INFO     embed_dim           :200
2022-12-03 20:54:31,718 INFO     reg_coef            :1e-07
2022-12-03 20:54:31,718 INFO     loss_type           :Logsigmoid
2022-12-03 20:54:31,718 INFO     max_steps           :1
2022-12-03 20:54:31,718 INFO     lr                  :0.018
2022-12-03 20:54:31,718 INFO     optimizer           :adagrad
2022-12-03 20:54:31,718 INFO     cpu_lr              :0.1
2022-12-03 20:54:31,718 INFO     cpu_optimizer       :adagrad
2022-12-03 20:54:31,719 INFO     mix_cpu_gpu         :False
2022-12-03 20:54:31,719 INFO     async_update        :False
2022-12-03 20:54:31,719 INFO     valid               :True
2022-12-03 20:54:31,719 INFO     test                :False
2022-12-03 20:54:31,719 INFO     task_name           :KGE
2022-12-03 20:54:31,719 INFO     num_workers         :2
2022-12-03 20:54:31,719 INFO     neg_sample_type     :chunk
2022-12-03 20:54:31,719 INFO     neg_deg_sample      :True
2022-12-03 20:54:31,719 INFO     neg_adversarial_sampling:True
2022-12-03 20:54:31,719 INFO     adversarial_temperature:1.0
2022-12-03 20:54:31,719 INFO     filter_sample       :False
2022-12-03 20:54:31,719 INFO     valid_percent       :1.0
2022-12-03 20:54:31,719 INFO     use_feature         :False
2022-12-03 20:54:31,719 INFO     reg_type            :norm_er
2022-12-03 20:54:31,719 INFO     reg_norm            :3
2022-12-03 20:54:31,719 INFO     weighted_loss       :False
2022-12-03 20:54:31,719 INFO     margin              :1.0
2022-12-03 20:54:31,719 INFO     pairwise            :False
2022-12-03 20:54:31,719 INFO     gamma               :12.0
2022-12-03 20:54:31,719 INFO     ote_scale           :0
2022-12-03 20:54:31,719 INFO     ote_size            :1
2022-12-03 20:54:31,719 INFO     quate_lmbda1        :0.0
2022-12-03 20:54:31,719 INFO     quate_lmbda2        :0.0
2022-12-03 20:54:31,719 INFO     num_epoch           :30
2022-12-03 20:54:31,719 INFO     scheduler_interval  :-1
2022-12-03 20:54:31,720 INFO     num_process         :1
2022-12-03 20:54:31,720 INFO     print_on_screen     :True
2022-12-03 20:54:31,720 INFO     log_interval        :1000
2022-12-03 20:54:31,720 INFO     save_interval       :-1
2022-12-03 20:54:31,720 INFO     eval_interval       :20000
2022-12-03 20:54:31,720 INFO     ent_emb_on_cpu      :False
2022-12-03 20:54:31,720 INFO     rel_emb_on_cpu      :False
2022-12-03 20:54:31,720 INFO     use_embedding_regularization:True
2022-12-03 20:54:31,720 INFO     ent_dim             :400
2022-12-03 20:54:31,720 INFO     rel_dim             :200
2022-12-03 20:54:31,720 INFO     num_chunks          :1
W1203 20:54:51.171296 20466 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W1203 20:54:51.174912 20466 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/framework.py:3983: DeprecationWarning: Op `adagrad` is executed through `append_op` under the dynamic mode, the corresponding API implementation needs to be upgraded to using `_C_ops` method.
  DeprecationWarning,
2022-12-03 20:54:54,271 INFO     [evaluation] start...
  0%|                                                   | 0/313 [00:00<?, ?it/s]terminate called after throwing an instance of 'paddle::memory::allocation::BadAlloc'
  what():  

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   multiply_ad_func(paddle::experimental::Tensor const&, paddle::experimental::Tensor const&)
1   paddle::experimental::multiply(paddle::experimental::Tensor const&, paddle::experimental::Tensor const&)
2   void phi::MultiplyRawKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor const&, int, phi::DenseTensor*)
3   float* phi::DeviceContext::Alloc<float>(phi::TensorBase*, unsigned long, bool) const
4   phi::DeviceContext::Impl::Alloc(phi::TensorBase*, phi::Place const&, paddle::experimental::DataType, unsigned long, bool) const
5   phi::DenseTensor::AllocateFrom(phi::Allocator*, paddle::experimental::DataType, unsigned long)
6   paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
7   paddle::memory::allocation::Allocator::Allocate(unsigned long)
8   paddle::memory::allocation::Allocator::Allocate(unsigned long)
9   paddle::memory::allocation::Allocator::Allocate(unsigned long)
10  paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
11  std::string phi::enforce::GetCompleteTraceBackString<std::string >(std::string&&, char const*, int)
12  phi::enforce::GetCurrentTraceBackString[abi:cxx11](bool)

----------------------
Error Message Summary:
----------------------
ResourceExhaustedError: 

Out of memory error on GPU 0. Cannot allocate 2.977216GB memory on GPU 0, 29.256836GB memory has been allocated and available memory is only 2.491699GB.

Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model. 
If the above ways do not solve the out of memory problem, you can try to use CUDA managed memory. The command is `export FLAGS_use_cuda_managed_memory=false`.
 (at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:95)

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::pybind::ThrowExceptionToPython(std::__exception_ptr::exception_ptr)

----------------------
Error Message Summary:
----------------------
FatalError: `Process abort signal` is detected by the operating system.
  [TimeInfo: *** Aborted at 1670072104 (unix time) try "date -d @1670072104" if you are using GNU date ***]
  [SignalInfo: *** SIGABRT (@0x3e800004ff2) received by PID 20466 (TID 0x7f2dee9e2700) from PID 20466 ***]

LemonNoel commented 1 year ago

The parameter that applies during evaluation is test_batch_size, so try lowering that one. OpenBG500 has 249,743 entities, and in my tests evaluation should take about 31.8 GB.
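To make the effect of test_batch_size concrete, here is a rough back-of-the-envelope sketch, not a measurement. It assumes that evaluation scores every entity as a candidate, that the failing multiply produces a dense float32 tensor of shape [test_batch_size, num_entities, 200] (200 being the logged embed_dim, taken here as the size of the real or imaginary half of the 400-dimensional entity embedding), and that the error message counts 1 GB as 2^30 bytes. The helper name `tensor_gib` is made up for illustration.

```python
# Rough size of one dense float32 intermediate tensor during evaluation.
BYTES_PER_FLOAT32 = 4
GIB = 1024 ** 3

NUM_ENTITIES = 249_743  # entity count quoted in this thread
HALF_DIM = 200          # embed_dim from the log; assumed size of one complex half


def tensor_gib(batch_size, num_candidates, dim, bytes_per_elem=BYTES_PER_FLOAT32):
    """Return the size in GiB of a [batch_size, num_candidates, dim] tensor."""
    return batch_size * num_candidates * dim * bytes_per_elem / GIB


if __name__ == "__main__":
    # The estimate scales linearly with test_batch_size.
    for test_batch_size in (16, 8, 4, 1):
        size = tensor_gib(test_batch_size, NUM_ENTITIES, HALF_DIM)
        print(f"test_batch_size={test_batch_size:>2}: {size:.3f} GiB per tensor")
```

With the logged test_batch_size of 16 this comes to roughly 2.98 GiB, close to the 2.977216GB allocation that failed in the log above. That suggests, without proving it, that a single intermediate of this shape is what failed to fit, and that several such intermediates are alive at once. Note that the posted command only sets --batch_size, which the log shows as separate from test_batch_size.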
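If you have modified the code, another possibility is that the input data has the wrong shape, which causes an incorrect broadcast and makes GPU memory usage blow up abnormally. Print the shapes of the inputs to the RotateScore function and check whether they are [batch_size, 1, ent_embed_dim], [batch_size, 1, rel_embed_dim], [batch_size, candidate_num, ent_embed_dim]. The file is located here.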
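A minimal sketch of that shape check follows. It works on plain shape lists (for example the value of `tensor.shape`), so it does not depend on Paddle. The function name `check_score_input_shapes` and the argument names `query_ent`, `rel`, and `candidates` are hypothetical labels for the three inputs listed above; they are not taken from the Graph4KG source.

```python
def check_score_input_shapes(query_ent, rel, candidates):
    """Compare three shapes against the layout described in the comment above.

    Expected layout:
      query_ent:  [batch_size, 1, ent_embed_dim]
      rel:        [batch_size, 1, rel_embed_dim]
      candidates: [batch_size, candidate_num, ent_embed_dim]
    Returns a list of problem descriptions; an empty list means no mismatch.
    """
    shapes = {"query_ent": query_ent, "rel": rel, "candidates": candidates}
    problems = [
        f"{name} should be 3-D, got {list(shape)}"
        for name, shape in shapes.items()
        if len(shape) != 3
    ]
    if problems:
        return problems  # the remaining checks need three dimensions

    batch_size = query_ent[0]
    if rel[0] != batch_size or candidates[0] != batch_size:
        problems.append("batch dimension differs between the inputs")
    if query_ent[1] != 1:
        problems.append(f"query_ent dim 1 should be 1, got {query_ent[1]}")
    if rel[1] != 1:
        problems.append(f"rel dim 1 should be 1, got {rel[1]}")
    if query_ent[2] != candidates[2]:
        problems.append("entity embedding dims differ between query_ent and candidates")
    return problems


if __name__ == "__main__":
    # Shapes matching the logged config: ent_dim 400, rel_dim 200, batch 16.
    print(check_score_input_shapes([16, 1, 400], [16, 1, 200], [16, 249743, 400]))
    # A query tensor that is not [B, 1, D] would broadcast into something larger.
    print(check_score_input_shapes([16, 16, 400], [16, 1, 200], [16, 249743, 400]))
```

Printing the real shapes just before the score computation and passing them to a check like this shows whether a middle dimension other than 1 is inflating the broadcast result.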

xiehuanyi commented 1 year ago

Thanks, I'll give it a try.


PaddlePaddle / PGL

Graph4KG ComplEx runs out of memory on the validation set #505