ljn0320 opened this issue 3 years ago
Hi ljn,
I’m sorry about your trouble running the code. Here are my replies to your questions / requests:
I hope they will address your issues.
Best,
Paul
On May 7, 2021, at 9:59 AM, ljn0320 wrote:
Hello, I'm very interested in your paper; there is a lot in it worth studying. I deployed the code following your readme, but I keep getting errors. My questions: 1. Does it have to be run through git? 2. Are the hardware requirements large? Our server with 120 GB doesn't seem to be enough; could you tell me the hardware requirements for running the code? 3. If we can't train it on our side, could you share a trained model so we can reproduce and study it? Really looking forward to your reply!!!
— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: https://github.com/JiapengWu/TeMP/issues/7
Thank you for your reply; looking forward to the model you upload!
I have just uploaded the model, please check out the updated readme.
OK, thank you very much!!
Hi, Thanks for your interesting work. I tried to reproduce your results but encountered the following memory issue:
Test tube created git tag: tt_BiGRRGCN-icews05-15-complex-5-0.1-time-embed-only-last-layer-no-dropout-not-learnable-score-ensemble-without-impute_v202105181830
gpu available: True, used: True
VISIBLE GPUS: 0
Traceback (most recent call last):
File "main.py", line 139, in
I tried the command "ulimit -SHn 51200", but it still did not work. I use a single Titan X (Pascal) and the same environment as described in the ReadMe file for training. Does the GPU device satisfy the computation requirement? How can I fix it? By decreasing the batch size or embedding size? Looking forward to your reply. Thanks in advance.
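As an aside on the `ulimit -SHn 51200` workaround mentioned above (my own sketch, not from the thread): the same soft open-file limit can be raised per-process from Python with the stdlib `resource` module. Note that the traceback's actual failure is opening a POSIX shared-memory object while torch pickles tensors between DDP worker processes, so the file-descriptor limit is only one of the resources that can run out (`/dev/shm` capacity is another).

```python
import resource

# What "ulimit -SHn 51200" does, from inside the process: raise the soft
# open-file limit. Raising the soft limit up to the hard limit needs no
# privileges; only raising the hard limit does.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 51200 if hard == resource.RLIM_INFINITY else min(51200, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
print("open-file soft limit now:", resource.getrlimit(resource.RLIMIT_NOFILE)[0])
```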
Hi soledad921,
Are you using ddp training?
On May 18, 2021, at 12:47 PM, soledad921 wrote:
Hi, Thanks for your interesting work. I tried to reproduce your results but encountered the following memory issue:
Test tube created git tag: tt_BiGRRGCN-icews05-15-complex-5-0.1-time-embed-only-last-layer-no-dropout-not-learnable-score-ensemble-without-impute_v202105181830
gpu available: True, used: True
VISIBLE GPUS: 0
Traceback (most recent call last):
File "main.py", line 139, in
File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 746, in fit
File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn
File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/multiprocessing/process.py", line 105, in start
File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 333, in reduce_storage
RuntimeError: unable to open shared memory object in read-write mode
I tried the command "ulimit -SHn 51200", but it still did not work. I use a single Titan X (Pascal) for training. Does the GPU device satisfy the computation requirement? How can I fix it? By decreasing the batch size or embedding size? Look forward to your reply. Thanks in advance.
ddp training
I just followed the instruction and used the same enviroment
What's the command that you are running?
What's the command that you are running?
I tried:
python -u main.py -c grid/icews15/config_bigrrgcn.json --rec-only-last-layer --use-time-embedding --post-ensemble
python -u main.py -c grid/icews15/config_bisargcn.json --rec-only-last-layer --use-time-embedding --post-ensemble
Please try to set the “distributed_backend” flag to “None” in the corresponding config.json, and rerun the command. Let me know if the same error occurs.
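For concreteness, a minimal sketch of that edit (only the "distributed_backend" key name comes from this thread; the helper function and the assumption that the config is flat JSON are mine):

```python
import json

def disable_ddp(config_path):
    """Set the "distributed_backend" flag to None (serialized as JSON null)
    in a TeMP-style config file, e.g. grid/icews15/config_bigrrgcn.json.
    Hypothetical helper; assumes a flat JSON object with that key."""
    with open(config_path) as f:
        cfg = json.load(f)
    cfg["distributed_backend"] = None  # null -> single-process, single-GPU training
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)
    return cfg
```

After the edit, rerunning the same `python -u main.py -c ...` command should pick up the non-DDP backend.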
On May 18, 2021, at 12:53 PM, Jiapeng Wu wrote:
What's the command that you are running?
It works now. It seems that I should ban the DDP setting when I use a single GPU device...Thanks a lot.
No problem. I remember ICEWS05-15 might be the only dataset with this glitch; DDP works well for the other two datasets, even when you're using one GPU.
Hello, I ran into some problems while configuring the environment. Did you configure yours according to the ReadMe? I can't find dgl version 0.4.1, and after that there are many errors. What version of dgl are you using?