JiapengWu / TeMP

Temporal Message Passing Network for Temporal Knowledge Graph Completion

Computation requirement and request for uploading the trained models #7

Open ljn0320 opened 3 years ago

ljn0320 commented 3 years ago

Hello, I am very interested in your paper; there is a lot in it worth studying. I deployed the code following your readme, but I keep getting errors. I have the following questions:

  1. Does the code have to be run through git?
  2. Are the hardware requirements large? A server with 120 GB is still not quite enough. Could you tell me the hardware requirements for running the code?
  3. If I cannot train the model on my side, could you share a trained model so I can reproduce and study the results?

I look forward to your reply!

JiapengWu commented 3 years ago

Hi ljn,

I’m sorry about your trouble running the code. Here are my replies to your questions / requests:

  1. I’m not sure what it means to “run the code using git”. It should be executed on your local machine or some remote cluster after downloading the code from this repo.
  2. Yes, the computation requirement is significant. I will make sure to include my script for Slurm job submission (a rough sketch of what such a script could look like is shown right after this list).
  3. I will upload the trained model this weekend to a google drive folder.
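
In the meantime, here is a minimal sketch of what such a Slurm submission script could look like. The resource numbers, module name, and conda environment name are placeholders and assumptions, not the actual script from this repo; the training command follows the form used elsewhere in this thread.

```bash
#!/bin/bash
#SBATCH --gres=gpu:1          # number of GPUs to request; adjust to your cluster
#SBATCH --mem=128G            # placeholder memory request; adjust to the dataset
#SBATCH --time=48:00:00       # placeholder wall-clock limit
#SBATCH --cpus-per-task=4

# Placeholder module / conda environment names -- replace with your cluster's setup.
module load cuda
source activate TeMP

# Example training command (same form as the commands quoted later in this thread).
python -u main.py -c grid/icews15/config_bigrrgcn.json \
    --rec-only-last-layer --use-time-embedding --post-ensemble
```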

I hope they will address your issues.

Best,

Paul


ljn0320 commented 3 years ago

Thank you for your reply; I look forward to the uploaded model.

JiapengWu commented 3 years ago

I have just uploaded the model; please check out the updated readme.

ljn0320 commented 3 years ago

OK, thank you very much!!

soledad921 commented 3 years ago

Hi, Thanks for your interesting work. I tried to reproduce your results but encountered the following memory issue:

Test tube created git tag: tt_BiGRRGCN-icews05-15-complex-5-0.1-time-embed-only-last-layer-no-dropout-not-learnable-score-ensemble-without-impute_v202105181830
gpu available: True, used: True
VISIBLE GPUS: 0
Traceback (most recent call last):
  File "main.py", line 139, in <module>
  File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 746, in fit
  File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn
  File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/multiprocessing/process.py", line 105, in start
  File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
  File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
  File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
  File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
  File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
  File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 333, in reduce_storage
RuntimeError: unable to open shared memory object in read-write mode

I tried the command "ulimit -SHn 51200", but it still did not work. I use a single Titan X (Pascal) and the same environment as described in the README file for training. Does this GPU satisfy the computation requirement? How can I fix this, by decreasing the batch size or the embedding size? I look forward to your reply. Thanks in advance.

JiapengWu commented 3 years ago

Hi soledad921,

Are you using ddp training?

On May 18, 2021, at 12:47 PM, soledad921 @.**@.>> wrote:

Hi, Thanks for your interesting work. I tried to reproduce your results but encountered the following memory issue:

Test tube created git tag: tt_BiGRRGCN-icews05-15-complex-5-0.1-time-embed-only-last-layer-no-dropout-not-learnable-score-ensemble-without-impute_v202105181830 gpu available: True, used: True VISIBLE GPUS: 0 Traceback (most recent call last): File "main.py", line 139, in File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 746, in fit File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/multiprocessing/process.py", line 105, in start File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/multiprocessing/context.py", line 284, in _Popen File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in init File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/multiprocessing/popen_fork.py", line 19, in init File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/multiprocessing/reduction.py", line 60, in dump File "/data/chengjin/anaconda3/envs/TeMP/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 333, in reduce_storage RuntimeError: unable to open shared memory object in read-write mode

I tried the command "ulimit -SHn 51200", but it still did not work. I use a single Titan X (Pascal) for training. Does the GPU device satisfy the computation requirement? How can I fix it? By decreasing the batch size or embedding size? Look forward to your reply. Thanks in advance.

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/JiapengWu/TeMP/issues/7#issuecomment-843349315, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AFN2LV2LORXZVXZBQ7DBFDLTOKK23ANCNFSM44KEASHA.

soledad921 commented 3 years ago

DDP training.

I just followed the instructions and used the same environment.

JiapengWu commented 3 years ago

What's the command that you are running?

soledad921 commented 3 years ago

What's the command that you are running?

I tried:

python -u main.py -c grid/icews15/config_bigrrgcn.json --rec-only-last-layer --use-time-embedding --post-ensemble
python -u main.py -c grid/icews15/config_bisargcn.json --rec-only-last-layer --use-time-embedding --post-ensemble

JiapengWu commented 3 years ago

Please try to set the “distributed_backend” flag to “None” in the corresponding config.json, and rerun the command. Let me know if the same error occurs.
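
For example, one way to make that change from the command line (a rough sketch only; this assumes jq is available, and editing the JSON by hand works just as well):

```bash
# Sketch only: the key name comes from the instruction above. JSON's null is
# loaded as Python's None; if the code instead expects the literal string
# "None", substitute '"None"' for null below.
CONFIG=grid/icews15/config_bigrrgcn.json
jq '.distributed_backend = null' "$CONFIG" > tmp.json && mv tmp.json "$CONFIG"
```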


soledad921 commented 3 years ago

Please try to set the “distributed_backend” flag to “None” in the corresponding config.json, and rerun the command. Let me know if the same error occurs.

It works now. It seems I should disable the DDP setting when I use a single GPU device... Thanks a lot.

JiapengWu commented 3 years ago

No problem. As I recall, ICEWS05-15 might be the only dataset that has this glitch; DDP works well for the other two datasets, even when you're using a single GPU.

ljn0320 commented 2 years ago

Please try to set the “distributed_backend” flag to “None” in the corresponding config.json, and rerun the command. Let me know if the same error occurs.

It works now. It seems I should disable the DDP setting when I use a single GPU device... Thanks a lot.

Hello, I am running into some problems when configuring the environment. Did you set up your environment exactly as described in the README? I can't find dgl version 0.4.1, and then there are many errors. Which version of dgl are you using?