JiapengWu / TeMP

Temporal Message Passing Network for Temporal Knowledge Graph Completion

Question when code running #4

Closed NeuSyz closed 3 years ago

NeuSyz commented 3 years ago

When I run your code with the provided config files, some models get stuck at a certain training epoch. For example, SRGCN gets stuck at epoch 69. I hope you can check and run it. Thanks!

JiapengWu commented 3 years ago

Can you provide more details of the training logs?


NeuSyz commented 3 years ago

I ran into this problem when training SARGCN and SRGCN: memory usage keeps growing with every training epoch, and the program eventually crashes, stuck at some epoch. I was using single-GPU training (with distributed_backend left empty). Have you seen anything similar?
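
For context, `distributed_backend` here presumably refers to the PyTorch Lightning Trainer flag (pre-1.0 API); the following is a minimal, hypothetical sketch of the two configurations being discussed, not TeMP's actual launch code:

```python
# Hypothetical sketch: how a distributed_backend setting maps onto an older
# (pre-1.0) pytorch-lightning Trainer. TeMP's actual argument wiring may differ.
import pytorch_lightning as pl

# Single-GPU run with distributed_backend left unset, as in the comment above:
# trainer = pl.Trainer(gpus=1, distributed_backend=None)

# Single-GPU run with DDP enabled, as recommended in the reply below:
trainer = pl.Trainer(gpus=1, distributed_backend="ddp")
```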

JiapengWu commented 3 years ago

Hi,

Yes, this is indeed a problem. It is mostly caused by the for loops that involve GPU computation, for example lines 187-194 of DynamicRGCN.py. I removed as many for loops as possible and tried multiple memory optimization techniques, including frequent garbage collection, and the current version is the best I could do. Without further optimization, please make sure to allocate at least 120 GB of memory for training. Sorry about the inconvenience. Using the distributed_backend option is always recommended, even for single-GPU training.

Best,

Jiapeng
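
A minimal, hypothetical sketch of the failure mode and the mitigations described above; the function and tensor names are made up and this is not the actual DynamicRGCN.py code:

```python
# Illustrative only: a per-timestep Python loop over GPU tensors tends to accumulate
# memory; dropping references early, detaching logged tensors, and forcing garbage
# collection are the mitigations mentioned in the reply.
import gc
import torch

def temporal_rollout(x, weight, num_steps=8):
    """Toy stand-in for a per-timestep message-passing loop over GPU tensors."""
    hidden = x
    history = []
    for _ in range(num_steps):
        hidden = torch.tanh(hidden @ weight)
        # Holding on to every intermediate tensor (plus its autograd graph) is what
        # makes memory grow steadily; detach anything that is only kept for logging.
        history.append(hidden.detach())
    return hidden, history

def train_epoch(weight, batches, lr=1e-3):
    for x in batches:
        out, _ = temporal_rollout(x, weight)
        loss = out.pow(2).mean()
        loss.backward()
        with torch.no_grad():
            weight -= lr * weight.grad   # toy SGD update
            weight.grad = None
        del out, loss                    # drop references to large tensors early
    gc.collect()                         # frequent garbage collection, as in the reply
    if torch.cuda.is_available():
        torch.cuda.empty_cache()         # return cached GPU memory to the allocator

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    w = torch.randn(64, 64, device=device, requires_grad=True)
    data = [torch.randn(128, 64, device=device) for _ in range(4)]
    train_epoch(w, data)
```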


NeuSyz commented 3 years ago

Thank you for your answer

JiapengWu commented 3 years ago

I have just uploaded some trained models; please check out the updated README.

chuhang123 commented 3 years ago

@NeuSyz How did you end up solving this? My memory usage also keeps growing during training until the process gets killed. Our school's server definitely doesn't have 120 GB of memory. I tried setting batch_size to 1 and lowering embedding_size to 32, but that didn't help either.

NeuSyz commented 3 years ago

> @NeuSyz How did you end up solving this? My memory usage also keeps growing during training until the process gets killed. Our school's server definitely doesn't have 120 GB of memory. I tried setting batch_size to 1 and lowering embedding_size to 32, but that didn't help either.

I haven't been able to solve it yet either; that loop can't be removed.