Training Issue - Githubissues

ztrbq commented 9 months ago

Hi, thanks for your brilliant work.

I ran into this error when training on four 3090 GPUs: RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time.

I tried setting losses.backward(retain_graph=True) in engine.py, but then I got a different error: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256]] is at version 3; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Since the code does not compute multiple losses, I don't know what causes this issue. Hoping for your reply :)
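For readers hitting the second error, PyTorch's anomaly detection can help locate the in-place operation the hint refers to. The snippet below is a minimal CPU sketch on a toy tensor, not the SODFormer code; the tensor `x` and the `exp`/`add_` pair are illustrative assumptions chosen only because `exp` saves its output for the backward pass.

```python
import torch

# Toy tensor, purely illustrative (not taken from the repository).
x = torch.ones(3, requires_grad=True)

msg = ""
# Anomaly mode records the forward traceback of each operation, so a failing
# backward step can be traced to the forward call that created it.
with torch.autograd.detect_anomaly():
    y = x.exp()   # exp keeps its output to compute the gradient later
    y.add_(1)     # in-place edit changes the saved output's version
    try:
        y.sum().backward()
    except RuntimeError as err:
        msg = str(err)

print(msg)
```

Anomaly mode slows training noticeably, so it is usually enabled only while debugging.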

xhowdonewd commented 9 months ago

I had the same problem. Did you solve it?

ztrbq commented 9 months ago

> I had the same problem. Did you solve it?

I haven't solved this problem, and I've decided to give up on it; I was stuck on it for over a week. :-(

dianzl commented 7 months ago

Hello! The cause of this problem is that the gradient of the previous frames is used multiple times during the temporal information extraction process. We have modified this part of the code, and it now works. Thank you for raising this issue.
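The failure mode described above can be sketched in a few lines. This is an illustrative toy, not the actual SODFormer fix: `layer`, `memory`, `train`, and the `detach_memory` flag are hypothetical names, and detaching the carried-over features is just one common way to stop gradients from flowing back into frames whose graph has already been freed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 4)  # hypothetical stand-in for a temporal module
opt = torch.optim.SGD(layer.parameters(), lr=0.1)

def train(frames, detach_memory):
    """Run one backward pass per frame while carrying features forward."""
    memory = torch.zeros(1, 4)  # hypothetical features from previous frames
    for frame in frames:
        if detach_memory:
            # Cut the link to earlier frames so backward stays in this step.
            memory = memory.detach()
        memory = torch.tanh(layer(frame + memory))
        loss = memory.pow(2).mean()
        opt.zero_grad()
        loss.backward()  # frees the saved tensors of this step's graph
        opt.step()
    return loss.item()

frames = [torch.randn(1, 4) for _ in range(3)]

error_message = ""
try:
    # Second frame backpropagates into the first frame's freed graph.
    train(frames, detach_memory=False)
except RuntimeError as err:
    error_message = str(err)

final_loss = train(frames, detach_memory=True)
print(error_message)
print(final_loss)
```

In this sketch, passing `retain_graph=True` instead of detaching would keep the old graphs alive, but `opt.step()` updates the parameters in place between frames, which is a likely source of the "modified by an inplace operation" error reported earlier in the thread.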

ztrbq commented 7 months ago

> Hello! The cause of this problem is that the gradient of the previous frames is used multiple times during the temporal information extraction process. We have modified this part of the code, and it now works. Thank you for raising this issue.

Thanks for your reply! I've tried running this code on the PKU-SOD dataset on a single GPU, and it works fine. Thanks again for your great work.

However, when I tried to run this code on multiple GPUs (e.g., two 3090 GPUs), it seems to get stuck before loading the data. The terminal output looks like this: [screenshot]. I checked the code but cannot find the cause.

Hatins commented 4 months ago

@ztrbq Hi, it seems that we are schoolmates and are both working on event-based object detection. Could I have your number so we can get in touch?

ztrbq commented 4 months ago

> @ztrbq Hi, it seems that we are schoolmates and are both working on event-based object detection. Could I have your number so we can get in touch?

sure, feel free to contact me at bingquanzhou@icloud.com.

ztrbq commented 4 months ago

By the way, the multi-GPU run suddenly started working. I'll close this issue. Thanks again for your great work.

dianzl / SODFormer

Training Issue #1