Closed arch-user-france1 closed 7 months ago
Have you tried reducing memory consumption, e.g. by lowering the batch size?
IIRC these issues sometimes happen when no memory is left; the AMD driver generally allows over-committing memory, so you may think you are using GPU memory while it is actually being paged out to host (CPU) memory.
I would start there.
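To rule this out, one option is to log VRAM usage from inside the training loop. A minimal sketch, assuming a PyTorch ROCm build (where the `torch.cuda` API is reused for AMD GPUs) — the function name `vram_report` is my own, not from the original code:

```python
# Hedged sketch: print allocated vs. reserved VRAM so that paging /
# over-commit can be spotted before the driver wedges. Assumes a
# PyTorch build with ROCm support; degrades gracefully without it.
def vram_report():
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "no GPU visible"
    alloc = torch.cuda.memory_allocated() / 2**20     # MiB actually in tensors
    reserved = torch.cuda.memory_reserved() / 2**20   # MiB held by the caching allocator
    return f"allocated: {alloc:.1f} MiB, reserved: {reserved:.1f} MiB"

print(vram_report())
```

If "reserved" stays well below the card's VRAM while the hang occurs, memory pressure is unlikely to be the cause.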
Hmm, no, I am sure it was not memory: I've got 40 GB and no more than 4 GB was used. Also, it came up pretty randomly - I've since been training with a large batch size of 128 and there was no such issue.
But while it was training it just suddenly got stuck - the python process would not finish (UGH NO GOD, while writing this on another device the display output got messed up again - everything crashed), and even after killall python
the GPU remained at 100%.
This is what just happened when I tried logging into the crashed system.
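When the GPU stays pegged after the process is killed, it can help to check whether some leftover process is still holding the compute device nodes. A hedged, Linux-only sketch (the helper `gpu_device_holders` is my own; it assumes the ROCm/DRM nodes live at /dev/kfd and /dev/dri, and needs permission to read other processes' fd tables):

```python
# Hedged sketch: after `killall python`, a zombie compute context can
# keep the GPU busy. Scan /proc for processes that still have the
# ROCm/DRM device nodes open, so they can be killed individually
# before resorting to a reboot.
import os

def gpu_device_holders():
    holders = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        fd_dir = f"/proc/{pid}/fd"
        try:
            for fd in os.listdir(fd_dir):
                target = os.readlink(os.path.join(fd_dir, fd))
                if target.startswith(("/dev/kfd", "/dev/dri")):
                    holders.append(int(pid))
                    break
        except OSError:
            continue  # process exited, or permission denied
    return holders

print(gpu_device_holders())
```

If the list is empty but the GPU still reads 100%, the driver itself has most likely wedged, which would point back at amdgpu rather than the training code.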
AMD Driver issue
Hello
I seem to have stumbled on a bug in the program - at first it all ran fine, but while I was fiddling with some parameters and restarting the training multiple times, the code suddenly crashed.
Here's what happened:
It crashed around the first epoch - probably in the middle or at the start of it.
It was training a DCGAN network - see the architecture:
I guess it will be hard for you to debug, especially since I have not provided any further information. If you would like me to try something to narrow down the bug, I would be happy to help.
It could just as well be a problem with the underlying driver, because I have seen abrupt crashes of the screen output and general glitches. I doubt the card arrived broken, because it runs fine on Windows.
Happy working and have a great time!