agethen / ConvLSTM-for-Caffe


Computer Crashes after 5 Hours of Training ConvLSTM Network #11

Closed. GYengera closed this issue 7 years ago.

GYengera commented 7 years ago

I am training a network with convolutional LSTM cells. After about 5 hours of training, my computer crashes. This has happened twice. The first time, the error said that GPU ID 0, the GPU I was training on, could not be found. The second time, the error was 'packet_write_wait : broken pipe'. Could this be caused by a bug in the convolutional LSTM source code?

agethen commented 7 years ago

I don't think that is related to my code. The first problem might be a hardware issue. As for the second error: I take it you were connected to a server via ssh? The message means that the connection to the server was interrupted for some reason. I would strongly suggest running Caffe on the server inside a tool like "screen", so that the process is not killed when your connection is lost.
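To illustrate the idea of keeping the training process alive after the ssh connection drops, here is a minimal Python sketch. It is not "screen" itself: it starts the command in its own session with the standard library's `subprocess.Popen(..., start_new_session=True)`, so the process is no longer tied to the login terminal. The helper name `launch_detached`, the log file name, and the solver path in the comment are assumptions made for illustration, and this works on POSIX systems only.

```python
import subprocess


def launch_detached(cmd, log_path):
    """Start `cmd` in its own session and send its output to `log_path`.

    Hypothetical helper: because the child is not part of the terminal's
    session, it is not sent the hangup signal when the ssh connection is lost.
    """
    log = open(log_path, "ab")
    try:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,   # no terminal input to wait on
            stdout=log,                 # keep the training log on disk
            stderr=subprocess.STDOUT,   # merge errors into the same log
            start_new_session=True,     # calls setsid() in the child (POSIX only)
        )
    finally:
        # The child holds its own copy of the file descriptor.
        log.close()
    return proc


# Example call (the solver path is an assumed placeholder):
# launch_detached(["caffe", "train", "--solver=solver.prototxt"], "train.log")
```

Unlike "screen", this does not let you reattach to the running process; you would follow progress by reading the log file instead.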

GYengera commented 7 years ago

It was strange. The code works perfectly fine on one server, while on another it keeps crashing. I have no idea what the problem is, but it probably has nothing to do with your code. Thanks for your help, and for the suggestion too.