Hi Klauss,
I just sent a note to @Deep to have him check on the Theano version. In the meantime, my first instinct is that this is some sort of memory allocation problem, perhaps on the GPU. I'm suspicious because this net is likely larger than any of the others due to the large value for dense. Can you try editing final_nets.yml so that the entry for this net looks like this:
```yaml
-
  output_name: net_stf7_fea6_150e20_02_50_dense4096_3pVal
  net: net_stf7
  max_epochs: 150
  patience: 20
  min_freq: 0.2
  max_freq: 50
  valid_size: 4096
  train_size: 20480
  dense: 4096
  batch_size: 16
  validation: 6
  rate: 0.04
-
```
This drops the batch_size from the default 32 to 16, which should roughly halve the memory usage on the GPU. It will change the result a bit, since the mini-batch size changes the path of the gradient descent, but it should be close.
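To see why this helps, here is a rough, illustrative sketch of how activation memory scales with the mini-batch size. The layer widths below are made-up placeholders, not the real architecture, and parameter memory (which does not depend on batch size) is ignored:

```python
# Rough, illustrative estimate only -- the real footprint also depends on the
# architecture, Theano's allocator, and intermediate buffers it keeps around.
def activation_bytes(batch_size, units_per_layer, dtype_bytes=4):
    """Approximate memory held by layer activations for one mini-batch."""
    return batch_size * sum(units_per_layer) * dtype_bytes

# Hypothetical layer widths ending in the dense=4096 layer; not the real net.
layers = [1024, 4096, 4096, 2]
for bs in (32, 16):
    mib = activation_bytes(bs, layers) / 1024.0 ** 2
    print("batch_size=%d -> %.1f MiB of activations" % (bs, mib))
```

Since activation storage is proportional to batch_size, halving it halves that part of the footprint while leaving the learned weights untouched.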
Hi Klauss, as I put it in my comment in the previous closed issue, the differences between our setup and yours are:
python==2.7
nolearn==0.6adev
numpy==1.9.2
pandas==0.15.2
pytz==2012c
scipy==0.13.3
six==1.5.2
As for Theano, we are using 0.7.0, just like you. Since Lasagne, and hence nolearn, are mostly Python wrapped on top of Theano, I don't suspect a version mismatch issue: the wrapped C sits all the way down in Theano, and our version there is the same as yours.
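If you want to double-check your environment against the list above, a small snippet like the following (using setuptools' pkg_resources, which should already be present alongside these packages) prints the installed versions:

```python
# Print the installed versions of the relevant packages so they can be
# compared against the list above.
import pkg_resources

for name in ("Theano", "Lasagne", "nolearn", "numpy", "scipy", "pandas", "pytz", "six"):
    try:
        print("%s==%s" % (name, pkg_resources.get_distribution(name).version))
    except pkg_resources.DistributionNotFound:
        print("%s is not installed" % name)
```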
My best guess, as Tim pointed out, is a memory/swap issue. This is the largest net of all, as it was trained on the largest portion of the training dataset. If I remember correctly, you said your setup has 16 GB of RAM (half the size of ours, and hence the most likely suspect), and I am not sure how big your GPU memory is. Ours was 4 GB on each of our GTX 980s.
A segmentation fault usually occurs when a pointer tries to reference unallocated or restricted memory. So my best guess is that the C code underneath Theano failed to allocate memory and another part of the code then tried to access that unallocated memory. Inspecting the core dump could shed light on what went wrong. The memory consumption of our code increases as training progresses, and when a run crashed we sometimes noticed that the process remained behind as a 'zombie'; we didn't realize there were too many of them until they crashed a later run because the GPU had run out of memory. This could explain why the restart crashed on an earlier subject than the first run.
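One quick way to spot such leftover processes is to ask the driver which PIDs still hold GPU memory. The sketch below assumes nvidia-smi is on the PATH and that your driver version supports the compute-apps query fields; adjust or drop it if not:

```python
# List compute processes that still hold GPU memory, to spot leftovers or
# zombies from a crashed run. Assumes nvidia-smi is available on the PATH.
import subprocess

out = subprocess.check_output([
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader",
]).decode().strip()
print(out if out else "no compute processes are currently holding GPU memory")
```

Killing any stale PIDs it reports (or simply rebooting) frees that GPU memory before restarting the run.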
So my suggestion is to run this net on a setup with a bit more memory. If that is not possible, inspect the core dump to debug exactly what went wrong, and reduce the dense size of the net. I am pretty confident the performance of this net won't degrade too much if dense is reduced by a factor of 2, i.e. to 2048.
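As a back-of-the-envelope check on that suggestion, the following sketch compares the parameter memory of a fully connected layer at dense=4096 versus 2048; the input width n_in is a hypothetical placeholder, not the real value from this net:

```python
# Back-of-the-envelope parameter memory of one fully connected layer.
def dense_param_count(n_in, n_out):
    return n_in * n_out + n_out  # weight matrix plus biases

n_in = 8192  # hypothetical width of the layer feeding into `dense`
for dense in (4096, 2048):
    mib = dense_param_count(n_in, dense) * 4 / 1024.0 ** 2  # float32 bytes
    print("dense=%d -> %.1f MiB of weights" % (dense, mib))
```

Gradients and any momentum terms hold copies of roughly the same size, so halving dense saves considerably more than just the weight storage itself.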
Let us know if you need further help. Esube
I'm not sure what the issue was in the end, but RAM is indeed a very likely suspect: my RAM and swap were both filling up almost to capacity during training. Anyway, I managed to train the network, so I'm closing this issue. Thanks for your support!
So I managed to train most models, but this one keeps failing with a segmentation fault:
13 net_stf7_fea6_150e20_02_50_dense4096_3pVal
I started it with:
The first time (if I interpret the outputs correctly) it got to just after the 12th subject:
I then just restarted it and it failed again, but this time already at subject 3:
Since you didn't write any C code yourself, I assume it's an issue with Theano. Maybe you can find out exactly which versions of Theano/Lasagne you have been using?