bitsofbits / kaggle_grasp_and_lift_eeg_detection

Code for Team HEDJ's submission for Kaggle's Grasp and Lift EEG Detection Contest
BSD 2-Clause "Simplified" License

Segfault while training model 13 #2

Qwlouse closed this issue 9 years ago

Qwlouse commented 9 years ago

So I managed to train most models, but this one keeps failing with a segmentation fault: 13 net_stf7_fea6_150e20_02_50_dense4096_3pVal

I started it with:

THEANO_FLAGS=device=gpu0,floatX=float32 python submission.py -r run -n 13

The first time (if I interpret the outputs correctly) it got to just after the 12th subject:

[...]
Subject: 12
Loading 12 1
Band pass filtering, f_l = 0.2 f_h = 50
Loading 12 2
Band pass filtering, f_l = 0.2 f_h = 50
Loading 12 3
Band pass filtering, f_l = 0.2 f_h = 50
Loading 12 4
Band pass filtering, f_l = 0.2 f_h = 50
Loading 12 5
Band pass filtering, f_l = 0.2 f_h = 50
Loading 12 6
Band pass filtering, f_l = 0.2 f_h = 50
Loading 12 7
Band pass filtering, f_l = 0.2 f_h = 50
Loading 12 8
Band pass filtering, f_l = 0.2 f_h = 50
Loaded parameters to layer 'layer2' (shape 6x32x5).
Loaded parameters to layer 'layer2' (shape 6).
Loaded parameters to layer 'layer6' (shape 96x6).
Loaded parameters to layer 'layer6' (shape 6).
Loaded parameters to layer 'layer7' (shape 32x6x7).
Loaded parameters to layer 'layer7' (shape 32).
Loaded parameters to layer 'last_conv' (shape 64x32x5).
Loaded parameters to layer 'last_conv' (shape 64x128).
Loaded parameters to layer 'layer18' (shape 736x16384).
Loaded parameters to layer 'layer18' (shape 16384).
Loaded parameters to layer 'layer21' (shape 4096x16384).
Loaded parameters to layer 'layer21' (shape 16384).
Loaded parameters to layer 'output' (shape 4096x6).
Loaded parameters to layer 'output' (shape 6).
# Neural Network with 79246194 learnable parameters

## Layer information

  #  name       size
---  ---------  -------
  0  layer1     32x4096
  1  layer2     6x4096
  2  layer3     4096x6
  3  layer4     256x96
  4  layer5     96x256
  5  layer6     6x256
  6  layer7     32x256
  7  layer8     32x128
  8  last_conv  64x128
  9  layer10    32x128
 10  layer11    32x15
 11  all_time   480
 12  layer13    64x8
 13  layer14    32x8
 14  recent     256
 15  layer16    736
 16  layer17    736
 17  layer18    16384
 18  layer19    4096
 19  layer20    4096
 20  layer21    16384
 21  layer22    4096
 22  layer23    4096
 23  output     6

  epoch    train loss    valid loss    train/val  dur
-------  ------------  ------------  -----------  ------
      1       0.08145       0.07737      1.05283  31.41s
      2       0.06981       0.07409      0.94226  31.54s
      3       0.06599       0.06706      0.98401  31.76s
      4       0.06161       0.06165      0.99926  31.81s
      5       0.06102       0.07220      0.84506  31.62s
      6       0.05734       0.06241      0.91874  31.49s
      7       0.05715       0.05526      1.03414  31.70s
      8       0.05620       0.05861      0.95883  31.59s
      9       0.05566       0.05838      0.95354  31.73s
     10       0.05207       0.06363      0.81831  31.65s
     11       0.05129       0.05960      0.86062  31.65s
     12       0.05178       0.05531      0.93603  31.75s
     13       0.05093       0.05726      0.88940  31.57s
     14       0.04802       0.06019      0.79776  31.85s
     15       0.04654       0.05920      0.78617  31.50s
     16       0.04774       0.05754      0.82965  31.53s
     17       0.04669       0.05712      0.81737  31.44s
     18       0.04423       0.06129      0.72159  31.61s
     19       0.04432       0.05502      0.80560  31.63s
     20       0.04386       0.05743      0.76364  31.61s
     21       0.04166       0.05586      0.74583  31.57s
     22       0.04222       0.05859      0.72055  31.52s
     23       0.04058       0.05726      0.70866  31.79s
     24       0.03924       0.05761      0.68113  31.88s
     25       0.03934       0.05626      0.69920  31.53s
     26       0.03824       0.05915      0.64653  31.60s
     27       0.03705       0.05981      0.61934  31.65s
     28       0.03678       0.05854      0.62828  31.71s
     29       0.03528       0.05787      0.60967  31.66s
     30       0.03577       0.05896      0.60680  31.75s
     31       0.03456       0.05900      0.58577  31.49s
     32       0.03399       0.05900      0.57605  31.65s
     33       0.03410       0.05716      0.59653  31.55s
     34       0.03278       0.05749      0.57022  31.64s
     35       0.03267       0.06005      0.54398  31.65s
     36       0.03218       0.06061      0.53084  31.52s
     37       0.03272       0.05713      0.57274  31.75s
     38       0.03233       0.05827      0.55491  31.76s
     39       0.03051       0.06176      0.49403  31.77s
Early stopping.
Best valid accuracy was 0.055016 at epoch 19.
Loaded parameters to layer 'layer2' (shape 6x32x5).
Loaded parameters to layer 'layer2' (shape 6).
Loaded parameters to layer 'layer6' (shape 96x6).
Loaded parameters to layer 'layer6' (shape 6).
Loaded parameters to layer 'layer7' (shape 32x6x7).
Loaded parameters to layer 'layer7' (shape 32).
Loaded parameters to layer 'last_conv' (shape 64x32x5).
Loaded parameters to layer 'last_conv' (shape 64x128).
Loaded parameters to layer 'layer18' (shape 736x16384).
Loaded parameters to layer 'layer18' (shape 16384).
Loaded parameters to layer 'layer21' (shape 4096x16384).
Loaded parameters to layer 'layer21' (shape 16384).
Loaded parameters to layer 'output' (shape 4096x6).
Loaded parameters to layer 'output' (shape 6).
Loading 12 2
Band pass filtering, f_l = 0.2 f_h = 50
Segmentation fault (core dumped)

I then just restarted it, and it failed again, this time already at subject 3:

[...]
Subject: 3
Loading 3 1
Band pass filtering, f_l = 0.2 f_h = 50
Loading 3 2
Band pass filtering, f_l = 0.2 f_h = 50
Loading 3 3
Band pass filtering, f_l = 0.2 f_h = 50
Loading 3 4
Band pass filtering, f_l = 0.2 f_h = 50
Loading 3 5
Band pass filtering, f_l = 0.2 f_h = 50
Loading 3 6
Band pass filtering, f_l = 0.2 f_h = 50
Loading 3 7
Band pass filtering, f_l = 0.2 f_h = 50
Loading 3 8
Band pass filtering, f_l = 0.2 f_h = 50
Segmentation fault (core dumped)

Since you didn't write any C code yourself, I assume it's an issue with Theano. Maybe you can find out exactly which versions of Theano/Lasagne you were using?
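For reference, one quick way to print the versions in question (assuming both packages expose __version__, which releases of that era do) is:

python -c "import theano, lasagne; print(theano.__version__); print(lasagne.__version__)"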

bitsofbits commented 9 years ago

Hi Klaus,

I just sent a note to @Deep to have him check on the Theano version. In the meantime, my first instinct is that this is some sort of memory allocation problem, perhaps on the GPU. I'm suspicious because this net is likely larger than any of the others due to the large value of dense. Can you try editing final_nets.yml so that the entry for this net looks like:

-
    output_name: net_stf7_fea6_150e20_02_50_dense4096_3pVal
    net: net_stf7
    max_epochs: 150
    patience: 20
    min_freq: 0.2
    max_freq: 50
    valid_size: 4096
    train_size: 20480
    dense: 4096
    batch_size: 16
    validation: 6
    rate: 0.04
-

This drops the batch_size from the default 32 to 16, cutting the memory usage on the GPU roughly in half. It will change the result a bit, since changing the mini-batch size changes the path of the gradient descent algorithm, but it should be close.
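As a rough back-of-envelope check (illustrative numbers taken from the layer summary above; this is not the project's code), note that the ~79M parameters occupy a fixed ~302 MB in float32 regardless of batch size, while activation buffers scale linearly with it:

# Sketch: which parts of GPU memory shrink when batch_size is halved.
params = 79246194                # from the nolearn summary above
bytes_per_float = 4              # float32
print(params * bytes_per_float / 2.0 ** 20)      # ~302 MB of weights, batch-independent

per_sample = 2 * 16384           # outputs of the two widest dense layers (layer18, layer21)
for batch_size in (32, 16):
    # activation memory for these two layers alone; halves with batch_size
    print(batch_size * per_sample * bytes_per_float / 2.0 ** 20)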

esube commented 9 years ago

Hi Klaus, As I put it in my comment on the previous closed issue, the differences between our setup and yours are:

python==2.7
nolearn==0.6adev
numpy==1.9.2
pandas==0.15.2
pytz==2012c
scipy==0.13.3
six==1.5.2

As for Theano, we are using 0.7.0, just like you. Since Lasagne, and hence nolearn, are mostly Python wrappers on top of Theano, I don't suspect a version mismatch: the wrapped C code lives all the way down in Theano, and our Theano versions are the same.

My best guess, as Tim pointed out, is a memory/swap issue. This is the largest net of all, as it was trained on the largest portion of the training dataset. If I remember correctly, you said your setup has 16 GB of RAM (half the size of ours, and hence the most likely suspect), and I am not sure how big your GPU memory is? Ours is 4 GB on each of our GTX 980s.

A segmentation fault usually occurs when a pointer tries to reference unallocated or restricted memory. So my best guess is that the C code beneath Theano tried and failed to allocate memory, and another part of the code then accessed that unallocated memory. Inspecting the core dump could shed light on what went wrong. The memory consumption of our code increases as training progresses, and when a run crashed we sometimes noticed that the process remained as a 'zombie'; we didn't realize there were too many of them until they crashed our run because the GPU ran out of memory. This could explain why the restart crashed on an earlier subject than the first run.
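One quick sanity check for such leftover processes before restarting (nvidia-smi lists every process currently holding GPU memory) is:

nvidia-smi

Any stale python process from a crashed run will show up in its process table and can be killed before the next run.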

So, my suggestion is to run this net on a setup with a bit more memory. If that is not possible, inspect the core dump to debug exactly what went wrong, or reduce the dense size of the net. I am pretty confident the performance of this net won't degrade much if dense is halved, i.e. set to 2048.
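Concretely (a minimal sketch of the suggested change, not a tested configuration), that would mean keeping the final_nets.yml entry shown above the same except for one line:

    dense: 2048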

Let us know if you need further help. Esube

Qwlouse commented 9 years ago

I'm not sure what the issue was in the end, but RAM is indeed a very likely suspect. My RAM and swap were both filling up almost to capacity during training. Anyway, I managed to train the network, so I'm closing this issue. Thanks for your support!