CSTR-Edinburgh / merlin

This is now the official location of the Merlin project.
http://www.cstr.ed.ac.uk/projects/merlin/
Apache License 2.0

run_merlin.py (from Ossian repo) crashes on CPU #266

Open JRMeyer opened 6 years ago

JRMeyer commented 6 years ago

Hi Srikanth et al,

I'm running run_merlin.py (from Ossian) on a machine with 4 CPUs and 7.5 GB of memory, and it crashes when training the acoustic model.

Oct 21 23:36:46 ip-111-222-333-444 kernel: [128851.471423] Out of memory: Kill process 531 (python) score 833 or sacrifice child

I don't know if it's related, but I see only one CPU in use at a time, yet at 400% capacity, as if four jobs were running on a single core.

Ideas? Is this an "I need a bigger machine" problem or a Merlin problem?

-josh
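(Aside, for anyone hitting the same 400%-on-one-core pattern: this usually means the BLAS library is multithreading a single matrix operation rather than Merlin running four separate jobs. A hedged sketch of environment settings that can influence this on a Theano backend — the flag names are Theano's own; whether they change anything in this particular setup is untested:)

```shell
# Sketch, assuming a Theano backend with an OpenMP-capable BLAS.
export OMP_NUM_THREADS=4                          # one BLAS thread per core
export THEANO_FLAGS="openmp=True,floatX=float32"  # float32 also halves memory vs float64
# then launch as usual, e.g.:
# python ./tools/merlin/src/run_merlin.py .../processors/acoustic_predictor/config.cfg
```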

ronanki commented 6 years ago

Can you change the buffer size in conf/acoustic_*.conf files and try?
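(For reference, a sketch of the setting in question, assuming the usual Merlin conf layout where `buffer_size` — the number of frames held in memory per block — lives in the `[Data]` section; the value shown is purely illustrative:)

```
[Data]
# frames loaded into memory at once; a smaller value lowers peak RAM
buffer_size: 100000
```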

JRMeyer commented 6 years ago

Hi Srikanth,

I'm training again now after halving the buffer size for duration and acoustic models in these two files:

Ossian$ ls scripts/merlin_interface/
feed_forward_dnn_ossian_acoustic_model.conf  feed_forward_dnn_ossian_duration_model.conf

As a reminder, I'm using Ossian, so I gather these are the correct conf files.

I'll report on results when training finishes, but in the meantime I wanted to post the following for documentation purposes.

Here's the Merlin output I had on the original crash, as opposed to the kernel log (which I posted above):

2017-10-21 23:35:14,904 DEBUG    main.train_DNN: calculating validation loss
2017-10-21 23:35:24,706 INFO     main.train_DNN: epoch 4, validation error 189.982375, train error 176.896324  time spent 660.15
2017-10-21 23:35:24,707 INFO           plotting: Generating a plot in file /home/ubuntu/Ossian/train/ky/speakers/atai/naive_01_nn//dnn_training_ACOUST//plots/training_convergence.pdf
./train_ky.sh: line 33:   531 Killed                  python ./tools/merlin/src/run_merlin.py ~/Ossian/train/${lang}/speakers/${voice}/${recipe}/processors/acoustic_predictor/config.cfg
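(The `Killed` line means the kernel's OOM killer stopped the process, matching the kernel log above. A small hypothetical helper, using Python's standard `resource` module, that could be called at the end of each epoch to see how fast peak memory grows — not part of Merlin itself:)

```python
import resource

def peak_rss_mb():
    """Peak resident set size of this process, in MB.

    On Linux, ru_maxrss is reported in kilobytes.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

# e.g. inside the training loop:
# logger.info("epoch %d peak RSS: %.1f MB", epoch, peak_rss_mb())
```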
JRMeyer commented 6 years ago

Update: Crashed again

I watched as memory usage crept up to 100%, as you can see from the htop output:

[htop screenshot]

The script advanced all the way up to epoch 4, and then crashed. Why would it not crash on epoch 1 if this were a buffer issue?

Any ideas on what to do?

2017-10-23 17:14:47,542 INFO     main.train_DNN: epoch 4, validation error 189.688105, train error 176.420401  time spent 654.52
2017-10-23 17:14:47,543 INFO           plotting: Generating a plot in file /home/ubuntu/Ossian/train/ky/speakers/atai/naive_01_nn//dnn_training_ACOUST//plots/training_convergence.pdf
2017-10-23 17:25:41,840 DEBUG    main.train_DNN: calculating validation loss
2017-10-23 17:25:43,300 CRITICAL       main    : train_DNN threw an exception
Traceback (most recent call last):
  File "./tools/merlin/src/run_merlin.py", line 1175, in <module>
    main_function(cfg)
  File "./tools/merlin/src/run_merlin.py", line 838, in main_function
    cmp_mean_vector = cmp_mean_vector, cmp_std_vector = cmp_std_vector)
  File "./tools/merlin/src/run_merlin.py", line 304, in train_DNN
    this_valid_loss = valid_fn()
  File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 898, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/usr/local/lib/python2.7/dist-packages/theano/gof/link.py", line 325, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 884, in __call__
    self.fn() if output_subset is None else\
  File "/usr/local/lib/python2.7/dist-packages/theano/gof/op.py", line 872, in rval
    r = p(n, [x[0] for x in i], o)
  File "/usr/local/lib/python2.7/dist-packages/theano/tensor/blas.py", line 1544, in perform
    z[0] = numpy.asarray(numpy.dot(x, y))
MemoryError: 
Apply node that caused the error: Dot22(Elemwise{Composite{tanh((i0 + i1))}}[(0, 0)].0, W)
Toposort index: 10
Inputs types: [TensorType(float64, matrix), TensorType(float64, matrix)]
Inputs shapes: [(29669, 1024), (1024, 1024)]
Inputs strides: [(8192, 8), (8192, 8)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[Elemwise{Composite{tanh((i0 + i1))}}[(0, 0)](Dot22.0, InplaceDimShuffle{x,0}.0)]]

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
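(The input shapes in the traceback hint at why this dies at validation time rather than mid-epoch: all 29669 validation frames appear to go through the network as a single float64 batch. A quick back-of-the-envelope for one hidden-layer activation matrix:)

```python
# Size of one (29669, 1024) float64 activation matrix, as in the traceback.
rows, cols = 29669, 1024
bytes_per_value = 8  # float64
activation_mb = rows * cols * bytes_per_value / 2**20
print(f"{activation_mb:.0f} MB")  # roughly 232 MB per 1024-unit layer
```

With several such layers plus Theano's temporaries, one full-batch validation pass can need gigabytes, leaving little headroom on a 7.5 GB machine. If that reading is right, shrinking the training `buffer_size` alone may not be enough; reducing the validation batch size, or switching to float32, would attack the allocation that actually fails.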
JRMeyer commented 6 years ago

Hi Srikanth,

Any ideas on debugging this?

-josh