Open JRMeyer opened 6 years ago
Can you change the buffer size in conf/acoustic_*.conf files and try?
Hi Srikanth,
I'm training again now after halving the buffer size for duration and acoustic models in these two files:
Ossian$ ls scripts/merlin_interface/
feed_forward_dnn_ossian_acoustic_model.conf feed_forward_dnn_ossian_duration_model.conf
As a reminder, I'm using Ossian, so I gather these are the correct conf files.
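For anyone repeating this step, here is a minimal sketch of halving the buffer size programmatically instead of editing the files by hand. It is a standalone Python 3 helper, not part of Merlin or Ossian. The option name `buffer_size`, the `[Data]` section, and the value 200000 in the sample text are assumptions for illustration; check your own conf files for the actual names and values.

```python
import configparser


def halve_buffer_size(conf_text, option="buffer_size"):
    """Parse conf_text and halve every `option` value found in any section.

    Returns the parser and a dict {section: (old_value, new_value)}.
    """
    # interpolation=None leaves any %(...)s placeholders in the file untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # preserve the original case of option names
    parser.read_string(conf_text)

    changed = {}
    for section in parser.sections():
        if parser.has_option(section, option):
            old = int(parser.get(section, option))
            new = max(1, old // 2)
            parser.set(section, option, str(new))
            changed[section] = (old, new)
    return parser, changed


# Hypothetical conf fragment; section and value are assumptions.
SAMPLE = """
[Data]
buffer_size: 200000

[Architecture]
hidden_layer_size: [1024, 1024]
"""

parser, changed = halve_buffer_size(SAMPLE)
for section, (old, new) in changed.items():
    print("[%s] buffer_size: %d -> %d" % (section, old, new))
```

To write the result back, `parser.write(open(path, "w"))` would work, though it drops comments from the original file, so editing by hand may be preferable for a one-off change.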
I'll report the results when training finishes, but in the meantime I wanted to post the following for documentation purposes.
Here's the Merlin output I had on the original crash, as opposed to the kernel log (which I posted above):
2017-10-21 23:35:14,904 DEBUG main.train_DNN: calculating validation loss
2017-10-21 23:35:24,706 INFO main.train_DNN: epoch 4, validation error 189.982375, train error 176.896324 time spent 660.15
2017-10-21 23:35:24,707 INFO plotting: Generating a plot in file /home/ubuntu/Ossian/train/ky/speakers/atai/naive_01_nn//dnn_training_ACOUST//plots/training_convergence.pdf
./train_ky.sh: line 33: 531 Killed python ./tools/merlin/src/run_merlin.py ~/Ossian/train/${lang}/speakers/${voice}/${recipe}/processors/acoustic_predictor/config.cfg
Update: Crashed again
I watched as the memory usage crept up to 100%, as you can see from the htop output:
The script advanced all the way up to epoch 4 and then crashed. Why would it not crash on epoch 1 if this were a buffer issue?
Any ideas on what to do?
2017-10-23 17:14:47,542 INFO main.train_DNN: epoch 4, validation error 189.688105, train error 176.420401 time spent 654.52
2017-10-23 17:14:47,543 INFO plotting: Generating a plot in file /home/ubuntu/Ossian/train/ky/speakers/atai/naive_01_nn//dnn_training_ACOUST//plots/training_convergence.pdf
2017-10-23 17:25:41,840 DEBUG main.train_DNN: calculating validation loss
2017-10-23 17:25:43,300 CRITICAL main : train_DNN threw an exception
Traceback (most recent call last):
File "./tools/merlin/src/run_merlin.py", line 1175, in <module>
main_function(cfg)
File "./tools/merlin/src/run_merlin.py", line 838, in main_function
cmp_mean_vector = cmp_mean_vector, cmp_std_vector = cmp_std_vector)
File "./tools/merlin/src/run_merlin.py", line 304, in train_DNN
this_valid_loss = valid_fn()
File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 898, in __call__
storage_map=getattr(self.fn, 'storage_map', None))
File "/usr/local/lib/python2.7/dist-packages/theano/gof/link.py", line 325, in raise_with_op
reraise(exc_type, exc_value, exc_trace)
File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 884, in __call__
self.fn() if output_subset is None else\
File "/usr/local/lib/python2.7/dist-packages/theano/gof/op.py", line 872, in rval
r = p(n, [x[0] for x in i], o)
File "/usr/local/lib/python2.7/dist-packages/theano/tensor/blas.py", line 1544, in perform
z[0] = numpy.asarray(numpy.dot(x, y))
MemoryError:
Apply node that caused the error: Dot22(Elemwise{Composite{tanh((i0 + i1))}}[(0, 0)].0, W)
Toposort index: 10
Inputs types: [TensorType(float64, matrix), TensorType(float64, matrix)]
Inputs shapes: [(29669, 1024), (1024, 1024)]
Inputs strides: [(8192, 8), (8192, 8)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[Elemwise{Composite{tanh((i0 + i1))}}[(0, 0)](Dot22.0, InplaceDimShuffle{x,0}.0)]]
HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
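As a rough sanity check on the traceback above, the sizes of the matrices involved in the failing `Dot22` can be worked out from the reported shapes and dtype (float64, 8 bytes per element, which matches the reported strides of 8192 = 1024 * 8). This is only arithmetic on the numbers in the log; it says nothing about how much memory the rest of the process holds at that point.

```python
def matrix_bytes(rows, cols, itemsize=8):
    """Size in bytes of a dense rows x cols matrix (itemsize=8 for float64)."""
    return rows * cols * itemsize


MIB = 1024.0 * 1024.0

activations = matrix_bytes(29669, 1024)  # first input to Dot22
weights = matrix_bytes(1024, 1024)       # second input, W
output = matrix_bytes(29669, 1024)       # result of the dot product

print("activations: %.1f MiB" % (activations / MIB))
print("weights:     %.1f MiB" % (weights / MIB))
print("output:      %.1f MiB" % (output / MIB))
print("output as float32: %.1f MiB" % (matrix_bytes(29669, 1024, 4) / MIB))
```

The allocation that failed is about 232 MiB, which is small compared with 7.5 GB. That suggests the process was already close to the limit when the validation pass started, rather than this single matrix being too large on its own.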
Hi Srikanth,
Any ideas on debugging this?
-josh
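One way to approach the debugging question is to log the process's peak memory after every epoch, which would show whether usage grows steadily across epochs or jumps at one particular step. Below is a minimal sketch using only the standard library. `run_epochs` and `leaky_epoch` are hypothetical stand-ins, not Merlin functions; in practice the `peak_rss_mib()` call would go inside the existing training loop. It assumes Linux, where `ru_maxrss` is reported in kilobytes.

```python
import resource


def peak_rss_mib():
    """Peak resident set size of this process in MiB (Linux: ru_maxrss is in KiB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


def run_epochs(n_epochs, train_one_epoch):
    """Call train_one_epoch(epoch) n_epochs times, recording peak memory after each."""
    history = []
    for epoch in range(1, n_epochs + 1):
        train_one_epoch(epoch)
        history.append((epoch, peak_rss_mib()))
    return history


# Hypothetical stand-in for a training step that keeps references alive
# between epochs, so memory is never released.
_leak = []


def leaky_epoch(epoch):
    _leak.append(bytearray(10 * 1024 * 1024))  # hold on to 10 MiB per epoch


history = run_epochs(4, leaky_epoch)
for epoch, mib in history:
    print("epoch %d: peak RSS %.1f MiB" % (epoch, mib))
```

Note that peak RSS never decreases, so a flat curve means no growth while a steadily rising one points at something accumulating between epochs. For the current (not peak) value, `VmRSS` in `/proc/self/status` is an alternative on Linux.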
Hi Srikanth et al,
I'm running run_merlin.py (from Ossian) on a machine with 4 CPUs and 7.5 GB of memory, and it crashes when training the acoustic model.
I don't know if it's related, but I see only one CPU is getting used at a time, with 400% capacity, as if it's running 4 jobs on one core.
Ideas? Is this an "I need a bigger machine" problem or a Merlin problem?
-josh