CSTR-Edinburgh / merlin

This is now the official location of the Merlin project.
http://www.cstr.ed.ac.uk/projects/merlin/
Apache License 2.0

Running out of RAM when training a bigger AVM #337

Open martedva opened 6 years ago

martedva commented 6 years ago

Hello,

I am trying to train an AVM for speaker adaptation containing 50 speakers with approximately 400 sentences per speaker, which sums to a total of 18193 utterances for training. When I run this on my computer I get an out-of-memory exception, as shown below, despite using a swap partition as a safeguard. Could this be due to something else, or is the solution simply to have more actual RAM?

2018-04-11 09:05:43,466 INFO           main    : Input label dimension is 425
2018-04-11 09:05:43,552 INFO           main    : label dimension is 425
2018-04-11 09:05:43,578 INFO           main    : training DNN
2018-04-11 09:22:29,746 INFO     main.train_DNN: building the model
2018-04-11 09:22:49,441 INFO     main.train_DNN: fine-tuning the DNN model
2018-04-11 09:22:53,602 CRITICAL       main    : train_DNN threw an exception
Traceback (most recent call last):
  File "/home/tts/merlin/src/run_merlin.py", line 1313, in <module>
    main_function(cfg)
  File "/home/tts/merlin/src/run_merlin.py", line 863, in main_function
    cmp_mean_vector = cmp_mean_vector, cmp_std_vector = cmp_std_vector,init_dnn_model_file=cfg.start_from_trained_model)
  File "/home/tts/merlin/src/run_merlin.py", line 351, in train_DNN
    this_train_error = train_fn(current_finetune_lr, current_momentum)
  File "/home/tts/anaconda2/lib/python2.7/site-packages/theano/compile/function_module.py", line 917, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/home/tts/anaconda2/lib/python2.7/site-packages/theano/gof/link.py", line 325, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/home/tts/anaconda2/lib/python2.7/site-packages/theano/compile/function_module.py", line 903, in __call__
    self.fn() if output_subset is None else\
  File "pygpu/gpuarray.pyx", line 682, in pygpu.gpuarray.pygpu_zeros
  File "pygpu/gpuarray.pyx", line 693, in pygpu.gpuarray.pygpu_empty
  File "pygpu/gpuarray.pyx", line 301, in pygpu.gpuarray.array_empty
pygpu.gpuarray.GpuArrayException: cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Apply node that caused the error: GpuAlloc<None>{memset_0=True}(GpuArrayConstant{0.0}, Elemwise{add,no_inplace}.0, Elemwise{Composite{Switch(EQ(i0, i1), ((i2 * i3 * i4) // maximum((-(i0 * i4)), i5)), i0)}}.0, Shape_i{1}.0)
Toposort index: 171
Inputs types: [GpuArrayType<None>(float32, scalar), TensorType(int64, scalar), TensorType(int64, scalar), TensorType(int64, scalar)]
Inputs shapes: [(), (), (), ()]
Inputs strides: [(), (), (), ()]
Inputs values: [gpuarray.array(0., dtype=float32), array(51), array(1977), array(512)]
Outputs clients: [[forall_inplace,gpu,grad_of_scan_fn}(Shape_i{0}.0, GpuElemwise{mul,no_inplace}.0, GpuElemwise{Composite{(i0 - sqr(i1))}}[]<gpuarray>.0, GpuElemwise{mul,no_inplace}.0, GpuElemwise{tanh,no_inplace}.0, GpuElemwise{mul,no_inplace}.0, InplaceGpuDimShuffle{0,2,1}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{::int64}.0, GpuAlloc<None>{memset_0=True}.0, GpuAlloc<None>{memset_0=True}.0, GpuAlloc<None>{memset_0=True}.0, GpuAlloc<None>{memset_0=True}.0, GpuAlloc<None>{memset_0=True}.0, GpuAlloc<None>{memset_0=True}.0, GpuAlloc<None>{memset_0=True}.0, GpuAlloc<None>{memset_0=True}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, Shape_i{0}.0, W_ho, W_hf, W_hi, W_hc, InplaceGpuDimShuffle{1,0}.0, InplaceGpuDimShuffle{x,0}.0, InplaceGpuDimShuffle{x,0}.0, InplaceGpuDimShuffle{x,0}.0, InplaceGpuDimShuffle{x,0}.0, InplaceGpuDimShuffle{1,0}.0, InplaceGpuDimShuffle{1,0}.0, InplaceGpuDimShuffle{x,0}.0, InplaceGpuDimShuffle{1,0}.0, InplaceGpuDimShuffle{x,0}.0, InplaceGpuDimShuffle{x,0}.0)]]

Backtrace when the node is created(use Theano flag traceback.limit=N to make it longer):
  File "/home/tts/anaconda2/lib/python2.7/site-packages/theano/gradient.py", line 1326, in access_grad_cache
    term = access_term_cache(node)[idx]
  File "/home/tts/anaconda2/lib/python2.7/site-packages/theano/gradient.py", line 1021, in access_term_cache
    output_grads = [access_grad_cache(var) for var in node.outputs]
  File "/home/tts/anaconda2/lib/python2.7/site-packages/theano/gradient.py", line 1326, in access_grad_cache
    term = access_term_cache(node)[idx]
  File "/home/tts/anaconda2/lib/python2.7/site-packages/theano/gradient.py", line 1021, in access_term_cache
    output_grads = [access_grad_cache(var) for var in node.outputs]
  File "/home/tts/anaconda2/lib/python2.7/site-packages/theano/gradient.py", line 1326, in access_grad_cache
    term = access_term_cache(node)[idx]
  File "/home/tts/anaconda2/lib/python2.7/site-packages/theano/gradient.py", line 1021, in access_term_cache
    output_grads = [access_grad_cache(var) for var in node.outputs]
  File "/home/tts/anaconda2/lib/python2.7/site-packages/theano/gradient.py", line 1326, in access_grad_cache
    term = access_term_cache(node)[idx]
  File "/home/tts/anaconda2/lib/python2.7/site-packages/theano/gradient.py", line 1162, in access_term_cache
    new_output_grads)

HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
Lock freed
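
For context, the allocation that fails here is a cuMemAlloc call on the GPU, so a swap partition (which only backs host RAM) cannot prevent it. Below is a rough back-of-envelope estimate, a minimal sketch assuming the (51, 1977, 512) float32 shape from the "Inputs values" line above corresponds to sentences x frames x hidden units:

    # Rough size of the buffer that failed to allocate.
    # The shape (51, 1977, 512) and float32 dtype are read off the traceback above;
    # treating the dimensions as sentences x frames x hidden units is an assumption.
    n_sentences = 51
    max_frames = 1977
    hidden_size = 512

    buffer_bytes = n_sentences * max_frames * hidden_size * 4  # 4 bytes per float32
    print("one buffer: %.1f MB" % (buffer_bytes / 1024.0 ** 2))  # prints ~196.9 MB

The gradient of scan keeps several buffers of this shape alive at once, so peak GPU memory grows roughly with the number of sentences per batch and with the length of the longest sentence in it. If that is what is happening, lowering batch_size in the [Architecture] section of the recipe's .conf file (the exact option names depend on the recipe) or training on a GPU with more memory are the usual ways around it; more system RAM or swap will not change a GPU-side allocation.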
raghavmenon commented 6 years ago

Hi,

Did you find a solution to this memory error? I am getting almost the same error. Any help or advice would be welcome.

Thanks.

Regards, Raghav

userandre commented 6 years ago

Hello,

Did anyone find a solution to this memory error? I would be really grateful for any help.

Thanks, Andre