facebookresearch / svoice

We provide a PyTorch implementation of the paper Voice Separation with an Unknown Number of Multiple Speakers, in which we present a new method for separating a mixed audio sequence where multiple voices speak simultaneously. The new method employs gated neural networks that are trained to separate the voices at multiple processing steps, while keeping the speaker in each output channel fixed. A separate model is trained for each possible number of speakers, and the model with the largest number of speakers is used to select the actual number of speakers in a given sample. Our method greatly outperforms the current state of the art, which, as we show, is not competitive for more than two speakers.

CUDA out of memory #14

Closed dmadhitama closed 3 years ago

dmadhitama commented 3 years ago

Hello, I was training a toy dataset for 5-speaker separation. My data is 200 audio files per source, so with spk1, spk2, ..., spk5, and the mix, the total is 1,200 files. I was around the 10th epoch of training when this happened.

[2021-01-28 16:59:10,810][svoice.evaluate][INFO] - Eval estimates | 40/200 | 1.1 it/sec                                                                                                                      
[2021-01-28 16:59:37,874][svoice.evaluate][INFO] - Eval estimates | 80/200 | 1.3 it/sec                                                                                                                      
[2021-01-28 17:00:01,654][svoice.evaluate][INFO] - Eval estimates | 120/200 | 1.4 it/sec                                                                                                                     
[2021-01-28 17:00:22,557][svoice.evaluate][INFO] - Eval estimates | 160/200 | 1.5 it/sec                                                                                                                     
[2021-01-28 17:00:40,419][svoice.evaluate][INFO] - Eval estimates | 200/200 | 1.6 it/sec                                                                                                                     
[2021-01-28 17:00:40,420][svoice.evaluate][INFO] - Eval metrics | 40/200 | 164561.2 it/sec
[2021-01-28 17:00:40,421][svoice.evaluate][INFO] - Eval metrics | 80/200 | 106802.5 it/sec
[2021-01-28 17:00:40,421][svoice.evaluate][INFO] - Eval metrics | 120/200 | 98584.1 it/sec
[2021-01-28 17:00:40,422][svoice.evaluate][INFO] - Eval metrics | 160/200 | 96469.0 it/sec
[2021-01-28 17:00:40,422][svoice.evaluate][INFO] - Eval metrics | 200/200 | 88817.4 it/sec
[2021-01-28 17:00:40,497][svoice.evaluate][INFO] - Test set performance: SISNRi=4.94 PESQ=0.0, STOI=0.0.
[2021-01-28 17:00:40,522][svoice.solver][INFO] - Separate and save samples...
  0%|                                                                                            | 0/50 [00:00<?, ?it/s]
[2021-01-28 17:00:41,458][__main__][ERROR] - Some error happened
Traceback (most recent call last):
  File "train.py", line 118, in main
    _main(args)
  File "train.py", line 112, in _main
    run(args)
  File "train.py", line 93, in run
    solver.train()
  File "/home/donny.adhitama/dom_tools/svoice/svoice/solver.py", line 166, in train
    separate(self.args, self.model, self.samples_dir)
  File "/home/donny.adhitama/dom_tools/svoice/svoice/separate.py", line 123, in separate
    estimate_sources = model(mixture)[-1]
File "/home/donny.adhitama/miniconda3/envs/svoice/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/donny.adhitama/dom_tools/svoice/svoice/models/swave.py", line 246, in forward
    output_all = self.separator(mixture_w)
  File "/home/donny.adhitama/miniconda3/envs/svoice/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/donny.adhitama/dom_tools/svoice/svoice/models/swave.py", line 207, in forward
    output_all = self.rnn_model(enc_segments)
  File "/home/donny.adhitama/miniconda3/envs/svoice/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/donny.adhitama/dom_tools/svoice/svoice/models/swave.py", line 108, in forward
    row_output = self.rows_grnn[i](row_input)
  File "/home/donny.adhitama/miniconda3/envs/svoice/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/donny.adhitama/dom_tools/svoice/svoice/models/swave.py", line 47, in forward
    gate_rnn_output, _ = self.gate_rnn(output)
  File "/home/donny.adhitama/miniconda3/envs/svoice/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/donny.adhitama/miniconda3/envs/svoice/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 577, in forward
    self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: CUDA out of memory. Tried to allocate 4.17 GiB (GPU 0; 31.72 GiB total capacity; 10.77 GiB already allocated; 503.00 MiB free; 30.06 GiB reserved in total by PyTorch)

Any idea what was going on? Everything was fine until the 10th epoch. Looking forward to your help. Thank you...

egorsmkv commented 3 years ago

What is your GPU?

dmadhitama commented 3 years ago

> What is your GPU?

A DGX, Sir, with a Tesla V100 (screenshot attached). My first and second training runs with this repo went fine and clean for the 2-speaker separation model. This happened when I tried to train the 5-speaker separation model.

egorsmkv commented 3 years ago

OK, what are the lengths of the files you want to train on?

dmadhitama commented 3 years ago

> OK, what are the lengths of the files you want to train on?

I guess it's approximately 5-20 seconds per file. FYI, I'm using a 16 kHz sample rate here.

adiyoss commented 3 years ago

Hi @dmadhitama, I think the problem starts when you separate some samples (at epoch 10: https://github.com/facebookresearch/svoice/blob/master/conf/config.yaml#L37). Can you please check the set you would like to separate during training? Maybe you have some very long files there.
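If it helps, here is a quick sketch for spotting outliers in the separation set (assumptions: 16 kHz WAV files, torchaudio >= 0.8 for the metadata-returning `info` API; `mix_dir` is a hypothetical path, point it at whatever directory your config separates from):

```python
import os
import torchaudio

mix_dir = "dataset/debug/mix"  # hypothetical path; use your separation set

durations = []
for name in os.listdir(mix_dir):
    if name.endswith(".wav"):
        meta = torchaudio.info(os.path.join(mix_dir, name))
        durations.append((meta.num_frames / meta.sample_rate, name))

# Print the five longest files; anything far above the rest is a likely culprit.
for seconds, name in sorted(durations, reverse=True)[:5]:
    print(f"{seconds:7.1f}s  {name}")
```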

dmadhitama commented 3 years ago

> Hi @dmadhitama, I think the problem starts when you separate some samples (at epoch 10: https://github.com/facebookresearch/svoice/blob/master/conf/config.yaml#L37). Can you please check the set you would like to separate during training? Maybe you have some very long files there.

Yeah, after checking all the files, there is one file that is around 44 seconds long. Was that too long for the 5-speaker separation model? If so, should I shorten the file or decrease the batch size to avoid the CUDA OOM?

adiyoss commented 3 years ago

So it is not a training issue, it is a memory issue; your training should be fine. I'm not sure why this happens during separation only; probably this file is significantly longer than the others. Decreasing the batch size won't help, since the batch size during separation is already 1, so I suggest shortening this specific file. You can also try to verify that the input variable does not save gradients for this step; I'll check it on my side as well.
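A minimal sketch of that last suggestion, assuming the call site looks like the one in the traceback (`svoice/separate.py`): running the forward pass under `torch.no_grad()` stops PyTorch from retaining activations for backprop, which is usually the bulk of the memory during separation.

```python
import torch

# model and mixture as loaded in svoice/separate.py (see the traceback above)
model.eval()
with torch.no_grad():  # no autograd graph, so activations are freed immediately
    estimate_sources = model(mixture)[-1]
```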

dmadhitama commented 3 years ago

> So it is not a training issue, it is a memory issue; your training should be fine. I'm not sure why this happens during separation only; probably this file is significantly longer than the others. Decreasing the batch size won't help, since the batch size during separation is already 1, so I suggest shortening this specific file. You can also try to verify that the input variable does not save gradients for this step; I'll check it on my side as well.

Yeah, as I said before: in my first and second trials with different data (2-speaker separation), training went well until the last epoch, and I recall that the GPU memory used was around 15 GB. However, in the recent case (5-speaker separation) this error occurred; before the 10th epoch, GPU memory usage had grown to around 32 GB.

FYI, the default batch size in config.yaml is 4, and I read in the paper that the best results were also obtained at batch size 4. I don't know whether decreasing the batch size will still give good results, but it will surely slow down training. I have already set the batch size to 3, and training is running much slower than before; it is currently still at epoch 7, with GPU memory usage at 31 GB. I still don't know what happened at the 10th epoch.

adiyoss commented 3 years ago

Ohhhh, I thought you meant during separation, sorry about that. Indeed, during training the batch size is 4. So you can do several things:

1. Decrease the batch size, as you did. It will make training slower, but it should run.
2. Decrease the segment size: `segment`, the default is 4 sec.
3. Increase the kernel window: `L`, the default is 8, and in the paper it is 2/4, so I'm not sure what you used, but you can try 16 too.
4. Decrease the number of blocks: `R`, the default is 6. This will probably come at the expense of model performance, though I'm not sure by how much.

A rough sketch of how `segment` and `L` affect memory follows below.
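To see why options 2 and 3 help, here is a back-of-the-envelope illustration (not svoice's exact code): assuming 16 kHz audio and the usual TasNet-style encoder stride of `L // 2`, the RNN sequence length grows linearly with the segment length and shrinks as `L` grows.

```python
sample_rate = 16000  # as stated earlier in this thread

def encoder_frames(segment_sec, L):
    # Assumed encoder stride of L // 2 (common in TasNet-style models);
    # RNN time and activation memory scale with this frame count.
    stride = L // 2
    return sample_rate * segment_sec // stride

for segment_sec, L in [(4, 8), (2, 8), (4, 16)]:
    print(f"segment={segment_sec}s, L={L}: ~{encoder_frames(segment_sec, L)} frames")
```

Halving `segment` or doubling `L` roughly halves the per-sample sequence length (16000 frames at the defaults vs. 8000 for either change).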

dmadhitama commented 3 years ago

Thanks a lot for the solutions, @adiyoss! I will try them!

For point 3, I'm also using the default L, which is 8.

A question, Sir: I don't quite understand the blocks parameter in point 4. What is the effect of decreasing or increasing the R parameter? What are its advantages and disadvantages?

adiyoss commented 3 years ago

Sure, you are welcome :) The R parameter basically controls the number of layers in the model. A model with a larger value of R will have more layers and more parameters, and will probably get better results (until it reaches a plateau), but the model size will also be bigger.
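A toy illustration of that scaling (this is not the actual svoice separator, whose blocks are gated dual-path RNNs; a plain bidirectional LSTM stands in for each block here):

```python
import torch.nn as nn

def toy_separator(R, hidden=128):
    # One bidirectional LSTM per "block"; the parameter count (and the
    # activation memory) grows roughly linearly with the number of blocks R.
    return nn.ModuleList(
        [nn.LSTM(hidden, hidden, bidirectional=True) for _ in range(R)]
    )

for R in (2, 4, 6):
    n = sum(p.numel() for p in toy_separator(R).parameters())
    print(f"R={R}: {n / 1e6:.2f}M parameters")
```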

dmadhitama commented 3 years ago

> Sure, you are welcome :) The R parameter basically controls the number of layers in the model. A model with a larger value of R will have more layers and more parameters, and will probably get better results (until it reaches a plateau), but the model size will also be bigger.

I see. Thank you for the explanation!