Closed dmadhitama closed 3 years ago
What is your GPU?
It's a DGX, Sir, with Tesla V100 GPUs. My first and second training runs with this repo went fine for the 2-speaker separation model. This happened when I was trying to train the 5-speaker separation model.
OK, what are lengths of the files you want to train on?
I guess it's approximately 5-20 seconds per file. FYI, I'm using a 16kHz sample rate here.
Hi @dmadhitama, I think the problem starts when you separate some samples (epoch 10: https://github.com/facebookresearch/svoice/blob/master/conf/config.yaml#L37) Can you please check the set you would like to separate during training, maybe over there you have very long files?
Yeah, after checking all the files, there is one file that is around 44 seconds long. But is that too long to train with the 5-speaker separation model? If so, should I shorten the file lengths or decrease the batch size to avoid the CUDA OOM issue?
So it is not a training issue, it is a memory issue; your training should be fine. I'm not sure why this happens during separation only; probably this file is significantly longer than the others. Decreasing the batch size won't help, since the batch size during separation is already 1, so I suggest shortening this specific file. You can also verify that the input variable does not save gradients. I'll check it on my side as well.
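If you go the file-shortening route, here is a minimal sketch using only Python's standard-library `wave` module (the function name, file paths, and the 20-second cap are just examples, not part of svoice):

```python
import wave

def trim_wav(src, dst, max_seconds=20.0):
    """Copy at most max_seconds of audio from src to dst."""
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        max_frames = int(max_seconds * reader.getframerate())
        frames = reader.readframes(min(max_frames, reader.getnframes()))
    with wave.open(dst, "wb") as writer:
        # writeframes patches the frame count in the header on close
        writer.setparams(params)
        writer.writeframes(frames)
```

For files much longer than the rest of the set, you could also split them into several shorter clips instead of discarding the tail.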
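To check the "does not save gradients" point, separation can be wrapped in `torch.no_grad()` so activations are freed instead of being kept for backprop. A minimal sketch, assuming a PyTorch model (`separate_no_grad` is a hypothetical helper, not part of svoice):

```python
import torch

def separate_no_grad(model, mixture):
    # eval() disables dropout/batch-norm updates; no_grad() skips
    # building the autograd graph, which is what eats memory here.
    model.eval()
    with torch.no_grad():
        estimates = model(mixture)
    return estimates
```

If the separated outputs still report `requires_grad=True`, the graph is being retained somewhere and memory will grow with input length.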
Yeah, as I said before, my first and second trials with different data (2-speaker separation) trained fine through the last epoch, and I recall the GPU memory used was around 15 GB. However, in the recent case (5-speaker separation) this error occurred: before the 10th epoch, the GPU memory used had already grown to around 32 GB.
FYI, the default batch size in config.yaml was 4, and I read in the paper that the best results were also at batch size 4. I don't know whether decreasing the batch size will still give good results, but it will surely slow down the training process. I already set the batch size to 3, and training now runs much slower than before; it is currently still at epoch 7, with 31 GB of GPU memory used. I don't know what happened at the 10th epoch.
Ohhhh, I thought you meant during separation. Sorry about that. Indeed, during training the batch size is 4.
So you can do several things:
1) Decrease the batch size, as you did. It will make training slower, but it should run.
2) Decrease the segment size (`segment`); the default is 4 seconds.
3) Increase the kernel window (`L`); the default is 8 and the paper uses 2/4, so I'm not sure what you used, but you can try 16 too.
4) Decrease the number of blocks (`R`); the default is 6. It will probably come at the expense of model performance, though I'm not sure by how much.
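Put together, the four options above map onto the training config roughly like this (the key names and the `swave` nesting are assumptions based on conf/config.yaml; double-check them against the file in the repo):

```yaml
# Hypothetical excerpt of conf/config.yaml -- verify key names against the repo
batch_size: 2   # option 1: smaller batches (default 4)
segment: 2      # option 2: shorter training segments, in seconds (default 4)
swave:
  L: 16         # option 3: larger encoder kernel window (default 8)
  R: 4          # option 4: fewer blocks (default 6)
```

Since the repo uses Hydra, these can presumably also be passed as command-line overrides to train.py instead of editing the file.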
Thanks a lot for the solutions you gave, @adiyoss! I will try those!
For point 3, I'm also using the default `L`, which is 8.
A question, Sir.
I don't quite understand the blocks parameter from point 4. What is the effect of decreasing or increasing the `R` parameter? What are its advantages and disadvantages?
Sure! You are welcome :)
The `R` parameter basically controls the number of layers in the model. A model with a larger value of `R` will have more layers and more parameters, and probably better results (until it reaches a plateau), but the model size will also be bigger.
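To make the size trade-off concrete, parameter count grows roughly linearly with `R`. A back-of-the-envelope sketch (the per-block and "other" parameter counts here are made-up illustrative numbers, not svoice's actual sizes):

```python
def approx_params(num_blocks, params_per_block, other_params):
    # Total parameters grow roughly linearly with the number of blocks R.
    return num_blocks * params_per_block + other_params

# Illustrative only: dropping R from 6 to 4 removes a third of the block parameters.
size_r6 = approx_params(6, 1_200_000, 300_000)  # 7,500,000
size_r4 = approx_params(4, 1_200_000, 300_000)  # 5,100,000
```

Activation memory during training scales with depth in a similar way, which is why lowering `R` also helps with OOM.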
I see.. Thank you for the explanation!
Hello, so I was training a toy dataset for 5-speaker separation. My data is 200 audio files, so with spk1, spk2, ..., spk5, and the mixture, the total is 1200 files. I was at around the 10th epoch of training when this happened.
Any idea what is going on? Everything was fine until the 10th epoch. Looking forward to your help. Thank you...