We provide a PyTorch implementation of the paper Voice Separation with an Unknown Number of Multiple Speakers, in which we present a new method for separating a mixed audio sequence in which multiple voices speak simultaneously. The new method employs gated neural networks that are trained to separate the voices at multiple processing steps, while keeping the speaker in each output channel fixed. A different model is trained for every number of possible speakers, and the model with the largest number of speakers is employed to select the actual number of speakers in a given sample. Our method greatly outperforms the current state of the art, which, as we show, is not competitive for more than two speakers.
Cross validation fails with error during training #97
Hello, this implementation does (should do) exactly what I need for a project I am working on.
However, I could not get the older versions of the torch+cuda and numpy modules to work on the NVIDIA L4 GPU I am using for the project. I upgraded torch to 1.13.1 (the GPU has CUDA 12.4 installed). I also had to upgrade numpy to 1.21.6; without that upgrade I get the following error:
File "train.py", line 120, in main
_main(args)
File "train.py", line 114, in _main
run(args)
File "train.py", line 32, in run
from svoice.solver import Solver
File "/home/vineet/svoice/svoice/solver.py", line 23, in <module>
from .evaluate import evaluate
File "/home/vineet/svoice/svoice/evaluate.py", line 16, in <module>
from pesq import pesq
File "/home/vineet/svoice/.testing/lib/python3.7/site-packages/pesq/__init__.py", line 6, in <module>
from .cypesq import cypesq
File "pesq/cypesq.pyx", line 1, in init cypesq
ImportError: numpy.core.multiarray failed to import (auto-generated because you didn't call 'numpy.import_array()' after cimporting numpy; use '<void>numpy._import_array' to disable if you are certain you don't need it).
After updating these, I was able to get the training script, train.py, to start without import errors, but the script fails during the cross validation step with the following error:
[2024-06-01 16:03:45,776][__main__][INFO] - For logs, checkpoints and samples check /home/vineet/svoice/outputs/exp_
[2024-06-01 16:03:56,183][__main__][INFO] - Running on host training-l4-2-vcpus-24-ram-96-ubuntu
[2024-06-01 16:03:58,471][svoice.solver][DEBUG] - Checkpoint will be saved to /home/vineet/svoice/outputs/debug/model.th
[2024-06-01 16:03:58,472][svoice.solver][INFO] - ----------------------------------------------------------------------
[2024-06-01 16:03:58,472][svoice.solver][INFO] - Training...
[2024-06-01 16:03:59,818][svoice.solver][INFO] - Train | Epoch 1 | 3/15 | 3.5 it/sec | Loss 21.13142
[2024-06-01 16:04:00,384][svoice.solver][INFO] - Train | Epoch 1 | 6/15 | 4.1 it/sec | Loss 21.46726
[2024-06-01 16:04:00,954][svoice.solver][INFO] - Train | Epoch 1 | 9/15 | 4.4 it/sec | Loss 21.30898
[2024-06-01 16:04:01,521][svoice.solver][INFO] - Train | Epoch 1 | 12/15 | 4.6 it/sec | Loss 21.40352
[2024-06-01 16:04:02,067][svoice.solver][INFO] - Train | Epoch 1 | 15/15 | 4.7 it/sec | Loss 21.39990
[2024-06-01 16:04:02,070][svoice.solver][INFO] - Train Summary | End of Epoch 1 | Time 3.60s | Train Loss 21.39990
[2024-06-01 16:04:02,070][svoice.solver][INFO] - ----------------------------------------------------------------------
[2024-06-01 16:04:02,070][svoice.solver][INFO] - Cross validation...
[2024-06-01 16:04:02,330][__main__][ERROR] - Some error happened
Traceback (most recent call last):
File "train.py", line 120, in main
_main(args)
File "train.py", line 114, in _main
run(args)
File "train.py", line 95, in run
solver.train()
File "/home/vineet/svoice/svoice/solver.py", line 133, in train
valid_loss = self._run_one_epoch(epoch, cross_valid=True)
File "/home/vineet/svoice/svoice/solver.py", line 213, in _run_one_epoch
estimate_source = self.dmodel(mixture)
File "/home/vineet/svoice/.testing/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/vineet/svoice/svoice/models/swave.py", line 256, in forward
mixture_w = self.encoder(mixture)
File "/home/vineet/svoice/.testing/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/vineet/svoice/svoice/models/swave.py", line 284, in forward
mixture_w = F.relu(self.conv(mixture))
File "/home/vineet/svoice/.testing/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/vineet/svoice/.testing/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 313, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/vineet/svoice/.testing/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 310, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (0). Kernel size: (8). Kernel size can't be greater than actual input size
After some searching, it appears that this could be caused by the training input .wav files. However, I am using the training dataset provided with the repo, so I would have expected it to work out of the box.
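To make the suspicion about the input files concrete, here is a minimal sketch of how a dataset folder could be scanned for empty or very short .wav files. It uses only the standard-library wave module, which reads PCM files only. The function name find_short_wavs and the threshold of 8 frames (taken from the kernel size in the error message) are my own assumptions, not part of the svoice code.

```python
import wave
from pathlib import Path

# Kernel size reported in the RuntimeError; used here as the minimum length (assumption).
KERNEL_SIZE = 8


def find_short_wavs(root, min_frames=KERNEL_SIZE):
    """Return (path, n_frames) for every .wav under root with fewer than min_frames frames.

    n_frames is None when the stdlib wave module cannot parse the file
    (empty file, truncated header, or a non-PCM encoding).
    """
    suspicious = []
    for path in sorted(Path(root).rglob("*.wav")):
        try:
            with wave.open(str(path), "rb") as wav:
                n_frames = wav.getnframes()
        except (wave.Error, EOFError):
            n_frames = None
        if n_frames is None or n_frames < min_frames:
            suspicious.append((str(path), n_frames))
    return suspicious
```

Calling find_short_wavs on the folder that holds the validation audio would list any file that is too short for a kernel of size 8. An empty result would point away from the files themselves and toward how the validation set is loaded.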
If I skip the cross validation step by setting the cross_valid parameter to False in solver.py, the training progresses, but I encounter errors in the forward() method of the SWave model's Encoder, where the Conv1d() call fails. I also tried upgrading to Python 3.12, with corresponding updates to the dependencies, but ran into the same issues.
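For reference, the Conv1d failure is consistent with the output-length formula in the PyTorch documentation: the padded input must be at least as long as the (dilated) kernel. The sketch below restates that arithmetic in plain Python. The helper name conv1d_out_length is hypothetical, and the stride of 4 in the usage note is an assumption; only the kernel size of 8 comes from the error message.

```python
def conv1d_out_length(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution, following the formula in the PyTorch Conv1d docs."""
    padded = length + 2 * padding
    effective_kernel = dilation * (kernel_size - 1) + 1
    if padded < effective_kernel:
        # Same condition that PyTorch reports as
        # "Kernel size can't be greater than actual input size".
        raise ValueError(
            f"padded input size ({padded}) is smaller than kernel size ({effective_kernel})"
        )
    return (padded - effective_kernel) // stride + 1
```

With a kernel of 8 and an assumed stride of 4, conv1d_out_length(16, 8, stride=4) gives 3 output frames, while conv1d_out_length(0, 8) raises. That matches the "(0)" input size in the traceback, so the mixture tensor reaching the encoder appears to have zero samples along the time axis.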
When I skip steps such as cross validation, or work around the Conv1d() issues by providing default or empty tensors, I am able to get training and evaluation to run, but the output speaker files have a monotone, continuous beeping sound overlaid on the speaker's voice, which I assume is a result of skipping cross validation or the convolution calls.
Any help in this regard is much appreciated. If I can get this implementation working, it is an ideal fit for a social project I am working on. Please let me know if you need additional information. Thanks.