facebookresearch / svoice

We provide a PyTorch implementation of the paper Voice Separation with an Unknown Number of Multiple Speakers, in which we present a new method for separating a mixed audio sequence where multiple voices speak simultaneously. The method employs gated neural networks trained to separate the voices over multiple processing steps, while keeping the speaker assigned to each output channel fixed. A different model is trained for every possible number of speakers, and the model with the largest number of speakers is used to select the actual number of speakers in a given sample. Our method greatly outperforms the current state of the art, which, as we show, is not competitive for more than two speakers.

Validation loss starts increasing / goes to NaN #22

Open · eran-shahar opened this issue 3 years ago

eran-shahar commented 3 years ago

Hello, when trying to train your model on data from the LibriSpeech corpus (custom mixtures created by me, which work well with other models), the validation loss decreases nicely for a few epochs, then starts increasing fast until, after 10-20 epochs, the loss goes to NaN and an error occurs. Any idea what I am doing wrong? The speech data includes reverberation and noise, if that matters.

I haven't changed the config you provided much; this is the config.yaml I use:

```yaml
defaults:

# Dataset related
sample_rate: 16000
segment: 4
stride: 1       # in seconds, how much to stride between training examples
pad: true       # if training sample is too short, pad it
cv_maxlen: 8
validfull: 1    # use entire samples at valid

# Logging and printing, and does not impact training
num_prints: 5
device: cuda
num_workers: 4
verbose: 0
show: 0   # just show the model and its size and exit

# Checkpointing, by default automatically load last checkpoint
checkpoint: True
continue_from: ''  # Only pass the name of the exp, like exp_dset=wham
                   # this arg is ignored for the naming of the exp!
continue_best: True
restart: False  # Ignore existing checkpoints
checkpoint_file: checkpoint.th
history_file: history.json
samples_dir: samples

# Other stuff
seed: 2036
dummy:  # use this if you want twice the same exp, with a name

# Evaluation stuff
pesq: false  # compute pesq?
eval_every: 100
keep_last: 0

# Optimization related
optim: adam
lr: 5e-4
beta2: 0.999
stft_loss: False
stft_sc_factor: .5
stft_mag_factor: .5
epochs: 100
batch_size: 2
max_norm: 5

# learning rate scheduling
lr_sched: step  # can be either step or plateau
step:
  step_size: 2
  gamma: 0.98
plateau:
  factor: 0.5
  patience: 4

# Models
model: swave  # either demucs or dwave
swave:
  N: 128
  L: 16
  H: 128
  R: 6
  C: 2
  input_normalize: False

# Experiment launching, distributed
ddp: false
ddp_backend: nccl
rendezvous_file: ./rendezvous

# Internal config, don't set manually
rank:
world_size:

# Hydra config
hydra:
  run:
    dir: ./outputs/exp_${hydra.job.override_dirname}
  job:
    config:
      # configuration for the ${hydra.job.override_dirname} runtime variable
      override_dirname:
        kv_sep: '='
        item_sep: ','
        # Remove all paths, as the / in them would mess up things
        # Remove params that would not impact the training itself
        # Remove all slurm and submit params.
        # This is ugly I know...
        exclude_keys: [
          'hydra.job_logging.handles.file.filename',
          'dset.train', 'dset.valid', 'dset.test', 'dset.mix_json', 'dset.mix_dir',
          'num_prints', 'continue_from',
          'device', 'num_workers', 'print_freq', 'restart', 'verbose',
          'log', 'ddp', 'ddp_backend', 'rendezvous_file', 'rank', 'world_size']

  job_logging:
    handlers:
      file:
        class: logging.FileHandler
        mode: w
        formatter: colorlog
        filename: trainer.log
      console:
        class: logging.StreamHandler
        formatter: colorlog
        stream: ext://sys.stderr

  hydra_logging:
    handlers:
      console:
        class: logging.StreamHandler
        formatter: colorlog
        stream: ext://sys.stderr
```
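As an aside, `max_norm: 5` in configs like this one usually corresponds to PyTorch's global gradient-norm clipping, which is also a common first guard against loss blow-ups. Below is only a minimal generic sketch of how such a clip is typically applied per training step; the model, loss, and `train_step` here are placeholders, not svoice's actual solver code:

```python
import torch

# Placeholder model/optimizer; max_norm mirrors the config value above.
model = torch.nn.Linear(128, 128)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.999))
max_norm = 5.0

def train_step(mixture, target):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(mixture), target)  # placeholder loss
    loss.backward()
    # Clip the global gradient norm before the optimizer step; this is the usual
    # mechanism a `max_norm` setting controls.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```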

adiyoss commented 3 years ago

Hi @eran-shahar, it seems like a numerical instability issue. We did not encounter such problems when training our model (also on noisy, reverberant data). Can you please try the same training but with swave.input_normalize=True?

YaFanYen commented 3 years ago

Hi @adiyoss, I get the same issue even with swave.input_normalize=True. Is there any way to solve it?

adiyoss commented 3 years ago

Hi @YaFanYen, hard to say; I think you need to debug it. Do you see which parameters are causing the NaN? Is it due to the loss value or some other parameters?
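One generic way to do that kind of debugging in PyTorch (just a sketch, not tied to svoice's solver; `model`, `loss`, and `step` are placeholders) is to enable anomaly detection and check the loss, parameters, and gradients each step:

```python
import torch

# Raise an error at the op that produced a NaN/Inf in the backward pass
# (slow, so enable only while debugging).
torch.autograd.set_detect_anomaly(True)

def check_finite(model, loss, step):
    """Report which tensor goes non-finite first: the loss, a weight, or a gradient."""
    if not torch.isfinite(loss):
        print(f"step {step}: loss is non-finite: {loss.item()}")
    for name, p in model.named_parameters():
        if not torch.isfinite(p).all():
            print(f"step {step}: parameter {name} contains NaN/Inf")
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print(f"step {step}: gradient of {name} contains NaN/Inf")
```

Calling `check_finite(model, loss, step)` right after `loss.backward()` usually narrows the failure down to the loss itself versus a specific layer's weights or gradients.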

CardLin commented 2 years ago

I've got a similar issue here. I think it is related to resuming training. When I first ran training, the loss went down to negative values normally, but training crashed during epoch 4. When I resumed training from epoch 3, the loss started increasing and made the model unusable.

```
[2021-11-11 08:29:00,151][__main__][INFO] - Running on host AI3
[2021-11-11 08:29:13,317][svoice.solver][INFO] - Loading checkpoint model: checkpoint.th
[2021-11-11 08:29:13,443][svoice.solver][INFO] - Replaying metrics from previous run
[2021-11-11 08:29:13,443][svoice.solver][INFO] - Epoch 0: train=-2.86717 valid=-15.02999 best=-15.02999
[2021-11-11 08:29:13,443][svoice.solver][INFO] - Epoch 1: train=-5.26632 valid=-16.74150 best=-16.74150
[2021-11-11 08:29:13,443][svoice.solver][INFO] - Epoch 2: train=-5.89136 valid=-17.62505 best=-17.62505
[2021-11-11 08:29:13,444][svoice.solver][INFO] - ----------------------------------------------------------------------
[2021-11-11 08:29:13,444][svoice.solver][INFO] - Training...
[2021-11-11 17:17:06,027][svoice.solver][INFO] - Train | Epoch 4 | 40739/203699 | 1.3 it/sec | Loss -6.02060
[2021-11-12 02:05:04,463][svoice.solver][INFO] - Train | Epoch 4 | 81478/203699 | 1.3 it/sec | Loss -5.54255
[2021-11-12 10:53:21,507][svoice.solver][INFO] - Train | Epoch 4 | 122217/203699 | 1.3 it/sec | Loss 1.48981
[2021-11-12 19:41:24,288][svoice.solver][INFO] - Train | Epoch 4 | 162956/203699 | 1.3 it/sec | Loss 5.29561
[2021-11-13 04:29:36,752][svoice.solver][INFO] - Train | Epoch 4 | 203695/203699 | 1.3 it/sec | Loss 7.56894
[2021-11-13 04:29:39,775][svoice.solver][INFO] - Train Summary | End of Epoch 4 | Time 158426.33s | Train Loss 7.56921
[2021-11-13 04:29:39,775][svoice.solver][INFO] - ----------------------------------------------------------------------
[2021-11-13 04:29:39,775][svoice.solver][INFO] - Cross validation...
[2021-11-13 04:32:27,829][svoice.solver][INFO] - Valid | Epoch 4 | 600/3000 | 3.6 it/sec | Loss 24.79471
[2021-11-13 04:34:36,808][svoice.solver][INFO] - Valid | Epoch 4 | 1200/3000 | 4.0 it/sec | Loss 24.63897
[2021-11-13 04:36:25,386][svoice.solver][INFO] - Valid | Epoch 4 | 1800/3000 | 4.4 it/sec | Loss 24.49754
[2021-11-13 04:37:59,553][svoice.solver][INFO] - Valid | Epoch 4 | 2400/3000 | 4.8 it/sec | Loss 24.35143
[2021-11-13 04:39:18,205][svoice.solver][INFO] - Valid | Epoch 4 | 3000/3000 | 5.2 it/sec | Loss 24.23823
[2021-11-13 04:39:18,205][svoice.solver][INFO] - Valid Summary | End of Epoch 4 | Time 159004.76s | Valid Loss 24.23823
[2021-11-13 04:39:18,206][svoice.solver][INFO] - Learning rate adjusted: 0.00049
[2021-11-13 04:39:18,206][svoice.solver][INFO] - ----------------------------------------------------------------------
[2021-11-13 04:39:18,206][svoice.solver][INFO] - Overall Summary | Epoch 4 | Train 7.56921 | Valid 24.23823 | Best -17.62505
```
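Since the jump only appears after resuming, one thing worth ruling out is whether the optimizer and LR-scheduler state are restored along with the model weights. I don't know exactly what svoice's checkpoint.th stores, but a generic PyTorch resume that keeps Adam's moment estimates looks roughly like this (all names below are placeholders):

```python
import torch

def save_checkpoint(path, model, optimizer, scheduler, epoch):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),   # Adam's moment estimates live here
        "scheduler": scheduler.state_dict(),
        "epoch": epoch,
    }, path)

def load_checkpoint(path, model, optimizer, scheduler):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    # Restoring only the weights but not the optimizer/scheduler state can make
    # the first epochs after a resume behave very differently from before the crash.
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["epoch"] + 1
```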

qalabeabbas49 commented 2 years ago

Hi, I am facing the same issue here. Any solution?