eran-shahar opened this issue 3 years ago
Hi @eran-shahar,
It seems like a numerical instability issue. We did not encounter such problems when training our model (also on noisy reverberant data).
Can you please try the same training, but with swave.input_normalize=True?
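(For reference, with the Hydra config used here this override can be passed straight on the training command line, e.g., assuming the repo's standard train.py entry point:

```
python train.py swave.input_normalize=True
```
)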
Hi @adiyoss,
I still get the same issue even with swave.input_normalize=True set.
Is there any way to solve it?
Hi @YaFanYen, hard to say; I think you need to debug it. Do you see which parameters are causing the NaN? Is it due to the loss value or some other parameters?
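For anyone debugging this, here is a minimal, generic PyTorch sketch (not svoice's actual solver code; `model` and `loss` stand in for whatever the training loop produces) for locating where the NaNs first show up:

```python
import torch

# Raises an error on the backward op that first produces NaN/Inf,
# which usually points at the responsible layer or loss term.
torch.autograd.set_detect_anomaly(True)

def report_nans(model, loss):
    """Print which tensors contain NaN/Inf after loss.backward()."""
    if not torch.isfinite(loss).all():
        print(f"loss is non-finite: {loss.item()}")
    for name, p in model.named_parameters():
        if not torch.isfinite(p).all():
            print(f"parameter {name} contains NaN/Inf")
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print(f"gradient of {name} contains NaN/Inf")
```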
I've got a similar issue here. I think this is a problem related to resuming training. When I first ran training, the loss went down to negative values normally, but the training crashed on epoch 04... When I resumed training from epoch 03, the loss started increasing and made the model unusable.
[2021-11-11 08:29:00,151][__main__][INFO] - Running on host AI3
[2021-11-11 08:29:13,317][svoice.solver][INFO] - Loading checkpoint model: checkpoint.th
[2021-11-11 08:29:13,443][svoice.solver][INFO] - Replaying metrics from previous run
[2021-11-11 08:29:13,443][svoice.solver][INFO] - Epoch 0: train=-2.86717 valid=-15.02999 best=-15.02999
[2021-11-11 08:29:13,443][svoice.solver][INFO] - Epoch 1: train=-5.26632 valid=-16.74150 best=-16.74150
[2021-11-11 08:29:13,443][svoice.solver][INFO] - Epoch 2: train=-5.89136 valid=-17.62505 best=-17.62505
[2021-11-11 08:29:13,444][svoice.solver][INFO] - ----------------------------------------------------------------------
[2021-11-11 08:29:13,444][svoice.solver][INFO] - Training...
[2021-11-11 17:17:06,027][svoice.solver][INFO] - Train | Epoch 4 | 40739/203699 | 1.3 it/sec | Loss -6.02060
[2021-11-12 02:05:04,463][svoice.solver][INFO] - Train | Epoch 4 | 81478/203699 | 1.3 it/sec | Loss -5.54255
[2021-11-12 10:53:21,507][svoice.solver][INFO] - Train | Epoch 4 | 122217/203699 | 1.3 it/sec | Loss 1.48981
[2021-11-12 19:41:24,288][svoice.solver][INFO] - Train | Epoch 4 | 162956/203699 | 1.3 it/sec | Loss 5.29561
[2021-11-13 04:29:36,752][svoice.solver][INFO] - Train | Epoch 4 | 203695/203699 | 1.3 it/sec | Loss 7.56894
[2021-11-13 04:29:39,775][svoice.solver][INFO] - Train Summary | End of Epoch 4 | Time 158426.33s | Train Loss 7.56921
[2021-11-13 04:29:39,775][svoice.solver][INFO] - ----------------------------------------------------------------------
[2021-11-13 04:29:39,775][svoice.solver][INFO] - Cross validation...
[2021-11-13 04:32:27,829][svoice.solver][INFO] - Valid | Epoch 4 | 600/3000 | 3.6 it/sec | Loss 24.79471
[2021-11-13 04:34:36,808][svoice.solver][INFO] - Valid | Epoch 4 | 1200/3000 | 4.0 it/sec | Loss 24.63897
[2021-11-13 04:36:25,386][svoice.solver][INFO] - Valid | Epoch 4 | 1800/3000 | 4.4 it/sec | Loss 24.49754
[2021-11-13 04:37:59,553][svoice.solver][INFO] - Valid | Epoch 4 | 2400/3000 | 4.8 it/sec | Loss 24.35143
[2021-11-13 04:39:18,205][svoice.solver][INFO] - Valid | Epoch 4 | 3000/3000 | 5.2 it/sec | Loss 24.23823
[2021-11-13 04:39:18,205][svoice.solver][INFO] - Valid Summary | End of Epoch 4 | Time 159004.76s | Valid Loss 24.23823
[2021-11-13 04:39:18,206][svoice.solver][INFO] - Learning rate adjusted: 0.00049
[2021-11-13 04:39:18,206][svoice.solver][INFO] - ----------------------------------------------------------------------
[2021-11-13 04:39:18,206][svoice.solver][INFO] - Overall Summary | Epoch 4 | Train 7.56921 | Valid 24.23823 | Best -17.62505
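If the problem really is resume-related, one thing worth checking is whether the checkpoint restores everything the optimizer needs. Below is a generic PyTorch sketch (not the actual svoice solver code) of what a full resume typically has to restore: model weights, optimizer state (Adam's moment estimates), and LR-scheduler state. If any of these is missing from the checkpoint, the loss can behave very differently after a restart.

```python
import torch

def save_checkpoint(path, model, optimizer, scheduler, epoch):
    # Persist everything needed to continue training, not just the weights.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "epoch": epoch,
    }, path)

def load_checkpoint(path, model, optimizer, scheduler):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])  # Adam moment estimates matter here
    scheduler.load_state_dict(ckpt["scheduler"])  # keeps the stepped LR in sync
    return ckpt["epoch"] + 1  # epoch to resume from
```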
Hi, I am facing the same issue here. Any solution?
Hello, when trying to train your model on data from the LibriSpeech corpus (a custom set created by me, which works well with other models), the validation loss decreases well for a few epochs, then starts increasing fast until, after 10-20 epochs, the loss goes to NaN and an error occurs. Any idea what I am doing wrong? The speech data includes reverberation and noise, if it matters.
I haven't changed much in the config you provided; this is the config.yaml I use:
defaults:

# Dataset related
sample_rate: 16000
segment: 4
stride: 1     # in seconds, how much to stride between training examples
pad: true     # if training sample is too short, pad it
cv_maxlen: 8
validfull: 1  # use entire samples at valid

# Logging and printing, and does not impact training
num_prints: 5
device: cuda
num_workers: 4
verbose: 0
show: 0   # just show the model and its size and exit

# Checkpointing, by default automatically load last checkpoint
checkpoint: True
continue_from: ''  # Only pass the name of the exp, like `exp_dset=wham`,
                   # this arg is ignored for the naming of the exp!
continue_best: True
restart: False  # Ignore existing checkpoints
checkpoint_file: checkpoint.th
history_file: history.json
samples_dir: samples

# Other stuff
seed: 2036
dummy:  # use this if you want twice the same exp, with a name

# Evaluation stuff
pesq: false  # compute pesq?
eval_every: 100
keep_last: 0

# Optimization related
optim: adam
lr: 5e-4
beta2: 0.999
stft_loss: False
stft_sc_factor: .5
stft_mag_factor: .5
epochs: 100
batch_size: 2
max_norm: 5
# learning rate scheduling
lr_sched: step  # can be either step or plateau
step:
  step_size: 2
  gamma: 0.98
plateau:
  factor: 0.5
  patience: 4

# Models
model: swave  # either demucs or dwave
swave:
  N: 128
  L: 16
  H: 128
  R: 6
  C: 2
  input_normalize: False

# Experiment launching, distributed
ddp: false
ddp_backend: nccl
rendezvous_file: ./rendezvous

# Internal config, don't set manually
rank:
world_size:

# Hydra config
hydra:
  run:
    dir: ./outputs/exp_${hydra.job.override_dirname}
  job:
    config:
      # configuration for the ${hydra.job.override_dirname} runtime variable

  job_logging:
    handlers:
      file:
        class: logging.FileHandler
        mode: w
        formatter: colorlog
        filename: trainer.log
      console:
        class: logging.StreamHandler
        formatter: colorlog
        stream: ext://sys.stderr

  hydra_logging:
    handlers:
      console:
        class: logging.StreamHandler
        formatter: colorlog
        stream: ext://sys.stderr
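As a side note on the max_norm: 5 entry above: in a typical PyTorch training loop this corresponds to clipping the global gradient norm before the optimizer step, which bounds how far a single bad batch can push the weights but does not by itself prevent NaNs that originate in the loss computation. A minimal generic sketch (not the actual svoice solver code):

```python
import torch

MAX_NORM = 5  # mirrors the `max_norm: 5` entry in the config above

def training_step(model, optimizer, loss):
    optimizer.zero_grad()
    loss.backward()
    # Clip the global gradient norm before stepping the optimizer.
    torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_NORM)
    optimizer.step()
```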