eran-shahar opened this issue 3 years ago
Hi @eran-shahar,
It seems like a numerical instability issue. We did not encounter such problems when training our model (also on noisy reverberant data).
Can you please try the same training, but with swave.input_normalize=True?
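(For reference, with the Hydra config used here this override can be passed straight on the training command line, e.g., assuming the repo's standard train.py entry point:

```
python train.py swave.input_normalize=True
```
)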
Hi @adiyoss,
I still get the same issue even with swave.input_normalize=True set.
Is there any way to solve it?
Hi @YaFanYen, hard to say; I think you need to debug it. Do you see which parameters are causing the NaN? Is it due to the loss value or some other parameters?
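For anyone debugging this, here is a minimal, generic PyTorch sketch (not svoice's actual solver code; `model` and `loss` stand in for whatever the training loop produces) for locating where the NaNs first show up:

```python
import torch

# Raises an error on the backward op that first produces NaN/Inf,
# which usually points at the responsible layer or loss term.
torch.autograd.set_detect_anomaly(True)

def report_nans(model, loss):
    """Print which tensors contain NaN/Inf after loss.backward()."""
    if not torch.isfinite(loss).all():
        print(f"loss is non-finite: {loss.item()}")
    for name, p in model.named_parameters():
        if not torch.isfinite(p).all():
            print(f"parameter {name} contains NaN/Inf")
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print(f"gradient of {name} contains NaN/Inf")
```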
I've got a similar issue here. I think this is a problem related to resuming training. When I first ran training, the loss went down to negative values normally, but the training crashed on epoch 04... When I resumed training from epoch 03, the loss started increasing and made the model unusable.
[2021-11-11 08:29:00,151][__main__][INFO] - Running on host AI3
[2021-11-11 08:29:13,317][svoice.solver][INFO] - Loading checkpoint model: checkpoint.th
[2021-11-11 08:29:13,443][svoice.solver][INFO] - Replaying metrics from previous run
[2021-11-11 08:29:13,443][svoice.solver][INFO] - Epoch 0: train=-2.86717 valid=-15.02999 best=-15.02999
[2021-11-11 08:29:13,443][svoice.solver][INFO] - Epoch 1: train=-5.26632 valid=-16.74150 best=-16.74150
[2021-11-11 08:29:13,443][svoice.solver][INFO] - Epoch 2: train=-5.89136 valid=-17.62505 best=-17.62505
[2021-11-11 08:29:13,444][svoice.solver][INFO] - ----------------------------------------------------------------------
[2021-11-11 08:29:13,444][svoice.solver][INFO] - Training...
[2021-11-11 17:17:06,027][svoice.solver][INFO] - Train | Epoch 4 | 40739/203699 | 1.3 it/sec | Loss -6.02060
[2021-11-12 02:05:04,463][svoice.solver][INFO] - Train | Epoch 4 | 81478/203699 | 1.3 it/sec | Loss -5.54255
[2021-11-12 10:53:21,507][svoice.solver][INFO] - Train | Epoch 4 | 122217/203699 | 1.3 it/sec | Loss 1.48981
[2021-11-12 19:41:24,288][svoice.solver][INFO] - Train | Epoch 4 | 162956/203699 | 1.3 it/sec | Loss 5.29561
[2021-11-13 04:29:36,752][svoice.solver][INFO] - Train | Epoch 4 | 203695/203699 | 1.3 it/sec | Loss 7.56894
[2021-11-13 04:29:39,775][svoice.solver][INFO] - Train Summary | End of Epoch 4 | Time 158426.33s | Train Loss 7.56921
[2021-11-13 04:29:39,775][svoice.solver][INFO] - ----------------------------------------------------------------------
[2021-11-13 04:29:39,775][svoice.solver][INFO] - Cross validation...
[2021-11-13 04:32:27,829][svoice.solver][INFO] - Valid | Epoch 4 | 600/3000 | 3.6 it/sec | Loss 24.79471
[2021-11-13 04:34:36,808][svoice.solver][INFO] - Valid | Epoch 4 | 1200/3000 | 4.0 it/sec | Loss 24.63897
[2021-11-13 04:36:25,386][svoice.solver][INFO] - Valid | Epoch 4 | 1800/3000 | 4.4 it/sec | Loss 24.49754
[2021-11-13 04:37:59,553][svoice.solver][INFO] - Valid | Epoch 4 | 2400/3000 | 4.8 it/sec | Loss 24.35143
[2021-11-13 04:39:18,205][svoice.solver][INFO] - Valid | Epoch 4 | 3000/3000 | 5.2 it/sec | Loss 24.23823
[2021-11-13 04:39:18,205][svoice.solver][INFO] - Valid Summary | End of Epoch 4 | Time 159004.76s | Valid Loss 24.23823
[2021-11-13 04:39:18,206][svoice.solver][INFO] - Learning rate adjusted: 0.00049
[2021-11-13 04:39:18,206][svoice.solver][INFO] - ----------------------------------------------------------------------
[2021-11-13 04:39:18,206][svoice.solver][INFO] - Overall Summary | Epoch 4 | Train 7.56921 | Valid 24.23823 | Best -17.62505
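If the problem really is resume-related, one thing worth checking is whether the checkpoint restores everything the optimizer needs. Below is a generic PyTorch sketch (not the actual svoice solver code) of what a full resume typically has to restore: model weights, optimizer state (Adam's moment estimates), and LR-scheduler state. If any of these is missing from the checkpoint, the loss can behave very differently after a restart.

```python
import torch

def save_checkpoint(path, model, optimizer, scheduler, epoch):
    # Persist everything needed to continue training, not just the weights.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "epoch": epoch,
    }, path)

def load_checkpoint(path, model, optimizer, scheduler):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])  # Adam moment estimates matter here
    scheduler.load_state_dict(ckpt["scheduler"])  # keeps the stepped LR in sync
    return ckpt["epoch"] + 1  # epoch to resume from
```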
Hi, I am facing the same issue here. Any solution?
Hello, when trying to train your model on data from the LibriSpeech corpus (a custom set created by me, which works well with other models), the validation loss decreases well for a few epochs, then starts increasing fast until, after 10-20 epochs, the loss goes to NaN and an error occurs. Any idea what I am doing wrong? The speech data includes reverberation and noise, if it matters.
I haven't changed much in the config you provided; this is the config.yaml I use:
defaults:

# Dataset related
sample_rate: 16000
segment: 4
stride: 1     # in seconds, how much to stride between training examples
pad: true     # if training sample is too short, pad it
cv_maxlen: 8
validfull: 1  # use entire samples at valid

# Logging and printing, and does not impact training
num_prints: 5
device: cuda
num_workers: 4
verbose: 0
show: 0   # just show the model and its size and exit

# Checkpointing, by default automatically load last checkpoint
checkpoint: True
continue_from: ''  # Only pass the name of the exp, like `exp_dset=wham`,
                   # this arg is ignored for the naming of the exp!
continue_best: True
restart: False  # Ignore existing checkpoints
checkpoint_file: checkpoint.th
history_file: history.json
samples_dir: samples

# Other stuff
seed: 2036
dummy:  # use this if you want twice the same exp, with a name

# Evaluation stuff
pesq: false  # compute pesq?
eval_every: 100
keep_last: 0

# Optimization related
optim: adam
lr: 5e-4
beta2: 0.999
stft_loss: False
stft_sc_factor: .5
stft_mag_factor: .5
epochs: 100
batch_size: 2
max_norm: 5
# learning rate scheduling
lr_sched: step  # can be either step or plateau
step:
  step_size: 2
  gamma: 0.98
plateau:
  factor: 0.5
  patience: 4

# Models
model: swave  # either demucs or dwave
swave:
  N: 128
  L: 16
  H: 128
  R: 6
  C: 2
  input_normalize: False

# Experiment launching, distributed
ddp: false
ddp_backend: nccl
rendezvous_file: ./rendezvous

# Internal config, don't set manually
rank:
world_size:

# Hydra config
hydra:
  run:
    dir: ./outputs/exp_${hydra.job.override_dirname}
  job:
    config:
      # configuration for the ${hydra.job.override_dirname} runtime variable

  job_logging:
    handlers:
      file:
        class: logging.FileHandler
        mode: w
        formatter: colorlog
        filename: trainer.log
      console:
        class: logging.StreamHandler
        formatter: colorlog
        stream: ext://sys.stderr

  hydra_logging:
    handlers:
      console:
        class: logging.StreamHandler
        formatter: colorlog
        stream: ext://sys.stderr
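As a side note on the max_norm: 5 entry above: in a typical PyTorch training loop this corresponds to clipping the global gradient norm before the optimizer step, which bounds how far a single bad batch can push the weights but does not by itself prevent NaNs that originate in the loss computation. A minimal generic sketch (not the actual svoice solver code):

```python
import torch

MAX_NORM = 5  # mirrors the `max_norm: 5` entry in the config above

def training_step(model, optimizer, loss):
    optimizer.zero_grad()
    loss.backward()
    # Clip the global gradient norm before stepping the optimizer.
    torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_NORM)
    optimizer.step()
```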