Rayhane-mamah / Tacotron-2

DeepMind's Tacotron-2 Tensorflow implementation
MIT License

41k Steps Not Aligned during training #430

Open · MrBreadWater opened 5 years ago

MrBreadWater commented 5 years ago

Hello, I've been training this for quite a while on a custom, LJSpeech-like dataset, but I'm still seeing no alignment. I've tried a few hyperparameter changes suggested in other threads here, without much improvement. Here's my most recent eval plot:

[Image: step-41000-eval-align — eval attention alignment plot at step 41000]

The sound quality isn't bad, if a bit metallic. How should I go about fixing the alignment? (Each file in my dataset is ~5-7 seconds long.)
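(As an aside, the diffuseness in a plot like this can be quantified with a simple focus-rate check. The sketch below is minimal and assumes the eval attention weights are available as a NumPy array of shape `(decoder_steps, encoder_steps)`; the repo only writes the plot image by default, so the `.npy` path is hypothetical.)

```python
import numpy as np

# Hypothetical path: assumes the eval attention weights were also dumped
# as a .npy array of shape (decoder_steps, encoder_steps).
alignment = np.load("logs-Tacotron-2/eval-align/step-41000-align.npy")

# "Focus rate": the mean of the max attention weight at each decoder step.
# A sharp diagonal alignment scores close to 1.0; a diffuse, unaligned
# attention map like the plot above stays well below that.
focus = alignment.max(axis=-1).mean()
print(f"focus rate: {focus:.3f}")
```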

Here are my hparams:

```
  NN_init: True
  NN_scaler: 0.3
  allow_clipping_in_normalization: True
  attention_dim: 128
  attention_filters: 32
  attention_kernel: (31,)
  attention_win_size: 7
  batch_norm_position: after
  cbhg_conv_channels: 128
  cbhg_highway_units: 128
  cbhg_highwaynet_layers: 4
  cbhg_kernels: 8
  cbhg_pool_size: 2
  cbhg_projection: 256
  cbhg_projection_kernel_size: 3
  cbhg_rnn_units: 128
  cdf_loss: False
  cin_channels: 80
  cleaners: english_cleaners
  clip_for_wavenet: True
  clip_mels_length: True
  clip_outputs: True
  cross_entropy_pos_weight: 1
  cumulative_weights: True
  decoder_layers: 2
  decoder_lstm_units: 1024
  embedding_dim: 512
  enc_conv_channels: 512
  enc_conv_kernel_size: (5,)
  enc_conv_num_layers: 3
  encoder_lstm_units: 256
  fmax: 7600
  fmin: 95
  frame_shift_ms: None
  freq_axis_kernel_size: 3
  gate_channels: 256
  gin_channels: -1
  griffin_lim_iters: 60
  hop_size: 275
  input_type: raw
  kernel_size: 3
  layers: 20
  leaky_alpha: 0.4
  legacy: True
  log_scale_min: -32.23619130191664
  log_scale_min_gauss: -16.11809565095832
  lower_bound_decay: 0.1
  magnitude_power: 2.0
  mask_decoder: False
  mask_encoder: True
  max_abs_value: 4.0
  max_iters: 10000
  max_mel_frames: 900
  max_time_sec: None
  max_time_steps: 11000
  min_level_db: -100
  n_fft: 2048
  n_speakers: 5
  normalize_for_wavenet: True
  num_freq: 1025
  num_mels: 80
  out_channels: 2
  outputs_per_step: 2
  postnet_channels: 512
  postnet_kernel_size: (5,)
  postnet_num_layers: 5
  power: 1.5
  predict_linear: True
  preemphasis: 0.97
  preemphasize: True
  prenet_layers: [256, 256]
  quantize_channels: 65536
  ref_level_db: 20
  rescale: True
  rescaling_max: 0.999
  residual_channels: 128
  residual_legacy: True
  sample_rate: 22050
  signal_normalization: True
  silence_threshold: 2
  skip_out_channels: 128
  smoothing: False
  speakers: ['speaker0', 'speaker1', 'speaker2', 'speaker3', 'speaker4']
  speakers_path: None
  split_on_cpu: True
  stacks: 2
  stop_at_any: True
  symmetric_mels: True
  synthesis_constraint: False
  synthesis_constraint_type: window
  tacotron_adam_beta1: 0.9
  tacotron_adam_beta2: 0.999
  tacotron_adam_epsilon: 1e-06
  tacotron_batch_size: 32
  tacotron_clip_gradients: True
  tacotron_data_random_state: 1234
  tacotron_decay_learning_rate: True
  tacotron_decay_rate: 0.5
  tacotron_decay_steps: 18000
  tacotron_dropout_rate: 0.5
  tacotron_final_learning_rate: 0.0001
  tacotron_fine_tuning: False
  tacotron_initial_learning_rate: 0.001
  tacotron_natural_eval: False
  tacotron_num_gpus: 1
  tacotron_random_seed: 5339
  tacotron_reg_weight: 1e-06
  tacotron_scale_regularization: False
  tacotron_start_decay: 40000
  tacotron_swap_with_cpu: False
  tacotron_synthesis_batch_size: 1
  tacotron_teacher_forcing_decay_alpha: None
  tacotron_teacher_forcing_decay_steps: 40000
  tacotron_teacher_forcing_final_ratio: 0.0
  tacotron_teacher_forcing_init_ratio: 1.0
  tacotron_teacher_forcing_mode: constant
  tacotron_teacher_forcing_ratio: 1.0
  tacotron_teacher_forcing_start_decay: 10000
  tacotron_test_batches: None
  tacotron_test_size: 0.05
  tacotron_zoneout_rate: 0.1
  train_with_GTA: True
  trim_fft_size: 2048
  trim_hop_size: 512
  trim_silence: True
  trim_top_db: 40
  upsample_activation: Relu
  upsample_scales: [11, 25]
  upsample_type: SubPixel
  use_bias: True
  use_lws: False
  use_speaker_embedding: True
  wavenet_adam_beta1: 0.9
  wavenet_adam_beta2: 0.999
  wavenet_adam_epsilon: 1e-06
  wavenet_batch_size: 8
  wavenet_clip_gradients: True
  wavenet_data_random_state: 1234
  wavenet_debug_mels: ['training_data/mels/mel-LJ001-0008.npy']
  wavenet_debug_wavs: ['training_data/audio/audio-LJ001-0008.npy']
  wavenet_decay_rate: 0.5
  wavenet_decay_steps: 200000
  wavenet_dropout: 0.05
  wavenet_ema_decay: 0.9999
  wavenet_gradient_max_norm: 100.0
  wavenet_gradient_max_value: 5.0
  wavenet_init_scale: 1.0
  wavenet_learning_rate: 0.001
  wavenet_lr_schedule: exponential
  wavenet_natural_eval: False
  wavenet_num_gpus: 1
  wavenet_pad_sides: 1
  wavenet_random_seed: 5339
  wavenet_swap_with_cpu: False
  wavenet_synth_debug: False
  wavenet_synthesis_batch_size: 20
  wavenet_test_batches: 1
  wavenet_test_size: None
  wavenet_warmup: 4000.0
  wavenet_weight_normalization: False
  win_size: 1100
```
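
Since the dataset is custom, a quick check that the audio actually matches these values may help rule out preprocessing mismatches. This is a minimal sketch, assuming an LJSpeech-style `wavs/` directory; the path and the librosa usage are illustrative, not part of the repo's preprocessing.

```python
import glob
import librosa

# Values taken from the hparams above.
SAMPLE_RATE = 22050   # sample_rate
HOP_SIZE = 275        # hop_size
MAX_MEL_FRAMES = 900  # max_mel_frames

for path in sorted(glob.glob("wavs/*.wav"))[:10]:  # first few files for brevity
    y, sr = librosa.load(path, sr=None)  # sr=None keeps the file's native rate
    n_frames = 1 + len(y) // HOP_SIZE    # rough mel frame count
    print(f"{path}: {sr} Hz, {len(y) / sr:.2f} s, ~{n_frames} frames")
    if sr != SAMPLE_RATE:
        print(f"  mismatch: expected {SAMPLE_RATE} Hz, consider resampling")
    if n_frames > MAX_MEL_FRAMES:
        print(f"  over max_mel_frames={MAX_MEL_FRAMES}: clip_mels_length drops it")
```

At ~5-7 s per file, clips land around 400-560 frames at this hop size, comfortably under max_mel_frames, so a sample-rate mismatch would be the more likely preprocessing issue to rule out.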