facebookresearch / AudioDec

An Open-source Streaming High-fidelity Neural Audio Codec

vq_loss increases, does not converge #17

Closed lixinghe1999 closed 4 months ago

lixinghe1999 commented 7 months ago

I am working on my own dataset, which has 2 channels and a sampling rate of 16000. I paste my config file below; the major changes I made are: 1) sampling rate, 2) data path, 3) input/output channels.

sampling_rate: &sampling_rate 16000
data:
    path: "../ABCS/Audio"
    subset:
        train: "train"
        valid: "dev"
        test:  "test"

###########################################################
#                   MODEL SETTING                         #
###########################################################
model_type: symAudioDec
train_mode: autoencoder
paradigm: efficient

generator_params:
    input_channels: 2
    output_channels: 2 
    encode_channels: 32
    decode_channels: 32
    code_dim: 64
    codebook_num: 8
    codebook_size: 1024
    bias: true
    enc_ratios: [2, 4, 8, 16]
    dec_ratios: [16, 8, 4, 2]
    enc_strides: [3, 4, 5, 5]
    dec_strides: [5, 5, 4, 3]
    mode: 'causal'
    codec: 'audiodec'
    projector: 'conv1d'
    quantier: 'residual_vq'

discriminator_params:
    scales: 3                              # Number of scales for the multi-scale discriminator.
    scale_downsample_pooling: "AvgPool1d"  # Pooling operation for scale discriminator.
    scale_downsample_pooling_params:
        kernel_size: 4                     # Pooling kernel size.
        stride: 2                          # Pooling stride.
        padding: 2                         # Padding size.
    scale_discriminator_params:
        in_channels: 1                     # Number of input channels.
        out_channels: 1                    # Number of output channels.
        kernel_sizes: [15, 41, 5, 3]       # List of kernel sizes.
        channels: 128                      # Initial number of channels.
        max_downsample_channels: 1024      # Maximum number of channels in downsampling conv layers.
        max_groups: 16                     # Maximum number of groups in downsampling conv layers.
        bias: true
        downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales.
        nonlinear_activation: "LeakyReLU"  # Nonlinear activation.
        nonlinear_activation_params:
            negative_slope: 0.1
    follow_official_norm: true             # Whether to follow the official norm setting.
    periods: [2, 3, 5, 7, 11]              # List of periods for the multi-period discriminator.
    period_discriminator_params:
        in_channels: 1                     # Number of input channels.
        out_channels: 1                    # Number of output channels.
        kernel_sizes: [5, 3]               # List of kernel sizes.
        channels: 32                       # Initial number of channels.
        downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales.
        max_downsample_channels: 1024      # Maximum number of channels in downsampling conv layers.
        bias: true                         # Whether to use bias parameter in conv layer.
        nonlinear_activation: "LeakyReLU"  # Nonlinear activation.
        nonlinear_activation_params:       # Nonlinear activation parameters.
            negative_slope: 0.1
        use_weight_norm: true              # Whether to apply weight normalization.
        use_spectral_norm: false           # Whether to apply spectral normalization.

###########################################################
#                 METRIC LOSS SETTING                     #
###########################################################
use_mel_loss: true                   # Whether to use Mel-spectrogram loss.
mel_loss_params:
    fs: *sampling_rate
    fft_sizes: [2048]
    hop_sizes: [300]
    win_lengths: [2048]
    window: "hann_window"
    num_mels: 80
    fmin: 0
    fmax: 12000
    log_base: null

use_stft_loss: false                 # Whether to use multi-resolution STFT loss.
stft_loss_params:
    fft_sizes: [1024, 2048, 512]     # List of FFT size for STFT-based loss.
    hop_sizes: [120, 240, 50]        # List of hop size for STFT-based loss
    win_lengths: [600, 1200, 240]    # List of window length for STFT-based loss.
    window: "hann_window"            # Window function for STFT-based loss

use_shape_loss: false                # Whether to use waveform shape loss.
shape_loss_params:
    winlen: [300]

###########################################################
#                  ADV LOSS SETTING                       #
###########################################################
generator_adv_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.

discriminator_adv_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.

use_feat_match_loss: true
feat_match_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.
    average_by_layers: false         # Whether to average loss by #layers in each discriminator.
    include_final_outputs: false     # Whether to include final outputs in feat match loss calculation.

###########################################################
#                  LOSS WEIGHT SETTING                    #
###########################################################
lambda_adv: 0.1          # Loss weight of adversarial loss.
lambda_feat_match: 2.0   # Loss weight of feat match loss.
lambda_vq_loss: 1.0      # Loss weight of vector quantize loss.
lambda_mel_loss: 45.0    # Loss weight of Mel-spectrogram loss.
lambda_stft_loss: 45.0   # Loss weight of multi-resolution stft loss.
lambda_shape_loss: 45.0  # Loss weight of multi-window shape loss.

###########################################################
#                  DATA LOADER SETTING                    #
###########################################################
batch_size: 64              # Batch size.
batch_length: 9600          # Length of each audio clip in a batch (training w/o adv). Must be divisible by hop_size.
adv_batch_length: 9600      # Length of each audio clip in a batch (training w/ adv). Must be divisible by hop_size.
pin_memory: true            # Whether to pin memory in Pytorch DataLoader.
num_workers: 8              # Number of workers in Pytorch DataLoader.

###########################################################
#             OPTIMIZER & SCHEDULER SETTING               #
###########################################################
generator_optimizer_type: Adam
generator_optimizer_params:
    lr: 1.0e-4
    betas: [0.5, 0.9]
    weight_decay: 0.0
generator_scheduler_type: StepLR
generator_scheduler_params:
    step_size: 200000      # Generator's scheduler step size.
    gamma: 1.0
generator_grad_norm: -1
discriminator_optimizer_type: Adam
discriminator_optimizer_params:
    lr: 2.0e-4
    betas: [0.5, 0.9]
    weight_decay: 0.0
discriminator_scheduler_type: MultiStepLR
discriminator_scheduler_params:
    gamma: 0.5
    milestones:
        - 200000
        - 400000
        - 600000
        - 800000
discriminator_grad_norm: -1

###########################################################
#                    INTERVAL SETTING                     #
###########################################################
start_steps:                       # Number of steps to start training
    generator: 0
    discriminator: 500000 
train_max_steps: 500000            # Number of training steps. (w/o adv)
adv_train_max_steps: 1000000       # Number of training steps. (w/ adv)
save_interval_steps: 100000        # Interval steps to save checkpoint.
eval_interval_steps: 1000          # Interval steps to evaluate the network.
log_interval_steps: 100            # Interval steps to record the training log.

1) In stage 1 training (<500k steps), the mel_loss seems reasonable, but the vq_loss gets larger and larger, which seems weird.


2) In stage 2 training, my mel_loss goes much higher. Is the reason that 1) I set the wrong lambda_adv, or 2) a problem caused by the bad vq_loss? What is the recommended way to address it?

Thank you in advance!

bigpon commented 7 months ago

Hi, the vq_loss becoming higher during training is normal, since the encoder usually outputs white-noise-like latents at the beginning. Once the encoder starts to learn something meaningful, the latents become harder for the quantizer to reconstruct, resulting in a higher vq_loss.

The mel_loss will also become higher during GAN training, since the objective of GAN training is to fool the discriminator, not to reduce the mel loss.

However, if the vq_loss or mel_loss does not converge at all, that is a problem. Given your settings, I think the temporal downsampling ratio might be too high (enc_strides: [3, 4, 5, 5] and dec_strides: [5, 5, 4, 3] give a downsampling ratio of 3 × 4 × 5 × 5 = 300).

Using a smaller temporal downsampling ratio may ease the problem (for example, enc_strides: [2, 3, 4, 5] and dec_strides: [5, 4, 3, 2], which give a ratio of 120).
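
To make the arithmetic concrete, here is a minimal sketch (my own illustration, not code from the repo, assuming the total temporal downsampling ratio is simply the product of enc_strides):

```python
import math

def code_frame_rate(sampling_rate, enc_strides):
    # Total temporal downsampling ratio is the product of the encoder strides;
    # the code frame rate is the sampling rate divided by that ratio.
    ratio = math.prod(enc_strides)
    return sampling_rate / ratio

print(code_frame_rate(16000, [3, 4, 5, 5]))  # ratio 300 -> ~53.3 codes/s
print(code_frame_rate(16000, [2, 3, 4, 5]))  # ratio 120 -> ~133.3 codes/s
```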

a897456 commented 6 months ago

Using a smaller temporal downsampling ratio may ease the problem (for example, enc_strides: [2, 3, 4, 5] and dec_strides: [5, 4, 3, 2], which give a ratio of 120).

I don't think so, because [2, 3, 4, 5] means the downsampling ratio is 120, and 9600/120 = 80 > 64 (code_dim).

a897456 commented 6 months ago

Same question as https://github.com/facebookresearch/AudioDec/issues/19. I would like to know how to adjust the parameters in the config to achieve the best output for 16 kHz input data. How did you finally adjust them? @lixinghe1999

bigpon commented 6 months ago

because [2, 3, 4, 5] means the downsampling ratio is 120, and 9600/120 = 80 > 64 (code_dim)

Hi, the downsampling applies to the temporal axis, so it gives 48000 (48 kHz) / 120 = 400 Hz for the codes, which is unrelated to the code dimension of 64. That is, for each second you get 400 frames of 64-dimensional latents, each quantized by the residual VQ (here, 8 codebooks).
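
A shape sketch may help (hypothetical tensors, assuming the encoder downsamples only the time axis):

```python
import torch

sampling_rate, ratio, code_dim, codebook_num = 48000, 120, 64, 8

num_frames = sampling_rate // ratio            # 400 frames per second
latent = torch.randn(1, code_dim, num_frames)  # (batch, code_dim=64, frames=400)

# The residual VQ picks one index per codebook per frame, so one second
# of audio becomes 400 x 8 indices, independent of code_dim.
codes = torch.randint(0, 1024, (1, codebook_num, num_frames))
print(latent.shape)  # torch.Size([1, 64, 400])
print(codes.shape)   # torch.Size([1, 8, 400])
```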

a897456 commented 6 months ago

batch_length: 9600

Yes, but with batch_length: 9600, 9600/120 = 80, so I think the strides should be changed along with batch_length.

lixinghe1999 commented 6 months ago

From my understanding, the batch_length only influences GPU memory consumption, so normally we don't need to worry about it (as long as it is divisible by the downsampling ratio). The code_dim you mentioned only applies to a single time frame and is not related to the batch_length. Please correct me if I am wrong.

bigpon commented 6 months ago

Yes, the batch_length is more related to GPU usage, and the only requirement is that it is divisible by the downsampling ratio.
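
A hypothetical helper (not part of AudioDec) that expresses this requirement:

```python
import math

def check_batch_length(batch_length, enc_strides):
    # batch_length must be a multiple of the total downsampling ratio
    ratio = math.prod(enc_strides)
    assert batch_length % ratio == 0, (
        f"batch_length={batch_length} not divisible by ratio={ratio}")
    return batch_length // ratio  # latent frames per training clip

print(check_batch_length(9600, [3, 4, 5, 5]))  # 9600 / 300 = 32 frames
print(check_batch_length(9600, [2, 3, 4, 5]))  # 9600 / 120 = 80 frames
```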

I actually found that a longer batch_length gives better performance, which is intuitive, but a longer batch_length results in a much longer training time in the second stage (w/ GAN training).

However, a longer batch_length does not significantly increase the training time in the first stage, so I use 96000 in the 1st stage and 9600 in the 2nd stage in my latest settings.
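
As a rough sanity check of those two settings (my own arithmetic, assuming the default total stride of 300):

```python
ratio = 300  # 3 * 4 * 5 * 5
for stage, batch_length in [("stage 1", 96000), ("stage 2", 9600)]:
    print(stage, batch_length // ratio, "latent frames per clip")
# stage 1: 320 latent frames, stage 2: 32 latent frames
```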