NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech).
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Citrinet Training: Sentences are cut during prediction #9363

Closed · huks0 closed this issue 2 months ago

huks0 commented 2 months ago

Describe the bug

I am currently training a Citrinet-512 model. I copied the config from the example configs here (https://github.com/NVIDIA/NeMo/tree/main/examples/asr/conf/citrinet) and didn't change it. After noticing the issue, I also tried a config from Hugging Face where someone fine-tuned a Citrinet model (https://huggingface.co/neongeckocom/stt_de_citrinet_512_gamma_0_25). Both configs lead to the same problem: many sentences are cut off at random during prediction, e.g.:

[NeMo I 2024-06-02 22:59:01 wer_bpe:302] reference:so ließ es sich nicht ändern er blieb oberleutnant und um so lieber weil ihm herr bantes sein gewesener vormund längst den winzigen rest seines väterlichen erbteils ausgehändigt hatte und dieses längst schon zu allen heiden ausgewandert war
[NeMo I 2024-06-02 22:59:01 wer_bpe:303] predicted:so ließ es sich nicht ändern er blieb oberleutnant und um so lieber weil ihm herr bantes sein gewesener vormund längst winzigen rest seines väterlichen erbteils ausgehändigt hatte

[NeMo I 2024-06-02 22:59:03 wer_bpe:302] reference:nicht mehr seine was ist denn aus ihr geworden die war so alt und gebrechlich dass sie schließlich zusammenkrachte schon vor einer guten reihe von jahren
[NeMo I 2024-06-02 22:59:03 wer_bpe:303] predicted:nicht mehr seine was ist denn aus ihr geworden die war so alt und gebrechlich dass sie schließlich zu seinem

This happens after only a few epochs (around 2) and doesn't vanish even after 90 epochs. It's not a dataset-specific issue; it occurs randomly across several datasets. It does not happen for every sample, but it does for a relevant share of the data, and it affects both training and evaluation. I tried to figure out what the problem relates to and whether any parameter could solve it, but couldn't detect where the issue comes from.
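For illustration, truncations like the ones logged above can be flagged automatically when scanning the WER logs. A minimal sketch (the helper `looks_truncated` and its thresholds are mine, not part of NeMo): it marks a hypothesis whose opening words match the reference but which ends several words short.

```python
def looks_truncated(reference: str, predicted: str,
                    prefix_words: int = 5, min_missing: int = 3) -> bool:
    """Heuristic truncation check: the hypothesis begins like the
    reference (first `prefix_words` words agree) but is at least
    `min_missing` words shorter, i.e. the tail is likely missing."""
    ref, pred = reference.split(), predicted.split()
    starts_alike = ref[:prefix_words] == pred[:prefix_words]
    return starts_alike and (len(ref) - len(pred)) >= min_missing
```

Running this over reference/predicted pairs gives a rough count of how large the "relevant share" of truncated samples actually is, though it will miss hypotheses that diverge early.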

Steps/Code to reproduce bug

Here is the config used:


name: &name "Citrinet-512 training"

model:
  sample_rate: &sample_rate 16000
  log_prediction: true

  train_ds:
    manifest_filepath: "/train_cleaned.json"
    sample_rate: 16000
    batch_size: 32
    trim_silence: false
    max_duration: 16.0
    min_duration: 1.0
    shuffle: true
    use_start_end_token: false
    num_workers: 8
    pin_memory: true
    is_tarred: false
    tarred_audio_filepaths: null
    shuffle_n: 2048
    bucketing_strategy: 'synced_randomized'
    bucketing_batch_size: null

  validation_ds:
    manifest_filepath: "/dev_cleaned.json"
    sample_rate: 16000
    batch_size: 32
    shuffle: false
    use_start_end_token: false
    num_workers: 8
    pin_memory: true

  test_ds:
    manifest_filepath: "/test_cleaned.json"
    sample_rate: 16000
    batch_size: 32
    shuffle: false
    use_start_end_token: false
    num_workers: 8
    pin_memory: true

  model_defaults:
    repeat: 5
    dropout: 0.1
    separable: true
    se: true
    se_context_size: -1
    kernel_size_factor: 1
    filters: 512
    enc_final: 640

  tokenizer:
    dir: "tokenizer_spe_unigram_v1024"  # path to directory which contains either tokenizer.model (bpe) or vocab.txt (for wpe)
    type: "bpe"  # Can be either bpe or wpe

  preprocessor:
    _target_: 'nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor'
    sample_rate: 16000
    normalize: 'per_feature'
    window_size: 0.025
    window_stride: 0.01
    window: 'hann'
    features: &n_mels 80
    n_fft: 512
    frame_splicing: 1
    dither: 1e-05
    pad_to: 16
    stft_conv: false

  spec_augment:
    _target_: 'nemo.collections.asr.modules.SpectrogramAugmentation'
    freq_masks: 2
    time_masks: 5
    freq_width: 27
    time_width: 0.05

  encoder:
    _target_: nemo.collections.asr.modules.ConvASREncoder
    feat_in: *n_mels
    activation: relu
    conv_mask: true

    jasper:
      - filters: 512
        repeat: 1
        kernel: [5]
        stride: [1]
        dilation: [1]
        dropout: 0.0
        residual: false
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [11]
        stride: [2]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        stride_last: true
        residual_mode: "stride_add"
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [13]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [15]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [17]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [19]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [21]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [13]
        stride: [2]  # *stride
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        stride_last: true
        residual_mode: "stride_add"
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [15]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [17]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [19]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [21]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [23]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [25]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [25]
        stride: [2]  # stride
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        stride_last: true
        residual_mode: "stride_add"
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [27]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [29]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [31]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [33]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [35]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [37]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: 512
        repeat: ${model.model_defaults.repeat}
        kernel: [39]
        stride: [1]
        dilation: [1]
        dropout: ${model.model_defaults.dropout}
        residual: true
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

      - filters: ${model.model_defaults.enc_final}
        repeat: 1
        kernel: [41]
        stride: [1]
        dilation: [1]
        dropout: 0.0
        residual: false
        separable: ${model.model_defaults.separable}
        se: ${model.model_defaults.se}
        se_context_size: ${model.model_defaults.se_context_size}
        kernel_size_factor: ${model.model_defaults.kernel_size_factor}

  decoder:
    _target_: 'nemo.collections.asr.modules.ConvASRDecoder'
    feat_in: 640
    num_classes: 1024
    vocabulary: []

  optim:
    name: 'novograd'
    lr: 0.005
    betas: [0.8, 0.25]
    weight_decay: 0.0001
    sched:
      name: 'CosineAnnealing'
      warmup_steps: null
      warmup_ratio: 0.1
      min_lr: 1e-05
      last_epoch: -1

  target: 'nemo.collections.asr.models.ctc_bpe_models.EncDecCTCModelBPE'
  nemo_version: '1.12.0'

  decoding:
    strategy: 'greedy'
    preserve_alignments: null
    compute_timestamps: null
    word_seperator: ' '
    ctc_timestamp_type: 'all'
    batch_dim_index: 0
    greedy:
      preserve_alignments: false
      compute_timestamps: false

trainer:
  devices: 2 # number of gpus
  max_epochs: 100
  max_steps: -1 # computed at runtime if not set
  num_nodes: 1
  accelerator: gpu
  strategy: auto
  accumulate_grad_batches: 1
  enable_checkpointing: True # Provided by exp_manager
  enable_progress_bar: True
  logger: false  # Provided by exp_manager
  log_every_n_steps: 50  # Interval of logging.
  val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
  check_val_every_n_epoch: 1
  precision: 32
  sync_batchnorm: false
  benchmark: false # needs to be false for models with variable-length speech input as it slows down training

exp_manager:
  exp_dir: null
  name: *name
  create_tensorboard_logger: false
  create_checkpoint_callback: false
  create_mlflow_logger: true
  mlflow_logger_kwargs:
    experiment_name: "training-citrinet-512"
    tracking_uri: [removed, here would be the tracking uri]

  checkpoint_callback_params:
    monitor: "val_wer"
    mode: "min"
    save_top_k: 3
    always_save_nemo: True #not tested yet, found this in a nemo repo
  create_wandb_logger: false
  wandb_logger_kwargs:
    name: null
    project: null
    entity: null
  resume_if_exists: false
  resume_ignore_no_checkpoint: false
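For completeness, a training run with this config would typically be launched through NeMo's example CTC-BPE training script via Hydra. A minimal sketch, assuming the config above is saved as `citrinet_512.yaml` in the current directory and the NeMo repo is checked out locally (the script path `examples/asr/asr_ctc/speech_to_text_ctc_bpe.py` is from the NeMo repo; the overrides just restate values already in the config):

```shell
# Launch multi-GPU training with the YAML above (hypothetical file name).
python examples/asr/asr_ctc/speech_to_text_ctc_bpe.py \
    --config-path=. \
    --config-name=citrinet_512 \
    trainer.devices=2 \
    trainer.accelerator=gpu
```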

Expected behavior

I expect the model to predict the sentences correctly. For many sentences it works; for some, the predictions are simply cut off even though the rest was well recognized. I believe this affects the loss and the WER, and hence it's hard to judge how good the model actually is.

Environment overview (please complete the following information)

I set up an environment on Azure and trained multi-GPU. NeMo was installed via pip (nemo_toolkit==1.21.0).

Environment details

Base environment: tensorflow-2.8-cuda11, python=3.8, torch=2.3.0

Additional context

titu1994 commented 2 months ago

Just saw this. Citrinet is a CTC model: did you check whether your audio, after 8x downsampling, yields fewer frames than the transcript has subword tokens? That's often the reason for dropped words, especially because German transcripts are usually verbose and contain long words.
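The frame-count check suggested here is easy to run offline. With the config above (window_stride of 0.01 s) and Citrinet's 8x time reduction, the encoder emits roughly 12.5 frames per second, and greedy CTC cannot output more non-blank tokens than there are encoder frames. A rough sketch (the function and its name are mine; repeated adjacent tokens actually need extra blank frames in between, so this is an optimistic upper bound):

```python
import math

def ctc_capacity_ok(audio_duration_s: float, num_tokens: int,
                    window_stride_s: float = 0.01, downsample: int = 8) -> bool:
    """True if the encoder produces at least as many frames as the
    transcript has subword tokens; CTC cannot emit more tokens than frames."""
    feature_frames = math.ceil(audio_duration_s / window_stride_s)  # 100 frames/s
    encoder_frames = math.ceil(feature_frames / downsample)         # ~12.5 frames/s
    return num_tokens <= encoder_frames
```

Running this over each manifest entry, with the transcript tokenized by the trained SentencePiece model, should show whether the truncated samples are the ones that violate (or come close to) this constraint.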

I'd suggest trying a Fast Conformer transducer instead; a 105M model should match the performance and memory of Citrinet-512 quite easily.