NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Mismatch between val_cer checkpoints while training and speech_to_text_eval.py on same dataset. #7224

Closed · gabitza-tech closed this issue 1 year ago

gabitza-tech commented 1 year ago

Describe the bug

I am training a Conformer-Transducer medium model, and on the validation set the best checkpoint is val_wer=0.0157-epoch=13.ckpt, where val_wer is actually CER because I set use_cer=True. When I run inference on the same validation set with speech_to_text_eval.py (greedy decoding, same as in training) and that checkpoint (or the saved .nemo model in the checkpoints/ dir, so it is not a .ckpt-to-.nemo conversion problem), I get different results: 5.5% WER / 2.45% CER. I understand there can be small differences between the checkpoint values and offline inference, but a gap of almost 1% absolute CER is pretty significant.
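
For reference, this is roughly how I would cross-check the checkpoint metric outside the training loop: restore the .nemo file, transcribe the validation manifest, and score it with the same word_error_rate helper the eval script uses. A minimal sketch only; the paths are placeholders and the transcribe() signature / return type may differ slightly between NeMo versions.

import json

import nemo.collections.asr as nemo_asr
from nemo.collections.asr.metrics.wer import word_error_rate

model = nemo_asr.models.ASRModel.restore_from("checkpoints/full_ro.nemo")
model.eval()

audio_files, references = [], []
with open("manifests/val.json") as f:
    for line in f:
        entry = json.loads(line)
        audio_files.append(entry["audio_filepath"])
        references.append(entry["text"])

# RNNT models return a (best_hypotheses, all_hypotheses) tuple in the NeMo
# version I am using, hence the fallback below.
out = model.transcribe(paths2audio_files=audio_files, batch_size=8)
hypotheses = out[0] if isinstance(out, tuple) else out

print("CER:", word_error_rate(hypotheses=hypotheses, references=references, use_cer=True))
print("WER:", word_error_rate(hypotheses=hypotheses, references=references, use_cer=False))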

I would like to mention that when using WER as the main metric with a FastConformer-Transducer large model, I did not see this behavior: the WER from the checkpoint filenames matched the WER from the inference script.

I use the NeMo Docker container v23.04, and this is my config:

# It contains the default values for training a Conformer-Transducer ASR model, large size (~120M) with Transducer loss and sub-word encoding.

# Architecture and training config:
# Default learning parameters in this config are set for effective batch size of 2K. To train it with smaller effective
# batch sizes, you may need to re-tune the learning parameters or use higher accumulate_grad_batches.
# Here are the recommended configs for different variants of Conformer-Transducer, other parameters are the same as in this config file.
#
#  +--------------+---------+---------+----------+------------------+--------------+--------------------------+-----------------+
#  | Model        | d_model | n_heads | n_layers | conv_kernel_size | weight_decay | pred_hidden/joint_hidden | pred_rnn_layers |
#  +==============+=========+========+===========+==================+==============+==========================+=================+
#  | Small   (14M)|   176   |    4   |    16     |       31         |     0.0      |           320            |        1        |
#  +--------------+---------+--------+-----------+------------------+--------------+--------------------------+-----------------+
#  | Medium  (32M)|   256   |    4   |    16     |       31         |     1e-3     |           640            |        1        |
#  +--------------+---------+--------+-----------+------------------+--------------+--------------------------+-----------------+
#  | Large  (120M)|   512   |    8   |    17     |       31         |     1e-3     |           640            |        1        |
#  +--------------+---------+--------+-----------+------------------+--------------+--------------------------+-----------------+
#  | XLarge (644M)|  1024   |    8   |    24     |        5         |     1e-3     |           640            |        2        |
#  +--------------+---------+--------+-----------+------------------+--------------+--------------------------+-----------------+  

# Default learning parameters in this config are set for global batch size of 2K while you may use lower values.
# To increase the global batch size with limited number of GPUs, you may use higher accumulate_grad_batches.
# However accumulate_grad_batches is better to be avoided as long as the global batch size is large enough and training is stable.

# You may find more info about Conformer-Transducer here: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#conformer-transducer
# Pre-trained models of Conformer-Transducer can be found here: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/results.html
# The checkpoint of the large model trained on NeMo ASRSET with this recipe can be found here: https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_transducer_large

# We suggest using trainer.precision=bf16 for GPUs that support it; otherwise trainer.precision=16 is recommended.
# Using bf16 or 16 makes it possible to double the batch size and speed up training/inference. If fp16 is not stable and the model diverges after some epochs, you may use fp32.
# Here are the suggested batch size per GPU for each precision and memory sizes:
#  +-----------+------------+------------+
#  | Precision | GPU Memory | Batch Size |
#  +===========+============+============+
#  | 32        |    16GB    |     8      |
#  |           |    32GB    |     16     |
#  |           |    80GB    |     32     |
#  +-----------+------------+------------+
#  | 16 or     |    16GB    |     16     |
#  | bf16      |    32GB    |     32     |
#  |           |    80GB    |     64     |
#  +-----------+------------+------------+
# Note:  They are based on the assumption of max_duration of 20. If you have longer or shorter max_duration, then batch sizes may need to get updated accordingly.

name: "Conformer-Transducer-BPE"
init_from_nemo_model: "models/stt_en_conformer_transducer_medium.nemo"
init_strict: false 

model:
  sample_rate: 16000
  compute_eval_loss: true # eval samples can be very long and exhaust memory. Disable computation of transducer loss during validation/testing with this flag.
  log_prediction: true # enables logging sample predictions in the output during training
  skip_nan_grad: false
  use_cer: true

  model_defaults:
    enc_hidden: ${model.encoder.d_model}
    pred_hidden: 640
    joint_hidden: 640

  train_ds:
    manifest_filepath: "manifests/train.json"
    sample_rate: ${model.sample_rate}
    batch_size: 8 # you may increase batch_size if your memory allows
    shuffle: true
    num_workers: 8
    pin_memory: true
    use_start_end_token: false
    trim_silence: false
    max_duration: 20 
    min_duration: 0.1
    # tarred datasets
    is_tarred: false
    tarred_audio_filepaths: null
    shuffle_n: 2048
    # bucketing params
    bucketing_strategy: "synced_randomized"
    bucketing_batch_size: null
    augmentor:
      #white_noise:
      #  prob: 0.3
      #  min_level: -90.0
      #  max_level: -46.0
      #shift:
      #  prob: 0.3
      #  min_shift_ms: -20.0
      #  max_shift_ms: 20.0
      #gain:
      #  prob: 0.3
      #  min_gain_dbfs: -10.0
      #  max_gain_dbfs: 10.0
      speed:
        prob: 0.4
        sr: 16000
        resample_type: "kaiser_best"
        min_speed_rate: 0.9
        max_speed_rate: 1.1

  validation_ds:
    manifest_filepath: "manifests/val.json"
    sample_rate: ${model.sample_rate}
    batch_size: 8
    shuffle: false
    use_start_end_token: false
    num_workers: 8
    pin_memory: true

  test_ds:
    manifest_filepath: null
    sample_rate: ${model.sample_rate}
    batch_size: 8
    shuffle: false
    use_start_end_token: false
    num_workers: 8
    pin_memory: true

  # You may find more detail on how to train a tokenizer at: /scripts/tokenizers/process_asr_text_tokenizer.py
  # recommend to use SPE Unigram tokenizer with vocab size of 1K to 4k when using 4x sub-sampling
  tokenizer:
    dir: "tokenizers/tokenizer_unigram/tokenizer_spe_unigram_v1024_max_5"  # path to directory which contains either tokenizer.model (bpe) or vocab.txt (for wpe)
    type: bpe  # Can be either bpe (SentencePiece tokenizer) or wpe (WordPiece tokenizer)

  preprocessor:
    _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
    sample_rate: ${model.sample_rate}
    normalize: "per_feature"
    window_size: 0.025
    window_stride: 0.01
    window: "hann"
    features: 80
    n_fft: 512
    frame_splicing: 1
    dither: 0.00001
    pad_to: 0

  spec_augment:
    _target_: nemo.collections.asr.modules.SpectrogramAugmentation
    freq_masks: 5 # set to zero to disable it
    freq_width: 27
    time_masks: 10 # set to zero to disable it
    time_width: 0.05

  encoder:
    _target_: nemo.collections.asr.modules.ConformerEncoder
    feat_in: ${model.preprocessor.features}
    feat_out: -1 # you may set it if you need different output size other than the default d_model
    n_layers: 16
    d_model: 256

    # Sub-sampling parameters
    subsampling: striding # vggnet, striding, stacking or stacking_norm, dw_striding
    subsampling_factor: 4 # must be power of 2 for striding and vggnet
    subsampling_conv_channels: -1 # set to -1 to make it equal to the d_model
    causal_downsampling: false

    # Reduction parameters: Can be used to add another subsampling layer at a given position.
    # Having a 2x reduction will speed up training and inference while keeping similar WER.
    # Adding it at the end will give the best WER while adding it at the beginning will give the best speedup.
    reduction: null # pooling, striding, or null
    reduction_position: null # Encoder block index or -1 for subsampling at the end of encoder
    reduction_factor: 1

    # Feed forward module's params
    ff_expansion_factor: 4

    # Multi-headed Attention Module's params
    self_attention_model: rel_pos # rel_pos or abs_pos
    n_heads: 4 # may need to be lower for smaller d_models
    # [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
    att_context_size: [-1, -1] # -1 means unlimited context
    att_context_style: regular # regular or chunked_limited
    xscaling: true # scales up the input embeddings by sqrt(d_model)
    untie_biases: true # unties the biases of the TransformerXL layers
    pos_emb_max_len: 5000

    # Convolution module's params
    conv_kernel_size: 31
    conv_norm_type: 'batch_norm' # batch_norm or layer_norm or groupnormN (N specifies the number of groups)
    # conv_context_size can be "causal" or a list of two integers such that conv_context_size[0]+conv_context_size[1]+1==conv_kernel_size
    # null means [(kernel_size-1)//2, (kernel_size-1)//2], and 'causal' means [(kernel_size-1), 0]
    conv_context_size: null

    ### regularization
    dropout: 0.1 # The dropout used in most of the Conformer Modules
    dropout_pre_encoder: 0.1 # The dropout used before the encoder
    dropout_emb: 0.0 # The dropout used for embeddings
    dropout_att: 0.1 # The dropout for multi-headed attention modules

    # set to non-zero to enable stochastic depth
    stochastic_depth_drop_prob: 0.0
    stochastic_depth_mode: linear  # linear or uniform
    stochastic_depth_start_layer: 1

  decoder:
    _target_: nemo.collections.asr.modules.RNNTDecoder
    normalization_mode: null # Currently only null is supported for export.
    random_state_sampling: false # Random state sampling: https://arxiv.org/pdf/1910.11455.pdf
    blank_as_pad: true # This flag must be set in order to support exporting of RNNT models + efficient inference.

    prednet:
      pred_hidden: ${model.model_defaults.pred_hidden}
      pred_rnn_layers: 1
      t_max: null
      dropout: 0.2

  joint:
    _target_: nemo.collections.asr.modules.RNNTJoint
    log_softmax: null  # 'null' would set it automatically according to CPU/GPU device
    preserve_memory: false  # dramatically slows down training, but might preserve some memory

    # Fuses the computation of prediction net + joint net + loss + WER calculation
    # to be run on sub-batches of size `fused_batch_size`.
    # When this flag is set to true, consider the `batch_size` of *_ds to be just `encoder` batch size.
    # `fused_batch_size` is the actual batch size of the prediction net, joint net and transducer loss.
    # Using small values here will preserve a lot of memory during training, but will make training slower as well.
    # An optimal ratio of fused_batch_size : *_ds.batch_size is 1:1.
    # However, to preserve memory, this ratio can be 1:8 or even 1:16.
    # Extreme case of 1:B (i.e. fused_batch_size=1) should be avoided as training speed would be very slow.
    fuse_loss_wer: true
    fused_batch_size: 4

    jointnet:
      joint_hidden: ${model.model_defaults.joint_hidden}
      activation: "relu"
      dropout: 0.2

  decoding:
    strategy: "greedy_batch" # can be greedy, greedy_batch, beam, tsd, alsd.

    # greedy strategy config
    greedy:
      max_symbols: 10

    # beam strategy config
    beam:
      beam_size: 2
      return_best_hypothesis: False
      score_norm: true
      tsd_max_sym_exp: 50  # for Time Synchronous Decoding
      alsd_max_target_len: 2.0  # for Alignment-Length Synchronous Decoding

  loss:
    loss_name: "default"

    warprnnt_numba_kwargs:
      # FastEmit regularization: https://arxiv.org/abs/2010.11148
      # You may enable FastEmit to reduce the latency of the model for streaming
      fastemit_lambda: 0.0  # Recommended values to be in range [1e-4, 1e-2], 0.001 is a good start.
      clamp: -1.0  # if > 0, applies gradient clamping in range [-clamp, clamp] for the joint tensor only.

  # Adds Gaussian noise to the gradients of the decoder to avoid overfitting
  variational_noise:
    start_step: 0
    std: 0.0

  optim:
    name: adamw
    lr: 2
    # optimizer arguments
    betas: [0.9, 0.98]
    weight_decay: 1e-3

    # scheduler setup
    sched:
      name: NoamAnnealing
      d_model: ${model.encoder.d_model}
      # scheduler config override
      warmup_steps: 10000
      warmup_ratio: null #0.1 or 0.05
      min_lr: 1e-5 # was 1e-6 before, but apparently it should not drop below 2e-5 to train efficiently

trainer:
  devices: 1 # number of GPUs, -1 would use all available GPUs
  num_nodes: 1
  max_epochs: 500
  max_steps: -1 # computed at runtime if not set
  val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
  accelerator: "gpu"
  strategy: ddp
  accumulate_grad_batches: 32 # increased from 8 (when I had 2 GPUs) to account for a single GPU
  gradient_clip_val: 0.0
  precision: 16 # 16, 32, or bf16
  log_every_n_steps: 100  # Interval of logging.
  enable_progress_bar: True
  resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
  num_sanity_val_steps: 0 # number of steps to perform validation steps for sanity check the validation process before starting the training, setting to 0 disables it
  check_val_every_n_epoch: 1 # number of evaluations on validation every n epochs
  sync_batchnorm: true
  enable_checkpointing: False  # Provided by exp_manager
  logger: false  # Provided by exp_manager
  benchmark: false # needs to be false for models with variable-length speech input as it slows down training

exp_manager:
  exp_dir: null
  name: full
  create_tensorboard_logger: true
  create_checkpoint_callback: true
  checkpoint_callback_params:
    # in case of multiple validation sets, first one is used
    monitor: "val_wer"
    mode: "min"
    save_top_k: 5
    always_save_nemo: True # saves the checkpoints as nemo files instead of PTL checkpoints
    save_best_model: True
  resume_if_exists: True
  resume_ignore_no_checkpoint: True

  create_wandb_logger: false
  wandb_logger_kwargs:
    name: null
    project: null

I would greatly appreciate any ideas on what may cause this behavior!

BR

hainan-xv commented 1 year ago

Could you share the exact command you used to run speech_to_text_eval.py?

gabitza-tech commented 1 year ago

Hi,

This is one of the commands I ran:

python3 speech_to_text_eval.py dataset_manifest=train_dir/manifests/eval_manifest.json model_path=train_dir/nemo_experiments/azure_full_ro_epoca14/checkpoints/full_ro.nemo output_filename=eval_epoca14
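
For completeness, use_cer can also be set explicitly so the reported metric is CER rather than WER (the script prints both either way); the extra arguments below are just the ones exposed by the script's own config:

python3 speech_to_text_eval.py \
    dataset_manifest=train_dir/manifests/eval_manifest.json \
    model_path=train_dir/nemo_experiments/azure_full_ro_epoca14/checkpoints/full_ro.nemo \
    output_filename=eval_epoca14 \
    use_cer=True \
    batch_size=8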

And this is the script:

# Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Script to compute the Word or Character Error Rate of a given ASR model for a given manifest file for some dataset.
The manifest file must conform to standard ASR definition - containing `audio_filepath` and `text` as the ground truth.

Note: This script depends on the `transcribe_speech.py` script, and therefore both scripts should be located in the
same directory during execution.

# Arguments

<< All arguments of `transcribe_speech.py` are inherited by this script, so please refer to `transcribe_speech.py`
for full list of arguments >>

    dataset_manifest: Required - path to dataset JSON manifest file (in NeMo format)
    output_filename: Optional - output filename where the transcriptions will be written.

    use_cer: Bool, whether to compute CER or WER
    tolerance: Float, minimum WER/CER required to pass some arbitrary tolerance.

    only_score_manifest: Bool, when set will skip audio transcription and just calculate WER of provided manifest.

# Usage

## To score a dataset with a manifest file that does not contain previously transcribed `pred_text`.

python speech_to_text_eval.py \
    model_path=null \
    pretrained_name=null \
    dataset_manifest=<Mandatory: Path to an ASR dataset manifest file> \
    output_filename=<Optional: Some output filename which will hold the transcribed text as a manifest> \
    batch_size=32 \
    amp=True \
    use_cer=False

## To score a manifest file which has been previously augmented with transcribed text as `pred_text`
This is useful when one uses `transcribe_speech_parallel.py` to transcribe larger datasets, and results are written
to a manifest which has the two keys `text` (for ground truth) and `pred_text` (for model's transcription)

python speech_to_text_eval.py \
    dataset_manifest=<Mandatory: Path to an ASR dataset manifest file> \
    use_cer=False \
    only_score_manifest=True

"""

import json
import os
from dataclasses import dataclass, is_dataclass
from typing import Optional

import torch
import transcribe_speech
from omegaconf import MISSING, OmegaConf, open_dict

from nemo.collections.asr.metrics.wer import word_error_rate
from nemo.collections.asr.parts.utils.transcribe_utils import PunctuationCapitalization, TextProcessingConfig
from nemo.core.config import hydra_runner
from nemo.utils import logging

@dataclass
class EvaluationConfig(transcribe_speech.TranscriptionConfig):
    dataset_manifest: str = MISSING
    output_filename: Optional[str] = "evaluation_transcripts.json"

    use_cer: bool = False
    tolerance: Optional[float] = None

    only_score_manifest: bool = False

    text_processing: Optional[TextProcessingConfig] = TextProcessingConfig(
        punctuation_marks=".,?", separate_punctuation=False, do_lowercase=False, rm_punctuation=False,
    )

@hydra_runner(config_name="EvaluationConfig", schema=EvaluationConfig)
def main(cfg: EvaluationConfig):
    torch.set_grad_enabled(False)

    if is_dataclass(cfg):
        cfg = OmegaConf.structured(cfg)

    if cfg.audio_dir is not None:
        raise RuntimeError(
            "Evaluation script requires ground truth labels to be passed via a manifest file. "
            "If manifest file is available, submit it via `dataset_manifest` argument."
        )

    if not os.path.exists(cfg.dataset_manifest):
        raise FileNotFoundError(f"The dataset manifest file could not be found at path : {cfg.dataset_manifest}")

    if not cfg.only_score_manifest:
        # Transcribe speech into an output directory
        transcription_cfg = transcribe_speech.main(cfg)  # type: EvaluationConfig

        # Release GPU memory if it was used during transcription
        #if torch.cuda.is_available():
        #    torch.cuda.empty_cache()

        logging.info("Finished transcribing speech dataset. Computing ASR metrics..")

    else:
        cfg.output_filename = cfg.dataset_manifest
        transcription_cfg = cfg

    ground_truth_text = []
    predicted_text = []
    invalid_manifest = False
    with open(transcription_cfg.output_filename, 'r') as f:
        for line in f:
            data = json.loads(line)

            if 'pred_text' not in data:
                invalid_manifest = True
                break

            ground_truth_text.append(data['text'])

            predicted_text.append(data['pred_text'])
    print(predicted_text[0])
    pc = PunctuationCapitalization(cfg.text_processing.punctuation_marks)
    if cfg.text_processing.separate_punctuation:
        ground_truth_text = pc.separate_punctuation(ground_truth_text)
        predicted_text = pc.separate_punctuation(predicted_text)
    if cfg.text_processing.do_lowercase:
        ground_truth_text = pc.do_lowercase(ground_truth_text)
        predicted_text = pc.do_lowercase(predicted_text)
    if cfg.text_processing.rm_punctuation:
        ground_truth_text = pc.rm_punctuation(ground_truth_text)
        predicted_text = pc.rm_punctuation(predicted_text)

    # Test for invalid manifest supplied
    if invalid_manifest:
        raise ValueError(
            f"Invalid manifest provided: {transcription_cfg.output_filename} does not "
            f"contain value for `pred_text`."
        )

    # Compute both WER and CER
    cer = word_error_rate(hypotheses=predicted_text, references=ground_truth_text, use_cer=True)
    wer = word_error_rate(hypotheses=predicted_text, references=ground_truth_text, use_cer=False)

    if cfg.use_cer:
        metric_name = 'CER'
        metric_value = cer
    else:
        metric_name = 'WER'
        metric_value = wer

    if cfg.tolerance is not None:
        if metric_value > cfg.tolerance:
            raise ValueError(f"Got {metric_name} of {metric_value}, which was higher than tolerance={cfg.tolerance}")

        logging.info(f'Got {metric_name} of {metric_value}. Tolerance was {cfg.tolerance}')

    logging.info(f'Dataset WER/CER ' + str(round(100 * wer, 2)) + "%/" + str(round(100 * cer, 2)) + "%")

    # Inject the metric name and score into the config, and return the entire config
    with open_dict(cfg):
        cfg.metric_name = metric_name
        cfg.metric_value = metric_value

    return cfg

if __name__ == '__main__':
    main()  # noqa pylint: disable=no-value-for-parameter
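
For reference, this is how I would check whether the decoding settings stored in the .nemo file match the decoding section of the training config above. A rough sketch; change_decoding_strategy should be available on RNNT models, but exact attribute names may differ across NeMo versions.

import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf, open_dict

model = nemo_asr.models.ASRModel.restore_from("checkpoints/full_ro.nemo")

# Print the decoding config baked into the .nemo file for comparison
print(OmegaConf.to_yaml(model.cfg.decoding))

# Explicitly force the same strategy used for validation during training
decoding_cfg = model.cfg.decoding
with open_dict(decoding_cfg):
    decoding_cfg.strategy = "greedy_batch"
    decoding_cfg.greedy.max_symbols = 10
model.change_decoding_strategy(decoding_cfg)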

BR, Gabi

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

gabitza-tech commented 1 year ago

Hey @hainan-xv ,

Do you have any updates?

Best regards, Gabi

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.