NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Size mismatch for decoder when using pre-trained model for transfer #3944

Closed · Edfame closed this issue 2 years ago

Edfame commented 2 years ago

Hello! 👋🏻

I'm trying to implement a script which lets me either train models from scratch or start from pre-trained ones (.nemo files or models from the cloud), according to the given config file. The goal is either to train European Portuguese models from scratch, to transfer the pre-trained English model to European Portuguese, or to transfer other Portuguese models trained from scratch (e.g. on Brazilian Portuguese) to European Portuguese.

I've seen this example and it is the one I'm trying to replicate, but since my Portuguese labels (41, which means 42 decoder classes once the CTC blank is added) differ in size from the English ones (29), it gives me the following error:

RuntimeError: Error(s) in loading state_dict for EncDecCTCModel:
        size mismatch for decoder.decoder_layers.0.weight: copying a param with shape torch.Size([29, 1024, 1]) from checkpoint, the shape in current model is torch.Size([42, 1024, 1]).
        size mismatch for decoder.decoder_layers.0.bias: copying a param with shape torch.Size([29]) from checkpoint, the shape in current model is torch.Size([42]).

I'm setting the init_from_pretrained_model field in the config file (shown below) to "QuartzNet15x5Base-En", and I intend to change it to init_from_nemo_model: "MyModel.nemo" for the other cases mentioned above.

Is there any way I can do this without having to resort to the .change_vocabulary(), .setup_training_data(), .setup_validation_data() and .setup_test_data() functions?
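For reference, the approach I'm trying to avoid looks roughly like this (a sketch based on the transfer-learning tutorial, reusing the keys from my config below):

import nemo.collections.asr as nemo_asr

# Restore the English checkpoint from NGC, swap in the Portuguese character set
# (which rebuilds the decoder with the new number of classes), then point the
# data loaders at the Portuguese manifests before fine-tuning.
model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")
model.change_vocabulary(new_vocabulary=list(cfg.model.labels))
model.setup_training_data(cfg.model.train_ds)
model.setup_validation_data(cfg.model.validation_ds)
model.setup_test_data(cfg.model.test_ds)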

Code:

import nemo
import torch
import hydra
import nemo.collections.asr as nemo_asr
import pytorch_lightning as pl 
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="./configs/")
def main(cfg: DictConfig):

    trainer = pl.Trainer(**cfg.trainer)
    model = nemo_asr.models.EncDecCTCModel(cfg=cfg.model, trainer=trainer)

    # Initialise weights from init_from_pretrained_model / init_from_nemo_model, if set in the config
    model.maybe_init_from_pretrained_checkpoint(cfg)

    # Train the model
    trainer.fit(model)

    model.save_to(f"./models/{cfg.name}_test.nemo")

if __name__ == "__main__":
    main()

Config:

name: &name "TRANSFER_ENG_TO_PT"

model:
  sample_rate: &sample_rate 16000
  repeat: &repeat 5
  dropout: &dropout 0.0
  separable: &separable true
  batch_size: &batch_size 32
  num_workers: &num_workers 256
  labels: &labels [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n",
                  "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "ç", "à", "á",
                  "â", "ã", "é", "ê", "í", "ó", "ô", "õ", "ú", "-", "'"]

  train_ds:
    manifest_filepath: "censored"
    sample_rate: *sample_rate
    labels: *labels
    batch_size: *batch_size
    trim_silence: True
    max_duration: 16.7
    shuffle: True
    # tarred datasets
    is_tarred: false
    tarred_audio_filepaths: null
    shuffle_n: 2048
    # bucketing params
    bucketing_strategy: "synced_randomized"
    bucketing_batch_size: null
    num_workers: *num_workers

  validation_ds:
    manifest_filepath: "censored"
    sample_rate: *sample_rate
    labels: *labels
    batch_size: *batch_size
    shuffle: False
    num_workers: *num_workers

  test_ds:
    manifest_filepath: "censored"
    sample_rate: *sample_rate
    labels: *labels
    batch_size: *batch_size
    shuffle: False
    num_workers: *num_workers

  preprocessor:
    _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
    normalize: "per_feature"
    window_size: 0.02
    sample_rate: *sample_rate
    window_stride: 0.01
    window: "hann"
    features: &n_mels 64
    n_fft: 512
    frame_splicing: 1
    dither: 0.00001

  spec_augment:
    _target_: nemo.collections.asr.modules.SpectrogramAugmentation
    rect_freq: 50
    rect_masks: 5
    rect_time: 120

  encoder:
    _target_: nemo.collections.asr.modules.ConvASREncoder
    feat_in: *n_mels
    activation: relu
    conv_mask: true

    jasper:
    - dilation: [1]
      dropout: *dropout
      filters: 256
      kernel: [33]
      repeat: 1
      residual: false
      separable: *separable
      stride: [2]

    - dilation: [1]
      dropout: *dropout
      filters: 256
      kernel: [33]
      repeat: *repeat
      residual: true
      separable: *separable
      stride: [1]

    - dilation: [1]
      dropout: *dropout
      filters: 256
      kernel: [33]
      repeat: *repeat
      residual: true
      separable: *separable
      stride: [1]

    - dilation: [1]
      dropout: *dropout
      filters: 256
      kernel: [33]
      repeat: *repeat
      residual: true
      separable: *separable
      stride: [1]

    - dilation: [1]
      dropout: *dropout
      filters: 256
      kernel: [39]
      repeat: *repeat
      residual: true
      separable: *separable
      stride: [1]

    - dilation: [1]
      dropout: *dropout
      filters: 256
      kernel: [39]
      repeat: *repeat
      residual: true
      separable: *separable
      stride: [1]

    - dilation: [1]
      dropout: *dropout
      filters: 256
      kernel: [39]
      repeat: *repeat
      residual: true
      separable: *separable
      stride: [1]

    - dilation: [1]
      dropout: *dropout
      filters: 512
      kernel: [51]
      repeat: *repeat
      residual: true
      separable: *separable
      stride: [1]

    - dilation: [1]
      dropout: *dropout
      filters: 512
      kernel: [51]
      repeat: *repeat
      residual: true
      separable: *separable
      stride: [1]

    - dilation: [1]
      dropout: *dropout
      filters: 512
      kernel: [51]
      repeat: *repeat
      residual: true
      separable: *separable
      stride: [1]

    - dilation: [1]
      dropout: *dropout
      filters: 512
      kernel: [63]
      repeat: *repeat
      residual: true
      separable: *separable
      stride: [1]

    - dilation: [1]
      dropout: *dropout
      filters: 512
      kernel: [63]
      repeat: *repeat
      residual: true
      separable: *separable
      stride: [1]

    - dilation: [1]
      dropout: *dropout
      filters: 512
      kernel: [63]
      repeat: *repeat
      residual: true
      separable: *separable
      stride: [1]

    - dilation: [1]
      dropout: *dropout
      filters: 512
      kernel: [75]
      repeat: *repeat
      residual: true
      separable: *separable
      stride: [1]

    - dilation: [1]
      dropout: *dropout
      filters: 512
      kernel: [75]
      repeat: *repeat
      residual: true
      separable: *separable
      stride: [1]

    - dilation: [1]
      dropout: *dropout
      filters: 512
      kernel: [75]
      repeat: *repeat
      residual: true
      separable: *separable
      stride: [1]

    - dilation: [2]
      dropout: *dropout
      filters: 512
      kernel: [87]
      repeat: 1
      residual: false
      separable: *separable
      stride: [1]

    - dilation: [1]
      dropout: *dropout
      filters: &enc_filters 1024
      kernel: [1]
      repeat: 1
      residual: false
      stride: [1]

  decoder:
    _target_: nemo.collections.asr.modules.ConvASRDecoder
    feat_in: *enc_filters
    num_classes: 0
    vocabulary: *labels

  optim:
    name: novograd
    # _target_: nemo.core.optim.optimizers.Novograd
    lr: .01
    # optimizer arguments
    betas: [0.8, 0.5]
    weight_decay: 0.001

    # scheduler setup
    sched:
      name: CosineAnnealing

      # pytorch lightning args
      # monitor: val_loss
      # reduce_on_plateau: false

      # Scheduler params
      warmup_steps: null
      warmup_ratio: null
      min_lr: 0.0
      last_epoch: -1

trainer:
  gpus: -1 # number of gpus
  max_epochs: 100
  max_steps: -1 # computed at runtime if not set
  num_nodes: 1
  strategy: ddp
  accumulate_grad_batches: 1
  enable_checkpointing: False  # Provided by exp_manager
  logger: False  # Provided by exp_manager
  log_every_n_steps: 1  # Interval of logging.
  val_check_interval: 1.0  # Set to 0.25 to check 4 times per epoch, or an int for number of iterations

init_from_pretrained_model: "QuartzNet15x5Base-En"

exp_manager:
  exp_dir: null
  name: *name
  create_tensorboard_logger: False
  create_checkpoint_callback: False
  checkpoint_callback_params:
    monitor: "val_wer"
    mode: "min"
  create_wandb_logger: False
  wandb_logger_kwargs:
    name: null
    project: null

hydra:
  run:
    dir: .
  job_logging:
    root:
      handlers: null
VahidooX commented 2 years ago

You need to tell init_from_nemo_model to exclude the decoder part. Please take a look here: https://github.com/NVIDIA/NeMo/blob/c15ed0469a908c3bdda859089437ecb8db845cff/nemo/core/classes/modelPT.py#L948 Currently init_from_pretrained_model does not support it, but I am going to add it next week. For now, you can download the .nemo file and use init_from_nemo_model instead.

You should use something like this in the config:

init_from_nemo_model:
  model0:
    path: "QuartzNet15x5Base-En.nemo"
    exclude: ["decoder"]
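
To get the local .nemo file in the first place, you can restore the NGC checkpoint once and save it out (the file name here is just an example):

import nemo.collections.asr as nemo_asr

# Downloads QuartzNet15x5Base-En from NGC (and caches it locally), then writes a
# .nemo file that init_from_nemo_model can point at.
pretrained = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")
pretrained.save_to("QuartzNet15x5Base-En.nemo")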
Edfame commented 2 years ago

You need to tell init_from_nemo_model to exclude the decoder part. Please take a look here: Currently init_from_pretrained_model does not support it, but I am going to add it next week. For now, you can download the .nemo file and use init_from_nemo_model instead.

Oh okay, I thought about doing that for init_from_pretrained_model, but I didn't see how to do it in the code comments. That makes sense, since it is not implemented yet! 😄

When it is implemented, will the config be more or less the same? Something like:

init_from_pretrained_model:
  model0:
    name: "QuartzNet15x5Base-En"
    exclude: ["decoder"]

PS: Will this issue be mentioned in the release notes of the version where this is implemented? Just so I know when to rebuild my Docker images :)

titu1994 commented 2 years ago

The changelog will have some PR details related to this issue, but it won't be available until NeMo 1.9 (about 1.5 months from now). In the meantime, you could sidestep that by copy-pasting and writing your own method, since it's simple enough, rather than waiting for us.
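
Something along these lines should work in the meantime (a rough sketch, not the exact helper from ModelPT; the function name and the "decoder" prefix are just illustrative):

import nemo.collections.asr as nemo_asr

def init_from_pretrained_excluding(model, pretrained_name, exclude=("decoder",)):
    # Restore the pre-trained checkpoint from NGC (downloads and caches the .nemo file).
    pretrained = nemo_asr.models.EncDecCTCModel.from_pretrained(
        model_name=pretrained_name, map_location="cpu"
    )
    # Keep only the parameters whose names do not start with an excluded prefix,
    # e.g. drop the 29-class English decoder so the new 42-class decoder is left untouched.
    state = {
        k: v
        for k, v in pretrained.state_dict().items()
        if not any(k.startswith(prefix) for prefix in exclude)
    }
    # strict=False leaves the excluded (decoder) parameters randomly initialised.
    missing, unexpected = model.load_state_dict(state, strict=False)
    return missing, unexpected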

Edfame commented 2 years ago

I'll either implement that snippet of code (which, as you said, is simple enough) or use .nemo files for every transfer model until 1.9 comes out with that feature implemented.

Thanks very much for the support 😉