NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

How to write/save/load a simple custom decoder using nn.Sequential? [Question] #1445

Closed: rbracco closed this 3 years ago

rbracco commented 3 years ago

Describe your question

I would like to implement my own decoder for transfer learning to experiment with multiple linear layers. I did so by overriding quartznet.decoder.decoder_layers with my own nn.Sequential using the code below

quartznet.decoder.decoder_layers = nn.Sequential(nn.Conv1d(1024, 256, kernel_size=1, stride=1), nn.ReLU(), nn.Conv1d(256, 41, kernel_size=1, stride=1))

This worked and it trains well, but when I save the model and try to load it, I get the following error:

RuntimeError: Error(s) in loading state_dict for EncDecCTCModel:
    Unexpected key(s) in state_dict: "decoder.decoder_layers.2.weight", "decoder.decoder_layers.2.bias". 
    size mismatch for decoder.decoder_layers.0.weight: copying a param with shape torch.Size([256, 1024, 1]) from checkpoint, the shape in current model is torch.Size([41, 1024, 1]).
    size mismatch for decoder.decoder_layers.0.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([41]).

This is because my underlying config file still specifies the decoder as being from the class nemo.collections.asr.modules.ConvASRDecoder. I have no idea how to update the config file to use my new decoder, or how to bypass the config file altogether, and couldn't find a way to do so in the docs. Even if I load up quartznet and manually overwrite the decoder, then try to load the saved checkpoint, it fails because it seems to be using the config file behind the scenes.
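
For reference, the save/restore round trip that fails looks like this (the file name is just illustrative):

import nemo.collections.asr as nemo_asr

# save the model with the swapped-in decoder...
quartznet.save_to("quartznet_custom_decoder.nemo")

# ...then restoring raises the RuntimeError above, because restore_from
# rebuilds the decoder from the stored ConvASRDecoder config before loading weights
quartznet = nemo_asr.models.EncDecCTCModel.restore_from("quartznet_custom_decoder.nemo")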

Environment overview (please complete the following information)

Colab using nemo-toolkit[all]==1.0.0b1 and config from https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/asr/conf/config.yaml

Environment details

Python 3.6.9, PyTorch 1.7, OS: Ubuntu 18.04.5 LTS

titu1994 commented 3 years ago

You should not be overriding your decoder as such. You should use a custom neural module.

  1. Make your custom decoder extend the NeuralModule class.
  2. Override the nemo.collections.asr.modules.ConvASRDecoder classpath with the classpath of your new decoder.
  3. Create the model with this config and train.

During evaluation, make sure to import the custom decoder before you call restore_from.
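
Something like this (my_decoders / MyCustomDecoder are placeholder names, not NeMo classes):

import nemo.collections.asr as nemo_asr
from my_decoders import MyCustomDecoder  # import the custom decoder before restoring

# the checkpoint's config must point its decoder _target_ at my_decoders.MyCustomDecoder
model = nemo_asr.models.EncDecCTCModel.restore_from("model_with_custom_decoder.nemo")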

rbracco commented 3 years ago

Thank you, I will try this and follow up if I have any questions. If I get it working I will leave an example for future people with the same issue.

rbracco commented 3 years ago

Okay so I worked on this for a while and then got stuck. I wrote a TwoLayerDecoder class that extends NeuralModule but I'm not sure how to use it with a pretrained model. I have two questions.

  1. Once I load quartznet with pretrained weights, how do I then swap in my decoder using the config file while keeping the pretrained weights in the encoder?
  2. What is the best way to achieve my two-layer decoder? Should I write a new class extending NeuralModule and copy in most of the code from nemo.collections.asr.modules.ConvASRDecoder and then change what I need? Or is it okay to inherit from nemo.collections.asr.modules.ConvASRDecoder and just write a new __init__ and forward? Thank you!

titu1994 commented 3 years ago

For changing the decoder without changing the vocabulary size (keeping QuartzNet's 28-character vocabulary), it's done as follows.

For 1) you should be able to do quartznet.decoder = MyNewTwoLayerDecoder(). Don't forget to update quartznet.cfg.decoder._target_ with the classpath of your new decoder class, otherwise restore_from won't work.

That's about it.

For a decoder that changes the vocabulary size, first use the quartznet.change_vocabulary() method, then do the above steps.

For 2) inheriting is fine, but only if you override the decoder layer correctly, so that the previously created weights are replaced with your own. For a cleaner approach, I'd say copy-paste the code and edit the portions you wish to change.
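
Put together, the flow looks roughly like this (my_decoders / MyNewTwoLayerDecoder are placeholder names for your own module):

import nemo.collections.asr as nemo_asr
from my_decoders import MyNewTwoLayerDecoder

quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")

my_new_vocab = [" ", "a", "b", "c"]  # stand-in for your real label set

# if the vocabulary changes, update it first
quartznet.change_vocabulary(new_vocabulary=my_new_vocab)

# swap in the custom decoder and keep the config in sync so restore_from works later
quartznet.decoder = MyNewTwoLayerDecoder(feat_in=1024, num_classes=len(my_new_vocab), vocabulary=my_new_vocab)
quartznet.cfg.decoder._target_ = "my_decoders.MyNewTwoLayerDecoder"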

rbracco commented 3 years ago

Thank you, so I actually tried 3 different ways...

  1. Copy code and edit then do quartznet.decoder = MyNewTwoLayerDecoder(*args)
  2. Inheritance
  3. Copy code and edit and then create a new model from config and do quartznet.decoder=new_model.decoder (this one mainly to make sure the config file would work)

All 3 methods successfully combine the pretrained encoder weights with the new decoder. I then freeze the encoder layers and fit, but in each case I get KeyError: 30 at the line reference = ''.join([self.labels_map[c] for c in target]) in wer.py. I have 41 classes (including the blank label), and the correct vocab and number of classes appear when I run quartznet.decoder.vocabulary, but when I enter the debugger, the labels_map attribute of WER (a dict mapping int values to labels) is still using the original English alphabet. I did make sure to call quartznet.change_vocabulary(new_vocabulary=my_new_vocab) but somehow it doesn't fix it.

Any ideas? Please let me know if this warrants a new issue and in the meantime I'll keep digging. Thanks so much for your time and help.

rbracco commented 3 years ago

Oh wow, so 2 seconds after posting this, I reran the code with the order of execution swapped to

quartznet.change_vocabulary(new_vocabulary=my_new_vocab)
quartznet.decoder = ConvASRDecoderTwo(1024, 40, vocabulary=my_new_vocab)

instead of

quartznet.decoder = ConvASRDecoderTwo(1024, 40, vocabulary=my_new_vocab)
quartznet.change_vocabulary(new_vocabulary=my_new_vocab)

and the KeyError disappeared. It appears that you need to change the vocabulary on the pretrained model prior to instantiating the new decoder.

I'm still not sure everything is working, because the loss is coming down much more slowly than when I overwrote the decoder manually with quartznet.decoder.decoder_layers = nn.Sequential(nn.Conv1d(1024, 256, kernel_size=1, stride=1), nn.ReLU(), nn.Conv1d(256, 41, kernel_size=1, stride=1)). I'll keep digging and report back.

titu1994 commented 3 years ago

The error disappears, but the WER will now actually be incorrect - it will assume the CTC blank ID is 29, while you have 41 labels.

I would rather suggest this (apologies for the roundabout way above) -

  1. Create the neural module.
  2. Change QuartzNet.cfg.decoder._target_ to the classpath of the new decoder.
  3. Simply call change_vocabulary.

That should be all that's actually needed. Please let me know if this works.
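
In code, roughly (my_decoders is a placeholder; in a notebook the classpath would be __main__.ConvASRDecoderTwo):

import nemo.collections.asr as nemo_asr

quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")

my_new_vocab = [" ", "a", "b", "c"]  # stand-in for the real vocabulary

# point the decoder config at the custom class...
quartznet.cfg.decoder._target_ = "my_decoders.ConvASRDecoderTwo"

# ...and change_vocabulary should then rebuild the decoder from the updated config
quartznet.change_vocabulary(new_vocabulary=my_new_vocab)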

rbracco commented 3 years ago

Thank you for this. I'm trying it now but am running into some issues setting the config. I started with code from the ASR with NeMo tutorial, which appears to use Hydra 1.0. I see that in Hydra 1.1 cls is replaced with _target_ in the config files, so I followed your instructions, but for step 2 I changed quartznet.cfg.decoder.cls to the classpath of my new decoder. When I call change_vocabulary, however, my decoder doesn't change.

Exactly what I did is below, any ideas?

Step 1: Create Neural Module

All I changed here: (A) replaced the single 1D conv in the decoder layers with conv1d -> ReLU -> conv1d, and (B) commented out the line that does NeMo's weight initialization.

from collections import OrderedDict

import torch

from nemo.core.classes import Exportable, NeuralModule, typecheck
from nemo.core.neural_types import AcousticEncodedRepresentation, LogprobsType, NeuralType
from nemo.utils import logging


class ConvASRDecoderTwo(NeuralModule, Exportable):
    """Simple ASR Decoder for use with CTC-based models such as JasperNet and QuartzNet
     Based on these papers:
        https://arxiv.org/pdf/1904.03288.pdf
        https://arxiv.org/pdf/1910.10261.pdf
        https://arxiv.org/pdf/2005.04290.pdf
    """

    def save_to(self, save_path: str):
        pass

    @classmethod
    def restore_from(cls, restore_path: str):
        pass

    @property
    def input_types(self):
        return OrderedDict({"encoder_output": NeuralType(('B', 'D', 'T'), AcousticEncodedRepresentation())})

    @property
    def output_types(self):
        return OrderedDict({"logprobs": NeuralType(('B', 'T', 'D'), LogprobsType())})

    def __init__(self, feat_in, num_classes, init_mode="xavier_uniform", vocabulary=None):
        super().__init__()
        if vocabulary is not None:
            if num_classes != len(vocabulary):
                raise ValueError(
                    f"If vocabulary is specified, it's length should be equal to the num_classes. Instead got: num_classes={num_classes} and len(vocabulary)={len(vocabulary)}"
                )
            self.__vocabulary = vocabulary
        self._feat_in = feat_in
        # Add 1 for blank char
        self._num_classes = num_classes + 1

        self.decoder_layers = torch.nn.Sequential(
            torch.nn.Conv1d(self._feat_in, 256, kernel_size=1, bias=True),
            torch.nn.ReLU(),
            torch.nn.Conv1d(256, self._num_classes, kernel_size=1, bias=True),
        )
        #self.apply(lambda x: init_weights(x, mode=init_mode))

    @typecheck()
    def forward(self, encoder_output):
        return torch.nn.functional.log_softmax(self.decoder_layers(encoder_output).transpose(1, 2), dim=-1)

    def input_example(self):
        """
        Generates input examples for tracing etc.
        Returns:
            A tuple of input examples.
        """
        bs = 8
        seq = 64
        input_example = torch.randn(bs, self._feat_in, seq).to(next(self.parameters()).device)
        return tuple([input_example])

    def _prepare_for_export(self):
        m_count = 0
        for m in self.modules():
            if type(m).__name__ == "MaskedConv1d":
                m.use_mask = False
                m_count += 1
        if m_count > 0:
            logging.warning(f"Turned off {m_count} masked convolutions")
        Exportable._prepare_for_export(self)

    @property
    def vocabulary(self):
        return self.__vocabulary

    @property
    def num_classes_with_blank(self):
        return self._num_classes

Steps 2 and 3: Change quartznet.cfg.decoder.cls to the classpath of the new decoder, then call change_vocabulary

[screenshot: setting quartznet.cfg.decoder.cls to the new decoder's classpath and calling change_vocabulary]
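
In code, that cell was roughly the following (the __main__ classpath assumes ConvASRDecoderTwo is defined in the notebook itself):

# step 2: point the (Hydra 1.0 style) config at the new decoder class
quartznet.cfg.decoder.cls = "__main__.ConvASRDecoderTwo"

# step 3: change the vocabulary, which should rebuild the decoder from the config
quartznet.change_vocabulary(new_vocabulary=my_new_vocab)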