athena-team / athena

An open-source implementation of a sequence-to-sequence based speech processing engine
https://athena-team.readthedocs.io
Apache License 2.0

[Add] Add convolution module for Transformer. #328

Closed TeaPoly closed 4 years ago

TeaPoly commented 4 years ago

Inspired by section 2.2 of Conformer: Convolution-augmented Transformer for Speech Recognition, which describes this technique.
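
For readers unfamiliar with that module, here is a minimal TensorFlow sketch of the structure described there (pointwise conv → GLU → depthwise conv → BatchNorm → Swish → pointwise conv → dropout, with a residual connection). This is an illustration only, not necessarily the exact layer added in this PR:

    import tensorflow as tf

    class ConvModule(tf.keras.layers.Layer):
        """Conformer-style convolution module (illustrative sketch)."""

        def __init__(self, d_model=256, kernel_size=15, dropout_rate=0.1):
            super().__init__()
            self.ln = tf.keras.layers.LayerNormalization()
            # pointwise conv doubles the channels so the GLU can halve them again
            self.pw_conv1 = tf.keras.layers.Conv1D(2 * d_model, kernel_size=1)
            # depthwise conv: one filter per channel via groups=d_model
            # (requires a TF version with grouped Conv1D support)
            self.dw_conv = tf.keras.layers.Conv1D(
                d_model, kernel_size=kernel_size, padding="same", groups=d_model)
            self.bn = tf.keras.layers.BatchNormalization()
            self.pw_conv2 = tf.keras.layers.Conv1D(d_model, kernel_size=1)
            self.dropout = tf.keras.layers.Dropout(dropout_rate)

        def call(self, x, training=None):
            residual = x
            x = self.ln(x)
            x = self.pw_conv1(x)
            a, b = tf.split(x, 2, axis=-1)
            x = a * tf.sigmoid(b)              # GLU activation
            x = self.dw_conv(x)
            x = self.bn(x, training=training)
            x = x * tf.sigmoid(x)              # Swish activation
            x = self.pw_conv2(x)
            x = self.dropout(x, training=training)
            return residual + x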

The best AISHELL results up to 10 epochs are:

| CTC   | Attention | Conv Kernel Size | Model Size |
|-------|-----------|------------------|------------|
| 8.56% | 6.33%     | 0 (no conv)      | 27.3M      |
| 7.31% | 6.00%     | 15               | 29.7M      |

My config examples/asr/aishell/configs/mtl_transformer_sp.json is (note the new conv_module_kernel_size field):

{
    "batch_size": 32,
    "num_epochs": 20,
    "sorta_epoch": 1,
    "ckpt": "examples/asr/aishell/ckpts/mtl_transformer_ctc_conv/",
    "summary_dir": "examples/asr/aishell/ckpts/mtl_transformer_ctc_conv/event",

    "solver_gpu": [0],
    "solver_config": {
        "clip_norm": 100,
        "log_interval": 10,
        "enable_tf_function": true
    },

    "model": "mtl_transformer_ctc",
    "num_classes": null,
    "pretrained_model": null,
    "model_config": {
        "model": "speech_transformer",
        "model_config": {
            "return_encoder_output": true,
            "num_filters": 256,
            "d_model": 256,
            "num_heads": 4,
            "num_encoder_layers": 12,
            "num_decoder_layers": 6,
            "dff": 2048,
            "rate": 0.1,
            "conv_module_kernel_size": 15,
            "label_smoothing_rate": 0.1,
            "schedual_sampling_rate": 0.9
        },
        "mtl_weight": 0.5
    },

    "inference_config": {
        "decoder_type": "beam_search_decoder",
        "model_avg_num": 10,
        "beam_size": 10,
        "ctc_weight": 0.5,
        "lm_weight": 0.7,
        "lm_type": "rnn",
        "lm_path": "examples/asr/aishell/configs/rnnlm.json"
    },

    "optimizer": "warmup_adam",
    "optimizer_config": {
        "d_model": 512,
        "warmup_steps": 25000,
        "k": 1.0,
        "decay_steps": 100000000,
        "decay_rate": 0.1
    },

    "dataset_builder": "speech_recognition_dataset",
    "num_data_threads": 6,
    "trainset_config": {
        "data_csv": "examples/asr/aishell/data/train.csv",
        "audio_config": { "type": "Fbank", "filterbank_channel_count": 80 },
        "cmvn_file": "examples/asr/aishell/data/cmvn",
        "text_config": { "type": "vocab", "model": "examples/asr/aishell/data/vocab" },
        "input_length_range": [10, 8000],
        "speed_permutation": [0.9, 1.0, 1.1]
    },
    "devset_config": {
        "data_csv": "examples/asr/aishell/data/dev.csv",
        "audio_config": { "type": "Fbank", "filterbank_channel_count": 80 },
        "cmvn_file": "examples/asr/aishell/data/cmvn",
        "text_config": { "type": "vocab", "model": "examples/asr/aishell/data/vocab" },
        "input_length_range": [10, 8000]
    },
    "testset_config": {
        "data_csv": "examples/asr/aishell/data/test.csv",
        "audio_config": { "type": "Fbank", "filterbank_channel_count": 80 },
        "cmvn_file": "examples/asr/aishell/data/cmvn",
        "text_config": { "type": "vocab", "model": "examples/asr/aishell/data/vocab" }
    }
}
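
A side note on the optimizer_config block: the d_model, warmup_steps, and k values there look like parameters of the standard Transformer (Noam) warmup schedule. Whether athena's warmup_adam implements exactly this formula is an assumption, but the intended learning-rate shape would be:

    def warmup_lr(step, d_model=512, warmup_steps=25000, k=1.0):
        # step >= 1; lr rises linearly for warmup_steps steps, then decays as 1/sqrt(step)
        return k * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
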
Some-random commented 4 years ago

@TeaPoly Thank you for this excellent pr! I have one general question regarding Conformer. It looks like adding conv blocks and macaron-like ffn might hurt RTF. Have you tested the RTF difference (like in argmax mode) between Conformer and Transformer?
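
(For context, RTF here means real-time factor: decoding wall-clock time divided by total audio duration. A minimal, framework-agnostic way to measure it, with decode_fn and utterances as placeholder names rather than athena APIs, would be:)

    import time

    def measure_rtf(decode_fn, utterances, total_audio_seconds):
        # RTF = total decoding wall-clock time / total audio duration
        start = time.time()
        for utt in utterances:
            decode_fn(utt)
        return (time.time() - start) / total_audio_seconds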

TeaPoly commented 4 years ago

@TeaPoly Thank you for this excellent pr! I have one general question regarding Conformer. It looks like adding conv blocks and macaron-like ffn might hurt RTF. Have you tested the RTF difference (like in argmax mode) between Conformer and Transformer?

I only added the convolution module, not the macaron FFN, and I have not tested the RTF difference between Conformer and Transformer. In addition, I tried applying the macaron FFN and the convolution module together in one model, but the model size increased too much, and I haven't adjusted it to the same parameter count for a comparison experiment. Here are the results up to 10 epochs:

| CTC   | Attention | Macaron | Conv | Model Size |
|-------|-----------|---------|------|------------|
| 8.56% | 6.33%     |         |      | 27.3M      |
| 7.31% | 6.00%     |         | ✓    | 29.7M      |
| 6.77% | 5.69%     | ✓       | ✓    | 42.3M      |
Some-random commented 4 years ago

The WER reduction from Macaron looks promising! If I'm understanding correctly, the parameters of the two green Feed Forward Modules in Fig. 1 of the original paper are shared, and the extra parameters come from the extra linear layer in Fig. 4. So if we just keep the same number of linear layers in the feed-forward module, we should end up with a similar number of parameters?
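
For scale (using d_model=256, dff=2048, and the 12 encoder layers from the config above), a single position-wise FFN is about 1.05M parameters, so one extra FFN per encoder layer would add roughly 12.6M, which is about the 29.7M → 42.3M gap in the table:

    # weights plus biases of the two linear layers in one position-wise FFN
    d_model, dff, num_encoder_layers = 256, 2048, 12
    ffn_params = (d_model * dff + dff) + (dff * d_model + d_model)  # ~1.05M per layer
    print(ffn_params * num_encoder_layers / 1e6)                    # ~12.6M extra over 12 layers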

Some-random commented 4 years ago

I've just glanced through the original Macaron paper and I'm not even sure whether my understanding is correct. I think we can check the implementations in the Macaron repo and ESPnet, and discuss later.

TeaPoly commented 4 years ago

The WER reduction from Macaron looks promising! If I'm understanding correctly, the parameters of the two green Feed Forward Modules in Fig. 1 of the original paper are shared, and the extra parameters come from the extra linear layer in Fig. 4. So if we just keep the same number of linear layers in the feed-forward module, we should end up with a similar number of parameters?

I have reviewed the ESPnet and Lingvo source code. It seems that the two parts do not share parameters.

ESPnet espnet/espnet2/asr/encoder/conformer_encoder.py +200:

        self.encoders = repeat(
            num_blocks,
            lambda lnum: EncoderLayer(
                output_size,
                encoder_selfattn_layer(*encoder_selfattn_layer_args),
                positionwise_layer(*positionwise_layer_args),
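                # macaron_style constructs a second, independent position-wise FFN (no weight sharing)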
                positionwise_layer(*positionwise_layer_args) if macaron_style else None,
                convolution_layer(*convolution_layer_args) if use_cnn_module else None,
                dropout_rate,
                normalize_before,
                concat_after,
            ),
        )

Lingvo lingvo/lingvo/core/conformer_layer.py +427:

    fflayer_start_p = p.fflayer_start_tpl.Copy().Set(
        input_dim=p.input_dim,
        hidden_dim=p.fflayer_hidden_dim,
        activation='SWISH',
        residual_weight=p.fflayer_residual_weight,
        residual_dropout_prob=p.dropout_prob,
        relu_dropout_prob=p.dropout_prob)
    self.CreateChild('fflayer_start', fflayer_start_p)

    fflayer_end_p = p.fflayer_end_tpl.Copy().Set(
        input_dim=p.input_dim,
        hidden_dim=p.fflayer_hidden_dim,
        activation='SWISH',
        residual_weight=p.fflayer_residual_weight,
        residual_dropout_prob=p.dropout_prob,
        relu_dropout_prob=p.dropout_prob)
    self.CreateChild('fflayer_end', fflayer_end_p)

I checked the Macaron structure in the original paper, Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View:

[screenshot: Macaron feed-forward structure from the paper]

It seems they use different parameters, but with the same structure (hidden dimension and activation).
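
That matches the Conformer block equations in the paper: the two feed-forward modules have identical structure but their own weights, and each contributes a half-step residual. Roughly (ffn1, ffn2, self_attn, conv_module, final_norm are placeholder callables):

    def conformer_block(x, ffn1, self_attn, conv_module, ffn2, final_norm):
        # ffn1 and ffn2 share the same structure but not their weights
        x = x + 0.5 * ffn1(x)       # macaron feed-forward, half-step residual
        x = x + self_attn(x)        # multi-head self-attention
        x = x + conv_module(x)      # convolution module
        x = x + 0.5 * ffn2(x)       # second feed-forward, half-step residual
        return final_norm(x)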

TeaPoly commented 4 years ago

This is part of a training log from ESPnet; feed_forward and feed_forward_macaron have different scope names:

      (1): EncoderLayer(
        (self_attn): RelPositionMultiHeadedAttention(
          (linear_q): Linear(in_features=256, out_features=256, bias=True)
          (linear_k): Linear(in_features=256, out_features=256, bias=True)
          (linear_v): Linear(in_features=256, out_features=256, bias=True)
          (linear_out): Linear(in_features=256, out_features=256, bias=True)
          (dropout): Dropout(p=0.0, inplace=False)
          (linear_pos): Linear(in_features=256, out_features=256, bias=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=256, out_features=2048, bias=True)
          (w_2): Linear(in_features=2048, out_features=256, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (activation): Swish()
        )
        (feed_forward_macaron): PositionwiseFeedForward(
          (w_1): Linear(in_features=256, out_features=2048, bias=True)
          (w_2): Linear(in_features=2048, out_features=256, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (activation): Swish()
        )
        (conv_module): ConvolutionModule(
          (pointwise_conv1): Conv1d(256, 512, kernel_size=(1,), stride=(1,))
          (depthwise_conv): Conv1d(256, 256, kernel_size=(15,), stride=(1,), padding=(7,), groups=256)
          (norm): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (pointwise_conv2): Conv1d(256, 256, kernel_size=(1,), stride=(1,))
          (activation): Swish()
        )
        (norm_ff): LayerNorm((256,), eps=1e-12, elementwise_affine=True)
        (norm_mha): LayerNorm((256,), eps=1e-12, elementwise_affine=True)
        (norm_ff_macaron): LayerNorm((256,), eps=1e-12, elementwise_affine=True)
        (norm_conv): LayerNorm((256,), eps=1e-12, elementwise_affine=True)
        (norm_final): LayerNorm((256,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )

Case closed 😄.

Some-random commented 4 years ago

The WER reduction from Macaron looks promising! If I'm understanding correctly, the parameters of the two green Feed Forward Modules in Fig. 1 of the original paper are shared, and the extra parameters come from the extra linear layer in Fig. 4. So if we just keep the same number of linear layers in the feed-forward module, we should end up with a similar number of parameters?

I have reviewed the ESPnet and Lingvo source code. It seems that the two parts do not share parameters. I checked the Macaron structure in the original paper, Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View:

[screenshot: Macaron feed-forward structure from the paper]

It seems they use different parameters.

Thank you for the thorough investigation! In this case, I think the extra parameters come from the three extra ffn layers introduced by Macaron and the new Feed Forward module (Conformer has 4 ffn layers while Transformer only has 1). To minimize the effect of the parameter count, we can just use one ffn for the Feed Forward module. I'm also curious whether using the Macaron module instead of a single FFN would bring improvements on AISHELL (like the experiments in Table 5 of the original paper).
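
One way to roughly equalize the budget (a suggestion, not something tested in this thread) would be to split the width of the single dff=2048 FFN across the two macaron FFNs, e.g. dff=1024 each, which keeps the per-layer FFN weight count about the same:

    d_model = 256
    single_ffn_weights   = 2 * d_model * 2048        # one FFN at dff=2048  -> ~1.05M weights
    macaron_pair_weights = 2 * (2 * d_model * 1024)  # two FFNs at dff=1024 -> ~1.05M weights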