@TeaPoly Thank you for this excellent PR! I have one general question regarding Conformer. It looks like adding conv blocks and macaron-like FFN might hurt RTF. Have you tested the RTF difference (e.g., in argmax mode) between Conformer and Transformer?
I only added the convolution module, not the macaron FFN. I have not tested the RTF difference between Conformer and Transformer. In addition, I tried applying the macaron FFN and the convolution module in one model, but the model size increased too much, and I haven't adjusted it to the same number of parameters for a fair comparison. Here are the results up to 10 epochs.
| CTC | Attention | Macaron | Conv | Size |
|---|---|---|---|---|
| 8.56% | 6.33% | ✕ | ✕ | 27.3M |
| 7.31% | 6.00% | ✕ | ✓ | 29.7M |
| 6.77% | 5.69% | ✓ | ✓ | 42.3M |
The WER reduction from Macaron looks promising! If I'm understanding correctly, the parameters of the two green Feed Forward Modules in Fig. 1 of the original paper are shared, and the extra parameters come from the extra linear layer in Fig. 4. So if we just keep the same number of linear layers in the feed-forward module, we should end up with a similar number of parameters?
I have reviewed the ESPnet and Lingvo source code. It seems that the two parts do not share parameters.
ESPnet, `espnet/espnet2/asr/encoder/conformer_encoder.py` (around line 200):
self.encoders = repeat(
    num_blocks,
    lambda lnum: EncoderLayer(
        output_size,
        encoder_selfattn_layer(*encoder_selfattn_layer_args),
        positionwise_layer(*positionwise_layer_args),
        # A second, independently constructed FFN when macaron style is on
        # (no weight sharing with the one above).
        positionwise_layer(*positionwise_layer_args) if macaron_style else None,
        convolution_layer(*convolution_layer_args) if use_cnn_module else None,
        dropout_rate,
        normalize_before,
        concat_after,
    ),
)
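For intuition, here is a minimal PyTorch-style sketch of how a macaron block is typically wired: two independently constructed FFNs (no weight sharing) around self-attention, each with a 0.5-weighted residual. The class and argument names are my own illustration, not the actual ESPnet `EncoderLayer`, and the convolution module is omitted:

```python
import torch
from torch import nn


class MacaronEncoderLayerSketch(nn.Module):
    """Illustrative macaron-style block: two *separate* FFN instances,
    each applied with a 0.5-weighted (half-step) residual."""

    def __init__(self, d_model=256, d_ff=2048, n_heads=4, dropout=0.1):
        super().__init__()
        # Two independently constructed FFNs -> independent parameters.
        self.feed_forward_macaron = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.norm_ff_macaron = nn.LayerNorm(d_model)
        self.norm_mha = nn.LayerNorm(d_model)
        self.norm_ff = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):  # x: (batch, time, d_model)
        # Half-step FFN before attention (the "macaron" half).
        x = x + 0.5 * self.dropout(self.feed_forward_macaron(self.norm_ff_macaron(x)))
        # Self-attention with a full residual.
        y = self.norm_mha(x)
        x = x + self.dropout(self.self_attn(y, y, y, need_weights=False)[0])
        # Half-step FFN after attention (the convolution module would sit in between).
        x = x + 0.5 * self.dropout(self.feed_forward(self.norm_ff(x)))
        return x


out = MacaronEncoderLayerSketch()(torch.randn(4, 100, 256))  # quick shape check
```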
Lingvo, `lingvo/lingvo/core/conformer_layer.py` (around line 427):
# First half-step feed-forward layer.
fflayer_start_p = p.fflayer_start_tpl.Copy().Set(
    input_dim=p.input_dim,
    hidden_dim=p.fflayer_hidden_dim,
    activation='SWISH',
    residual_weight=p.fflayer_residual_weight,
    residual_dropout_prob=p.dropout_prob,
    relu_dropout_prob=p.dropout_prob)
self.CreateChild('fflayer_start', fflayer_start_p)

# Second feed-forward layer, created as a separate child with its own
# variables (no sharing with 'fflayer_start').
fflayer_end_p = p.fflayer_end_tpl.Copy().Set(
    input_dim=p.input_dim,
    hidden_dim=p.fflayer_hidden_dim,
    activation='SWISH',
    residual_weight=p.fflayer_residual_weight,
    residual_dropout_prob=p.dropout_prob,
    relu_dropout_prob=p.dropout_prob)
self.CreateChild('fflayer_end', fflayer_end_p)
I also checked the Macaron structure in the original paper, *Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View*. It seems the two FFNs use different parameters but have the same structure (the same hidden dimension and activation).
Here is part of an ESPnet training log; `feed_forward` and `feed_forward_macaron` have different scope names:
(1): EncoderLayer(
(self_attn): RelPositionMultiHeadedAttention(
(linear_q): Linear(in_features=256, out_features=256, bias=True)
(linear_k): Linear(in_features=256, out_features=256, bias=True)
(linear_v): Linear(in_features=256, out_features=256, bias=True)
(linear_out): Linear(in_features=256, out_features=256, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear_pos): Linear(in_features=256, out_features=256, bias=False)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=256, out_features=2048, bias=True)
(w_2): Linear(in_features=2048, out_features=256, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(activation): Swish()
)
(feed_forward_macaron): PositionwiseFeedForward(
(w_1): Linear(in_features=256, out_features=2048, bias=True)
(w_2): Linear(in_features=2048, out_features=256, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(activation): Swish()
)
(conv_module): ConvolutionModule(
(pointwise_conv1): Conv1d(256, 512, kernel_size=(1,), stride=(1,))
(depthwise_conv): Conv1d(256, 256, kernel_size=(15,), stride=(1,), padding=(7,), groups=256)
(norm): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(pointwise_conv2): Conv1d(256, 256, kernel_size=(1,), stride=(1,))
(activation): Swish()
)
(norm_ff): LayerNorm((256,), eps=1e-12, elementwise_affine=True)
(norm_mha): LayerNorm((256,), eps=1e-12, elementwise_affine=True)
(norm_ff_macaron): LayerNorm((256,), eps=1e-12, elementwise_affine=True)
(norm_conv): LayerNorm((256,), eps=1e-12, elementwise_affine=True)
(norm_final): LayerNorm((256,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
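As a sanity check on the size column in the table above: with the dimensions visible in this log (d_model=256, d_ff=2048) and assuming 12 encoder blocks (the block count is my assumption), one extra unshared FFN per block accounts for almost exactly the 29.7M → 42.3M gap:

```python
# Rough count for one extra (unshared) macaron FFN per encoder block,
# using the Linear shapes printed in the log above; 12 blocks is an assumption.
d_model, d_ff, num_blocks = 256, 2048, 12

w1 = d_model * d_ff + d_ff        # Linear(256 -> 2048), weights + bias
w2 = d_ff * d_model + d_model     # Linear(2048 -> 256), weights + bias
extra_per_block = w1 + w2         # 1,050,880 parameters
print(extra_per_block * num_blocks)  # 12610560 ~= 42.3M - 29.7M
```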
Case closed 😄.
Thank you for the thorough investigation! In this case, I think the extra parameters come from the three extra FFN layers introduced by Macaron and the new feed-forward module (Conformer has 4 FFN layers while Transformer has only 1). To minimize the effect of the parameter count, we could just use one FFN for the feed-forward module. I'm also curious whether using the Macaron module instead of a single FFN brings improvements on AISHELL (like the experiments in Table 5 of the original paper).
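For a concrete sense of the budgets involved, here is a tiny helper (my own illustration, not what either toolkit does by default) comparing one full-width FFN against two half-width macaron FFNs; the two come out nearly identical per block:

```python
def ffn_params(d_model, d_ff):
    # Two biased linear layers: d_model -> d_ff -> d_model.
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

d_model = 256
single_ffn = ffn_params(d_model, 2048)        # plain block: one full-width FFN
macaron_pair = 2 * ffn_params(d_model, 1024)  # macaron: two half-width FFNs
print(single_ffn, macaron_pair)  # 1050880 1051136 -> nearly the same budget
```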
This is inspired by Section 2.2 of *Conformer: Convolution-augmented Transformer for Speech Recognition*, which describes this technique.
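For reference, a minimal PyTorch sketch of that convolution module (pointwise conv → GLU → depthwise conv → BatchNorm → Swish → pointwise conv), with shapes matching the `ConvolutionModule` printed above; this is my own illustration, not the code in this PR:

```python
import torch
from torch import nn


class ConvModuleSketch(nn.Module):
    """Conformer convolution module (Sec. 2.2): pointwise -> GLU -> depthwise
    -> BatchNorm -> Swish -> pointwise, with a residual connection outside."""

    def __init__(self, channels=256, kernel_size=15):
        super().__init__()
        self.pointwise_conv1 = nn.Conv1d(channels, 2 * channels, kernel_size=1)  # GLU halves this
        self.depthwise_conv = nn.Conv1d(channels, channels, kernel_size,
                                        padding=(kernel_size - 1) // 2, groups=channels)
        self.norm = nn.BatchNorm1d(channels)
        self.activation = nn.SiLU()  # Swish
        self.pointwise_conv2 = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):             # x: (batch, time, channels)
        y = x.transpose(1, 2)         # Conv1d expects (batch, channels, time)
        y = nn.functional.glu(self.pointwise_conv1(y), dim=1)
        y = self.activation(self.norm(self.depthwise_conv(y)))
        y = self.pointwise_conv2(y)
        return x + y.transpose(1, 2)  # residual


out = ConvModuleSketch()(torch.randn(4, 100, 256))  # quick shape check
```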
The best AISHELL result up to 10 epochs is here:
My config is `examples/asr/aishell/configs/mtl_transformer_sp.json`; the context is: