Closed: John1684 closed this issue 1 week ago
Hello @John1684,
There are two reasons.
@snoop2head Thank you for your response. What do you think are the reasons that the Conformer is better than the Transformer?
@John1684
Two primary distinctions between the Transformer and Conformer models are:
The Conformer incorporates convolutional layers to capture local information, as emphasized in the following:
While Transformers are good at modeling long-range global context, they are less capable to extract fine-grained local feature patterns. Convolution neural networks (CNNs), on the other hand, exploit local information and are used as the de-facto computational block in vision. They learn shared position-based kernels over a local window which maintain translation equivariance and are able to capture features like edges and shapes.
I recommend the papers below, which include analyses of how CNNs focus on locality while self-attention mechanisms handle long-range global context. These analyses are based on perturbing inputs and observing the shift in model decisions [1], frequency analyses of Fourier-transformed feature maps [2], and mean attention distance [3]. In the case of our paper, we included mean attention distance analyses in Figure 4 to show that each layer in our Transformer encoder attends to both global and local information.
[1] ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness (ICLR 2019)
[2] How Do Vision Transformers Work? (ICLR 2022)
[3] Do Vision Transformers See Like Convolutional Neural Networks? (NeurIPS 2021)
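To make the locality point above concrete, here is a minimal NumPy sketch (not the authors' implementation, and not real Conformer code) of a depthwise 1-D convolution, the kind of operation the Conformer's convolution module is built on. Each output frame only sees a small local window of its neighbours, in contrast to self-attention, which mixes all time steps at once:

```python
import numpy as np

def depthwise_conv1d(x, kernels):
    """Depthwise 1-D convolution: each channel is filtered by its own
    kernel over a local window, so every output frame depends only on
    a few neighbouring frames (a local receptive field)."""
    T, C = x.shape
    K = kernels.shape[1]                   # odd kernel size, one kernel per channel
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))   # 'same' padding along time
    out = np.empty_like(x)
    for t in range(T):
        window = xp[t:t + K]               # (K, C) local window around frame t
        out[t] = (window * kernels.T).sum(axis=0)
    return out

x = np.random.randn(10, 4)                 # (time, channels)
kernels = np.zeros((4, 3))
kernels[:, 1] = 1.0                        # centre-tap (identity) kernel
y = depthwise_conv1d(x, kernels)
assert np.allclose(y, x)                   # identity kernel reproduces the input
```

The centre-tap check is just a sanity test; any learned kernels would instead extract local patterns such as short articulatory transitions.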
The Conformer is better than the Transformer because it focuses more on the localities of input sequences. As we stated in our paper, distinguishing fine-grained homophenes is crucial in the field of visual speech recognition. Our method encodes discriminative local information into the encoder's representation by adding an auxiliary classifier and using discrete audio as labels. Likewise, since the Conformer has CNN layers within it, I believe it exploits local features effectively, thus contributing to better recognition.
From a theoretical standpoint, splitting the position-wise feedforward module into two parts, as the macaron structure does, is a more efficient design than the legacy single feedforward layer [4]. However, its impact on speech recognition tasks seems to be marginal, as shown in the ablation study in the Conformer paper attached above.
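The macaron structure mentioned above can be sketched as follows. This is a simplified NumPy illustration (layer norm, dropout, and the real attention/convolution modules are omitted, and the weights here are random placeholders): the vanilla Transformer's single feedforward module is split into two half-step feedforward modules, each with its residual contribution scaled by 1/2, sandwiching the core module:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 5

def ffn(x, W1, W2):
    # Position-wise ReLU feedforward applied independently to each frame
    return np.maximum(x @ W1, 0.0) @ W2

# Random placeholder weights for the two half-step feedforward modules
W1a, W2a = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))
W1b, W2b = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))

def macaron_block(x, core):
    """Macaron structure: two half-step FFNs (residuals scaled by 1/2)
    sandwich the core module (self-attention plus convolution in the
    Conformer), instead of one full-step FFN after the core."""
    x = x + 0.5 * ffn(x, W1a, W2a)   # first half-step feedforward
    x = core(x)                      # e.g. self-attention / conv module
    x = x + 0.5 * ffn(x, W1b, W2b)   # second half-step feedforward
    return x

x = rng.normal(size=(T, d))
y = macaron_block(x, core=lambda h: h)   # identity core, for illustration only
assert y.shape == x.shape
```

The 1/2 scaling is what [4] motivates from the multi-particle dynamics view: each half-step FFN corresponds to half an update step of the underlying ODE solver.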
[4] Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View
Thank you very much.😊
Hello, I have looked at the content of #11, and I have a question. You use the audio reconstruction loss term, to some extent, to enable the Transformer to achieve relatively good performance, but why use the Conformer instead of the Transformer for sentence-level lip reading tasks?