KAIST-AILab / SyncVSR

SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization (Interspeech 2024)
https://www.isca-archive.org/interspeech_2024/ahn24_interspeech.pdf
MIT License

[Training] Model architecture question for sentence-level VSR tasks #19

Closed: John1684 closed this issue 1 week ago

John1684 commented 1 week ago

Hello, I have looked at the content of #11, and I have a question. You use the audio reconstruction loss term to help the Transformer achieve relatively good performance, but why did you use the Conformer instead of the Transformer for sentence-level lip-reading tasks?

snoop2head commented 1 week ago

Hello @John1684,

There are two reasons.

  1. For sentence-level VSR, we intentionally used the same architecture and development setup as AutoAVSR, which uses a Conformer, to ensure a fair comparison. This allowed us to highlight the data efficiency of our training framework without introducing confounding architectural differences, particularly in the LRS3 results.
  2. For word-level VSR, we tested the conformer during the rebuttal period following a reviewer's suggestion, and the results were indeed better. However, due to Interspeech 2024 camera-ready guidelines, which state "Major revisions are NOT permitted (even if requested by a Reviewer), such as: inclusion of new research, new experimental results...", we couldn't include these results in the paper. We plan to leave them for future work instead.
John1684 commented 1 week ago

@snoop2head Thank you for your response. What do you think are the reasons that the Conformer is better than the Transformer?

snoop2head commented 1 week ago

@John1684

Two primary distinctions between the Transformer and Conformer models are:

  1. The inclusion of a convolutional block.
  2. The use of macaron feedforward modules instead of a single feedforward layer.

1. Insertion of convolution block

*(screenshot: Conformer block architecture, showing the convolution module)*

The Conformer incorporates convolutional layers to capture local information, as emphasized in the following passage from the Conformer paper:

> While Transformers are good at modeling long-range global context, they are less capable to extract fine-grained local feature patterns. Convolution neural networks (CNNs), on the other hand, exploit local information and are used as the de-facto computational block in vision. They learn shared position-based kernels over a local window which maintain translation equivariance and are able to capture features like edges and shapes.
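
For illustration, here is a minimal PyTorch sketch of the Conformer-style convolution module described above (module and argument names are mine, not taken from our codebase):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConformerConvModule(nn.Module):
    """Minimal sketch of the Conformer convolution module:
    LayerNorm -> pointwise conv -> GLU -> depthwise conv -> BatchNorm
    -> Swish -> pointwise conv, wrapped in a residual connection."""

    def __init__(self, dim: int, kernel_size: int = 31):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise1 = nn.Conv1d(dim, 2 * dim, 1)  # width doubled for GLU gating
        self.depthwise = nn.Conv1d(
            dim, dim, kernel_size, padding=kernel_size // 2, groups=dim
        )
        self.batch_norm = nn.BatchNorm1d(dim)
        self.pointwise2 = nn.Conv1d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); Conv1d expects (batch, dim, time)
        y = self.norm(x).transpose(1, 2)
        y = F.glu(self.pointwise1(y), dim=1)            # gated linear unit
        y = F.silu(self.batch_norm(self.depthwise(y)))  # Swish activation
        y = self.pointwise2(y).transpose(1, 2)
        return x + y  # residual over the whole module
```

The depthwise convolution is what gives each frame a fixed local receptive field (here a window of 31 frames), which is exactly the locality bias the quoted passage attributes to CNNs.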

I recommend the papers below, which analyze how CNNs focus on locality while self-attention mechanisms handle long-range global context. These analyses are based on perturbing inputs and observing shifts in model decisions [1], frequency analyses of Fourier-transformed feature maps [2], and mean attention distance [3]. In our paper, we included a mean attention distance analysis in Figure 4 to show that each layer of our Transformer encoder attends to both global and local information.
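
For reference, mean attention distance [3] can be computed from softmax attention maps roughly as follows (this is my own minimal sketch, not code from our repo):

```python
import torch

def mean_attention_distance(attn: torch.Tensor) -> torch.Tensor:
    """Mean attention distance per head, following Raghu et al. [3].

    attn: attention weights of shape (batch, heads, query_len, key_len),
          each row summing to 1 after softmax.
    Returns a tensor of shape (heads,): the average token distance each
    head attends over. Large values indicate global heads, small local ones.
    """
    _, _, q_len, k_len = attn.shape
    # |i - j| distance between every query position i and key position j
    q_idx = torch.arange(q_len).unsqueeze(1)  # (q_len, 1)
    k_idx = torch.arange(k_len).unsqueeze(0)  # (1, k_len)
    dist = (q_idx - k_idx).abs().float()      # (q_len, k_len)
    # attention-weighted distance, averaged over queries and the batch
    return (attn * dist).sum(-1).mean(dim=(0, 2))
```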

[1] ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness (ICLR 2019)
[2] How Do Vision Transformers Work? (ICLR 2022)
[3] Do Vision Transformers See Like Convolutional Neural Networks? (NeurIPS 2021)

The Conformer outperforms the Transformer because it focuses more on the local structure of input sequences. As we stated in our paper, distinguishing fine-grained homophenes is crucial in visual speech recognition. Our method encodes discriminative local information into the encoder's representation by adding an auxiliary classifier that uses discrete audio tokens as labels (a rough sketch follows below). Likewise, since the Conformer has CNN layers built in, I believe it exploits local features effectively, which contributes to better recognition.
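
To make that auxiliary objective concrete, here is a rough, hypothetical sketch of frame-level audio token classification; the names, shapes, and the single linear head are illustrative assumptions, not the actual SyncVSR implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed inputs (hypothetical shapes):
# video_features: frame-level visual encoder output, (batch, time, dim)
# audio_tokens:   frame-aligned discrete audio tokens from a neural audio
#                 quantizer, (batch, time), integer values in [0, vocab_size)

class AudioTokenHead(nn.Module):
    """Auxiliary classifier that predicts the quantized audio token for each
    video frame, pushing fine-grained local information into the visual
    encoder's representation."""

    def __init__(self, dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(dim, vocab_size)

    def forward(self, video_features: torch.Tensor,
                audio_tokens: torch.Tensor) -> torch.Tensor:
        logits = self.proj(video_features)  # (batch, time, vocab_size)
        return F.cross_entropy(logits.flatten(0, 1), audio_tokens.flatten())

# The auxiliary term is then added to the main recognition loss, e.g.:
# total_loss = vsr_loss + lambda_audio * audio_head(video_features, audio_tokens)
```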

2. Use of macaron feedforward modules instead of a single feedforward layer

*(screenshot: ablation of the macaron feedforward module, from the Conformer paper)*

From a theoretical standpoint, splitting the position-wise feedforward module into two half-step parts, as the macaron structure does, is a more efficient design than the legacy single feedforward layer [4]; a minimal sketch of the structure is given after the reference below. However, its impact on speech recognition tasks seems marginal, as shown in the ablation study in the Conformer paper attached above.

[4] Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View
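
As an illustration, here is a minimal PyTorch sketch of the half-step macaron feedforward module (names and hyperparameters are my assumptions, following the Conformer paper's description):

```python
import torch
import torch.nn as nn

class MacaronFFN(nn.Module):
    """Half-step feedforward module, used twice per Conformer block:
    once before self-attention and once after the convolution module."""

    def __init__(self, dim: int, hidden: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden),
            nn.SiLU(),              # Swish activation
            nn.Dropout(dropout),
            nn.Linear(hidden, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # macaron design: half-step (0.5x) residual instead of a full residual
        return x + 0.5 * self.net(x)

# A Conformer block then reads roughly:
#   x = macaron_ffn1(x); x = x + self_attn(x); x = conv_module(x)
#   x = macaron_ffn2(x); x = final_layer_norm(x)
```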

John1684 commented 1 week ago

Thank you very much.😊