JisuHann / One-day-One-paper

Review paper

MulT: Multimodal Transformer for Unaligned Multimodal Language Sequences (ACL2019) #13

Closed JisuHann closed 3 years ago

JisuHann commented 3 years ago

MulT: Multimodal Transformer for Unaligned Multimodal Language Sequences

Problem

  • human language is multimodal (text, visual, acoustic), but the streams are sampled at different rates, so the sequences are unaligned in time
  • most prior work first force-aligns the streams at the word level and struggles with long-range dependencies across modalities

Goal

  • fuse unaligned multimodal sequences end-to-end, without explicit alignment

In this paper..

  • MulT extends the standard Transformer with directional pairwise crossmodal attention, which latently adapts one modality's stream to another across distinct time steps

Overall Architecture

  1. Temporal Convolutions (see the first sketch after this list)
    • why? to ensure that each element of the input sequence has sufficient awareness of its neighboring elements -> to capture the local structure of the sequence (and to project all modalities to a common feature dimension)
  2. Positional Embedding (also covered in the first sketch)
    • why? so that the sequences carry temporal (order) information
  3. Crossmodal Transformers (second sketch below)
    • why?
      • to let one modality receive information from another modality
      • each modality keeps updating its sequence with low-level external information from the multi-head crossmodal attention module -> correlates meaningful elements across modalities
    • 6 crossmodal transformers (3 modalities; each ordered pair of modalities forms one crossmodal interaction)
  4. Self-Attention Transformers and Prediction (third sketch below)
    • the two crossmodal outputs that target the same modality are concatenated and passed through a self-attention transformer; the last elements of the resulting sequences are extracted and passed through fully-connected layers to make the prediction
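
A minimal PyTorch sketch of steps 1-2, assuming each modality arrives as a (batch, seq_len, feature) tensor; the module and argument names here are hypothetical, and the released MulT code differs in details:

```python
import math
import torch
import torch.nn as nn

class TemporalConvEmbed(nn.Module):
    """Step 1 + 2: 1D conv for local structure, then sinusoidal positions.
    Illustrative sketch; d_model is assumed even."""
    def __init__(self, in_dim, d_model, kernel_size=3, max_len=512):
        super().__init__()
        # Conv over the time axis so each element sees its neighborhood,
        # and all modalities get projected to a common dimension d_model
        self.conv = nn.Conv1d(in_dim, d_model, kernel_size,
                              padding=kernel_size // 2)
        # Fixed sinusoidal positional encodings, as in the original Transformer
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):  # x: (batch, seq_len, in_dim)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # -> (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]                   # add temporal information
```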
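A sketch of one crossmodal transformer layer, assuming queries come from the target modality and keys/values from the source (as the paper describes); the layer norms and hyperparameters are illustrative, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class CrossmodalBlock(nn.Module):
    """The target modality queries the source modality, so the target keeps
    updating its sequence with low-level information from the other stream."""
    def __init__(self, d_model=40, n_heads=5, ff_mult=4, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, ff_mult * d_model), nn.ReLU(),
            nn.Linear(ff_mult * d_model, d_model),
        )
        self.norm_q = nn.LayerNorm(d_model)
        self.norm_kv = nn.LayerNorm(d_model)
        self.norm_ff = nn.LayerNorm(d_model)

    def forward(self, target, source):
        # Queries from the target modality; keys/values from the source
        q, kv = self.norm_q(target), self.norm_kv(source)
        attn_out, _ = self.attn(q, kv, kv)
        target = target + attn_out  # residual: target is updated, not replaced
        return target + self.ff(self.norm_ff(target))
```

With three modalities (L, V, A), six such blocks run in parallel, one per ordered pair: V→L, A→L, L→V, A→V, L→A, V→A.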
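And a sketch of step 4 for one target modality, assuming the two crossmodal outputs that target it (e.g. A→L and V→L) are concatenated feature-wise; in the full model, the last elements from all three target modalities feed the final fully-connected layers:

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Self-attention over the concatenated crossmodal outputs, then FC
    layers on the last time step. Hyperparameters are illustrative."""
    def __init__(self, d_model=80, n_heads=5, n_layers=3, out_dim=1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.self_attn = nn.TransformerEncoder(layer, n_layers)
        self.fc = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                nn.Linear(d_model, out_dim))

    def forward(self, z_a_to_l, z_v_to_l):
        # Concatenate the two crossmodal streams that target the same modality
        z = torch.cat([z_a_to_l, z_v_to_l], dim=-1)  # (batch, seq, 2 * 40)
        h = self.self_attn(z)
        return self.fc(h[:, -1])                     # last element -> prediction
```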