[20230413] Weekly VLM2 - Flamingo

SoongE commented 1 year ago

Paper Flamingo: a Visual Language Model for Few-Shot Learning (a.k.a. Flamingo)

Speaker @SoongE

Summary

Freezing the vision and language models
- Vision Encoder:
  - Pretrained with a contrastive objective, using BERT as the text encoder
  - Trained on ALIGN + LTIP with gradient accumulation
- Fine-tuning the pretrained models or training them from scratch, instead of freezing them, results in a very large performance drop. They attribute this to catastrophic forgetting that occurs when the models are trained on the new objective.
Perceiver Resampler
- Returns a fixed-shape output regardless of the size of the visual input
- Uses a fixed number of latent queries
- Empirically works better than conventional attention
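The fixed-output-shape property above can be illustrated with a minimal, dependency-free sketch. This is not the paper's implementation: it uses a single attention head, no learned projections, and no feed-forward layer, and the names `cross_attend`, `resample`, `NUM_LATENTS`, and `DIM` are hypothetical. The only point is that the output has one row per latent query, however many visual tokens come in.

```python
import math
import random

NUM_LATENTS = 4  # hypothetical; number of latent queries (fixed)
DIM = 8          # hypothetical; feature dimension


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(latents, features):
    """Single-head attention: latents are queries, features are keys and values."""
    d = len(latents[0])
    out = []
    for q in latents:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in features]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, features)) for j in range(d)])
    return out


def resample(features, latents):
    """Residual cross-attention; output shape depends only on the latents."""
    attended = cross_attend(latents, features)
    return [[l + a for l, a in zip(lat, att)] for lat, att in zip(latents, attended)]


random.seed(0)
latents = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_LATENTS)]
for num_tokens in (10, 50):  # e.g. one image vs. several video frames
    features = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(num_tokens)]
    out = resample(features, latents)
    print(num_tokens, "tokens ->", len(out), "x", len(out[0]))
```

Both calls return `NUM_LATENTS` vectors of size `DIM`, so the language model always sees the same number of visual tokens.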
Gated Cross-Attention
- Tanh gate: borrowed from long short-term memory (LSTM)
- Has a normalization effect
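A rough sketch of the tanh gate, under the assumption that the gate is a learnable scalar `alpha` applied to the cross-attention output before the residual addition. The function name `gated_residual` and the plain-list vectors are hypothetical; the cross-attention itself is taken as given.

```python
import math


def gated_residual(hidden, cross_attn_out, alpha):
    """Add the cross-attention output to the LM hidden state, scaled by tanh(alpha).

    hidden:         hidden state from the frozen language model (list of floats)
    cross_attn_out: output of the new cross-attention layer (same length)
    alpha:          scalar gate parameter (assumed learnable)
    """
    gate = math.tanh(alpha)  # always bounded in [-1, 1]
    return [h + gate * c for h, c in zip(hidden, cross_attn_out)]


hidden = [0.5, -1.0, 2.0]
visual = [10.0, 10.0, 10.0]

# With alpha = 0 the gate is closed and the frozen LM's hidden state passes through unchanged.
print(gated_residual(hidden, visual, alpha=0.0))
# A larger alpha lets more visual information in, but the scale never exceeds 1.
print(gated_residual(hidden, visual, alpha=0.5))
```

Because `tanh` is bounded, the contribution of the new layer stays within a fixed scale, which is one way to read the "normalization effect" noted above.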
Train on mixture of datasets
- Each dataset is weighted differently according to its size and quality (M3W, ALIGN, LTIP and VTP with weights 𝜆𝑚 of 1.0, 0.2, 0.2 and 0.03, respectively).
- M3W: interleaved image-text
  - Dataset of 43M HTML webpages
- ALIGN and LTIP: image-text pairs
  - ALIGN: large and low quality
  - LTIP: small and high quality
- VTP: video-text pairs
  - 27M short videos, about 22 seconds each
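One way to read the per-dataset weights above is as coefficients in a weighted sum of per-dataset losses. The sketch below assumes exactly that; the function name `mixture_loss` and the example loss values are made up for illustration and are not results from the paper.

```python
# Per-dataset weights as listed in the notes above.
WEIGHTS = {"M3W": 1.0, "ALIGN": 0.2, "LTIP": 0.2, "VTP": 0.03}


def mixture_loss(per_dataset_loss, weights=WEIGHTS):
    """Weighted sum of per-dataset losses; each key must have a weight."""
    return sum(weights[name] * loss for name, loss in per_dataset_loss.items())


# Hypothetical loss values for one training step, one batch per dataset.
step_losses = {"M3W": 2.0, "ALIGN": 3.0, "LTIP": 2.5, "VTP": 4.0}
print(mixture_loss(step_losses))
```

With these weights, the interleaved M3W data dominates the objective, while the video-text data contributes only a small fraction per step.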
Strengths and weaknesses
Strengths
- Performs well on many downstream tasks
Weaknesses
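- Inherits all the side effects of the underlying LM.
- Classification performance is worse than CLIP's.
- Outside the few-shot setting, task-specific models can perform better.
- The training datasets and the model itself are both very large, which makes a fair comparison difficult.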