Summary

Getting its popularity of AI assistants based on instruction following LLMs, and rise of demands of LMM AI assistants (visual + textual understanding) → however, lack of multimodal instruction following datasets (painpoint)
built multimodal instruction following datasets with text-only LLM
propose LLaVa: Large Language and Vision Assistant, an end-to-end trained large multimodal model → showed its superiority on ~~ (정확히 어떤 벤치마크에 성능이 좋았는지는 더 봐야 알 것 같음)
construct two evaluation benchmarks with diverse and challenging application-oriented tasks

Method Highlights

Multi-modal instruction following datasets (GPT-assisted Visual Instruction Data Generation)

image-pair data 기반으로 LLM에 feed해서 instruction-following data collection 제안
Image $\bf{\mathrm{X}}_v$ and its associated caption $\bf{\mathrm{X}}_c$ 가 주어진다면, 해당 이미지를 describe하도록 assistant를 instruct하는 형태의 question data인 $\bf{\mathrm{X}}_q$를 만들어내는 건(by prompting GPT4 to curate such a list of questions) 그렇게 어렵지 X.
naive version
- Human: $\bf{\mathrm{X}}_q$ $\bf{\mathrm{X}}_v$ Assistant: $\bf{\mathrm{X}}_c$
- 구축하기는 쉽지만, diversity와 in-depth reasoning이 부족함 (instruction과 responses 모두에)
  - diversity가 부족하다는 뜻은 question과 answer가 too bland? 다 비슷비슷하다? 이건가..
  - in-depth reasoning 부족하다는 것도 정확히 어떤 뜻인지 좀 궁금함.
해당 Naive version의 문제를 해결하기 위해서, language-only GPT-4와 ChatGPT를 strong teacher로 활용
- visual feature를 text-only GPT에 feed해주기 위해서 다음과 같이 두 가지 타입의 symbolic representations를 활용함. (COCO 이미지 활용)
  - Caption → 전체적인 visual scene에 대한 정보를 넣어주기 위함.
  - Bounding boxes → object에 대한 정보를 넣어주기 위함.
- three types of instruction-following data: 이 경우에는 처음에 몇 개 example은 manually design해서 넣어줌 (these are the only human annotations they have during data collections)
  - Conversation
    - assistant의 답변은 이미지를 보고 주어진 질문에 답하는 형태로 구성함.
    - 이미지에 대한 질문은 object type에 대한 질문, object counting, object actions, object locations, relative positions between objects 등을 포함.
    - 답이 확실히 있는 질문만 하도록 설계함.
  - Detailed Description
    - Created a list of questions with such an intent to include a rich and comprehensive description for an image.
  - Complex Reasoning
    - 앞의 두 타입의 instruction following data를 기반으로 in-depth reasoning questions를 생성. (이 경우에는 정답이 step by step의 reasoning process를 따라가는 형식)
- 예시
총 158K개의 unique한 language-image instruction-following sample 수집 → 58K in conversations, 23K in detailed description, and 77k in complex reasoning, respectively.

GPT-4가 consistent하게 high-quality instruction-following data를 생성해줌.
image-based conversation 생성하는 프롬프트 예시

→ 일단 “role”: “system”으로 날리고, content로 assistant에 상황을 가정해서 tone을 다듬는 과정이 있는 것 같음.

→ instruction following data라는 게 별 게 아니라 instruction - answer datasets를 얘기하는 것 같음.

Visual Instruction Tuning

VE: CLIP visual encoder ViT-L/14
LLM과 VE의 embedding space를 align하기 위한 learnable projection $\mathrm{W}$ 추가
- lightweight → more sophisticated schemes to be considered for further improvement
Training

generate multi-turn conversation data

where T is the total number of turns. We organize them as a sequence,

$L$이 sequence 길이를 의미함. 앞서 sequence가 multi-turn conversation을 엮어낸 걸로 봤을 때 3번 식은 이전까지의 instruction (정답을 요구하는 instruction)과 answer를 의미함. → 하나의 conversation에 대한 objective function이 이렇게 계산이 될 거 같고..

→ 3번 식에서의 conditionals에 $\mathrm{X}_v$가 들어가는 이유는 정말 loss를 compute할 때 이미지가 다 들어간다는 게 아니라, 모든 답변은 image에 ground되어 있다는 점을 강조하기 위해서임.

→ 여기에도 토큰이랑 system message 부분도 실제로는 들어가는데 수식에서는 생략함.

→ model이 답변을 생성하고 어디에서 멈출지를 학습하는 형태로 가게 되어 있다.

Training의 경우에는 two-stage 구조를 채택함. Feature alignment를 위한 stage 1, Fine-tuning을 위한 stage 2.
- Stage 1
  - filter CC3M to 595K
  - QA pairs converted to the instruction-following data using the naive expansion method (Each sample treated as a single-turn conversation)
  - Keep VE and LLM weights frozen, only trained projection matrix.
  - Image features can be aligned with the pre-trained LLM word embedding.
    
    (약간 Q-Former를 엄청 간단하게 구현해놓은 느낌)
  - Deemed as training a compatible visual tokenizer for the frozen LLM (online visual tokenizer)
- Stage 2
  - keep VE frozen, update projection and LLM
  - 두 개의 사용 시나리오를 가정
    1. Multimodal Chatbot: fine-tuned the model on the 158K language-image instruction-following data. 각 three type을 uniformly sampling해서 모델에 feed해줌. 이때, conversation 타입은 멀티턴이고 나머지는 싱글턴.
    2. Science QA: ScienceQA 벤치마크의 question & context를 instruction으로 만들어주고 reasoning & answer를 $\mathrm{X}_a$로 만들어줌. 멀티턴으로 구성이 안되니까 그냥 single turn conversation으로 만들어줌.

Experiment Highlights

train all models with 8 $\times$ A100s
training details
- pre-train on filtered CC-595K subset for 1 epoch with lr 2e-3 and a batch size of 128
- fine-tune on the proposed LLaVA-Instruct-158K dataset for 3 epochs, with a learning rate of 2e-5 and a batch size of 32

Multimodal Chatbot

정성 평가 → LLaVA: more comprehensive response than GPT-4 and follow instructions well unlike BLIP-2 and OpenFlamingo

→ 해당 이미지는 out-of-domain의 이미지임에도 불구하고 LLaVA가 이미지를 잘 이해하고 instruction을 잘 따르고 있음을 알 수 있음 (BLIP-2와 OpenFlamingo는 이미지에 대한 묘사를 할 뿐 instruction을 따라서 대답하는 형태는 아님).
정량 평가
- evaluation protocols & metrics
  
  Inspired by Vicuna, 평가를 위해 이미지 - gt textual descriptions - question 의 triplet을 구성. text-only GPT-4에 gt와 question만을 feed해서 reference prediction (approximate theoretical upper bound)를 생성. 이후, 평가할 모델과 reference prediction을 image(in the format of textual descriptions)-question pair와 같이 text-only GPT-4에 feed해서 1에서 10 사이의 점수로 정량 평가 진행 (The higher, The better). text-only GPT-4 모델을 기준으로 상대 점수를 report.
  - LLaVA-Bench (논문에서 제안)
    
    COCO와 In-the-Wild 두 가지로 구성.
    
    LLaVA-Bench (COCO)의 경우에는 instruction-following data 내부의 3가지 타입의 response에 대해서 ablation을 진행한 결과를 논문에서 제시함. No Instruction Tuning과의 비교를 보았을 때, Instruction Tuning을 하게 될 경우 확실히 세 태스크에서의 성능이 압도적으로 향상되는 걸 확인 가능. 또한, conversation data만을 사용했을 때에 full data로 training한 것 대비 conversation, detailed description, complex reasoning에 있어서 성능이 떨어짐. 특히, Detail Description과 complex reasoning에 있어서 매우 떨어짐.
    
    재미있는 건 Conversation data 없이도 Detailed description과 Complex reasoning type만 training data에 추가가 되어도 conversation 성능에 큰 문제는 없었다. 다만, complex reasoning 같은 경우엔 좀 떨어짐. 세 가지 타입 간의 시너지가 존재함을 확인할 수 있었음.
    
    In the wild 데이터셋에서는 OpenFlamingo와 BLIP-2에 대한 비교를 수행. 전체적으로 LLaVA 성능이 비교 모델 대비 우수.
    
    Limitation으로 꼽는 점은 상품 정보 등 실제로 fact를 알고 있어야 하는 것들이나, 이미지를 bag of patch로 인식하여서 strawberry와 yogurt가 동시에 이미지에 등장하면 straberry-flavored yogurt가 존재한다고 오판하는 문제가 생김. 이미지 내부의 보다 복잡한 semantic을 알고 있는 게 중요할 듯.
  - ScienceQA LLaVA 논문에 따르면, GPT-4와 LLaVA가 다른 대답을 했을 때 이 두 가지를 다시 question과 함께 프롬프팅해서 최종 답변을 하도록 만드는 방식이 어느 정도 CoT(Chain of Thought)와 닮았다고 함 (but with the external knowledge from the other model) → Why? 찾아볼 것. 뭐 하여튼 이 방식이 SOTA 성능을 보여주었다고 함. Text-only GPT-4를 활용하게 되면 이미지를 컨텍스트로 활용하는 QA에서도 성능이 전반적으로 향상됨. → 이는 왜냐하면 이미지를 컨텍스트로 갖는 질문 중에 사실상 정답이 이미지와 관련이 없는 질문도 있기 때문임. 이런 경우에는 이미지가 bias로 작용할 수 있음. 이때, LLaVA의 mistake을 text-only GPT-4가 어느 정도 보정을 해주게 됨. GPT-4를 모델 앙상블링에 활용하는 것도 역시 처음이다 라는 점을 이야기하고 있음. 그리고 이게 모델 성능 향상에 consistently 기여한다는 점도.
Ablation

Before vs. Last Visual feature 같은 경우에는 last feature를 쓰느냐 아니면 before last feature (penultimate feature)를 쓰느냐에 대해서 ablation을 수행하고 있음. Before the last layer가 나음. 이는 last layer는 또 너무 global & abstract한 정보만 담고 있기 때문에 이미지 내 다양한 local한 정보는 반영하지 못할 것으로 추정.
Reasoning order CoT-like reasoning-first strategy (이건 이러이러하고, 저건 저러저러하네, 그래서 답은 이거야 이런 식으로. 추론 먼저 하고 나중에 답을 얘기하는 형태를 의미하는 것 같다.) vs. Predict answer first(두괄식으로 답 먼저 말하고, 그 다음에 reasoning을 진행하는 걸 의미하는 듯.) → answer와 reasoning 간의 순서는 최종 성능에는 영향을 미치지 않지만, 해당 성능에 얼마나 빨리 도착하느냐에는 영향을 미침 (predict reasoning first is better) → 수렴 속도에는 reasoning을 먼저 predict하도록 하는 전략이 더 유리하다는 소리.
Pre-training pre-training을 거쳐서 feature alignment를 진행하지 않고 바로 fine-tuning하는 형태 -> 성능이 좀 많이 떨어짐 (5.11 정도 하락). 이건 근데 실험 세팅을 좀 더 보면 좋을 것 같은 게 training time을 동일하게 가져간다면?? 물론 그렇게 하면 오히려 overfitting이 발생해서 발산할 수도 있고.
Model size 13B vs. 7B (# parameters) -> 13B가 낫다. (Model scale이 좀 중요하더라)

Conclusion

Visual 정보를 담아서 (그리고 잘 이해해서) instruction을 잘 follow해서 answer를 잘 하는 multi-modal chatbot assistant를 잘 구축해보자. 를 성공적으로 수행한 첫번째 논문

YoojLee / paper_review

Visual Instruction Tuning (2023) #68

Summary

Method Highlights

Multi-modal instruction following datasets (GPT-assisted Visual Instruction Data Generation)

Visual Instruction Tuning

Experiment Highlights

Multimodal Chatbot

Conclusion