Summary

LMMs are built on pre-trained LLMs, so they often inherit the LLM's hallucination problem. To mitigate this, the paper defines three hallucination scenarios, 1) nonexistent object manipulation, 2) existent object manipulation, and 3) knowledge manipulation, and adds negative instructions for each, proposing a visual instruction tuning approach that is more robust to hallucination. It also proposes GAVIE, a GPT4-assisted visual instruction evaluation method, to better assess the robustness of LMMs. GAVIE can produce assessments close to human evaluation without human annotation or a specially pre-designed format.

Experiments show that the tuned model is more robust to hallucination and more effective than existing models, and reveal, among other findings, that hallucination in existing LMMs arises under existent object manipulation and knowledge manipulation.

Method Highlights

Instruction data generation

Both positive and negative instructions are generated and used for visual instruction tuning.

Text-only GPT4 is used: descriptions of the image from Visual Genome, such as bounding boxes and dense captions, are passed in as the visual context, and GPT4 is then prompted to generate instructions.
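The prompting setup above can be sketched roughly as follows. This is a minimal illustration and not the authors' actual prompt: the `Region` type, the `build_visual_context` and `build_prompt` helpers, the coordinate layout, and the prompt wording are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    """One dense-caption region of an image (field names are hypothetical)."""
    caption: str
    x: int
    y: int
    width: int
    height: int


def build_visual_context(regions: List[Region]) -> str:
    """Serialize captions and boxes as plain text so a text-only LLM can read them."""
    return "\n".join(
        f"{r.caption} [X: {r.x}, Y: {r.y}, Width: {r.width}, Height: {r.height}]"
        for r in regions
    )


def build_prompt(regions: List[Region], task: str) -> str:
    """Put the visual context first, then the instruction-generation task."""
    context = build_visual_context(regions)
    # The wording below is illustrative only, not the prompt used in the paper.
    return (
        "Image description (captions with bounding boxes):\n"
        f"{context}\n\n"
        f"Task: {task}"
    )


if __name__ == "__main__":
    regions = [Region("a dog on the grass", 10, 20, 100, 50)]
    print(build_prompt(regions, "Generate instructions with answers about this image."))
```

The resulting string would be sent to the text-only model in place of the image itself, which is why the box coordinates are written out as text.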

positive instruction

To diversify the images, chart images (using human-annotated captions as the visual context) and news images (which contain many entities) are added.

negative instruction

non-existent object manipulation
- Cases where a question or instruction asks about an object that does not exist in the image.
- Covers objects, activities, attributes, and interactions that are not present in the image.
- GPT4 is asked to give a reason as well, and that reason becomes the answer in the instruction data.
existent object manipulation
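- Defined as cases where an attribute is mismatched for an object that does exist in the image.
- For scenario 2, the instruction takes the form of replacing the "non-existent" part with "existing object ~~".
- GPT4 is asked to give a reason as well, and that reason becomes the answer in the instruction data.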
knowledge manipulation
- Cases where the name of a person appearing in the image, or a fact, is distorted.
- GPT4 is first instructed to alter the knowledge in the original caption, including entities, events, and keywords, and is then instructed to turn the result into a question. → That output becomes the instruction in the final instruction data used for training.
- The answer then takes the form "No. [original caption]".
- e.g.) Question: Did the image show the cumulative influenza cases in France by region of infection from March to October 2020? Answer: No. Cumulative COVID-19 cases in Japan by place of infection from April to October 2020.
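The final assembly step for a knowledge manipulation sample can be sketched as below. This is only an illustration under assumptions: `make_knowledge_negative` and the `instruction`/`answer` keys are hypothetical names, and the two GPT4 calls that alter the caption and turn it into a question are not reproduced here, so the manipulated question is taken as given.

```python
from typing import Dict


def make_knowledge_negative(manipulated_question: str, original_caption: str) -> Dict[str, str]:
    """Pair a manipulated question with a "No. <original caption>" answer.

    The dictionary keys are hypothetical; the source only describes the
    overall question/answer format.
    """
    return {
        "instruction": manipulated_question.strip(),
        "answer": f"No. {original_caption.strip()}",
    }


if __name__ == "__main__":
    sample = make_knowledge_negative(
        "Did the image show the cumulative influenza cases in France "
        "by region of infection from March to October 2020?",
        "Cumulative COVID-19 cases in Japan by place of infection "
        "from April to October 2020",
    )
    print(sample["instruction"])
    print(sample["answer"])
```

Because the answer restates the original caption, the model is trained to correct the distorted fact as well as reject it.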

Final data statistics

Comparison with other visual instruction tuning methods

GAVIE

Experiment Highlights

When evaluated with GAVIE, the model fine-tuned on LRV-Instruction (Ours) achieves the best results.

The model also handles negative instructions well.

GAVIE aligns well with human evaluation.

GAVIE's scoring also stays reasonably stable across multiple runs.
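One simple way to quantify this kind of run-to-run stability is to compare the mean and standard deviation of the scores from repeated evaluations. The sketch below assumes each run yields a single numeric score; `summarize_runs` is a hypothetical helper, and the numbers in the usage example are made-up placeholders, not results from the paper.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def summarize_runs(scores_per_run: Sequence[float]) -> Dict[str, float]:
    """Return the mean and population standard deviation over repeated runs."""
    if not scores_per_run:
        raise ValueError("need at least one run")
    return {"mean": mean(scores_per_run), "std": pstdev(scores_per_run)}


if __name__ == "__main__":
    # Placeholder scores for illustration only (not taken from the paper).
    print(summarize_runs([5.0, 7.0]))
```

A small standard deviation relative to the mean would indicate that the evaluator scores the same outputs consistently.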

Strengths

Broad task scope, with a large number of images and instances.
Systematizes hallucination scenarios and builds a dataset that includes negative instructions to address them.
Proposes a flexible evaluation protocol that is close to human evaluation.

Weaknesses

The risk of hallucination arising during instruction generation itself probably remains.
For non-natural images such as charts, having to rely on captions likely reduces the LLM's understanding of the image. -> What about using an LMM instead?

YoojLee / paper_review

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning (2024) #76