Large language models are not zero-shot communicators

Problem statement

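Analyze how well LLMs (large language models) understand implicature, i.e. meaning that depends on context.
- Understanding implicature requires understanding context, which makes it a hard task. (Gap to human performance: 24% zero-shot, 9% few-shot.)
  - Q: Have you seen my phone?
  - Wrong: Yes, I have seen your phone.
  - Correct: Yes, it is under the table.
- Compare performance across model scale, number of in-context examples, prompt format, and context-light vs. context-heavy implicatures.
Propose a methodology for measuring an LLM's ability to understand implicature.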

Glossary

implicature: utterances that convey something other than literal meaning

Baseline

Cohere
InstructGPT:
- Motivation: prompt engineering matters, so train the model to respond well to the prompt (user intention).
- approach:
  - A) Fine-tune GPT-3 on (prompt, answer) pairs.
  - B) For 33K prompts, label 4-9 answers each by preference (user intention), then train a Reward Model that predicts this preference.
  - C) Train the model from A) with reinforcement learning, using the Reward Model from B) as the reward function.
- Performs better when the prompt states explicit constraints (e.g. "Write the answer in two sentences or fewer").
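
Step B) can be sketched as a pairwise ranking objective. The notes do not state the loss, so this assumes the logistic pairwise loss -log(sigmoid(r_preferred - r_rejected)) averaged over all pairs of ranked answers; the function names and the reward values are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence


def pairwise_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Logistic pairwise loss: -log(sigmoid(r_preferred - r_rejected))."""
    diff = reward_preferred - reward_rejected
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-diff))


def ranking_loss(rewards_best_to_worst: Sequence[float]) -> float:
    """Average pairwise loss over all pairs of answers to one prompt.

    The rewards are ordered by the human ranking, best answer first,
    so in every pair the first element is the preferred one.
    """
    pairs = list(combinations(rewards_best_to_worst, 2))
    return sum(pairwise_loss(a, b) for a, b in pairs) / len(pairs)


# Hypothetical reward-model scores for 4 answers, listed best to worst.
agrees_with_labels = ranking_loss([3.0, 2.0, 1.0, 0.0])
contradicts_labels = ranking_loss([0.0, 1.0, 2.0, 3.0])
```

The loss is lower when the reward model's scores follow the human ranking, which is the signal that step C) then optimizes against.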

Data details

The authors build their own dataset (https://huggingface.co/datasets/UCL-DARK/ludwig)

Experiment Setting

For each implicature QA pair, the model must infer one of yes / no
prompt sensitivity
- natural prompt e.g. "Esther asked 'Can you come to my party on Friday?' and Juan responded 'I have to work', which means no"
- structured prompt e.g. "Question: Have you found him yet response: They're still looking meaning: no"
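
The two prompt styles can be sketched as plain string templates. The wording follows the two examples above; the function and parameter names are hypothetical, and the templates stop before the final yes/no so the label can be appended or left for the model.

```python
def natural_prompt(asker: str, utterance: str, responder: str, response: str) -> str:
    """Natural-language template, ending right before the yes/no label."""
    return (
        f"{asker} asked '{utterance}' and {responder} "
        f"responded '{response}', which means"
    )


def structured_prompt(utterance: str, response: str) -> str:
    """Structured template, ending right before the yes/no label."""
    return f"Question: {utterance} response: {response} meaning:"


natural = natural_prompt(
    "Esther", "Can you come to my party on Friday?", "Juan", "I have to work"
)
structured = structured_prompt("Have you found him yet", "They're still looking")
```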
Effect of k (number of in-context examples) in few-shot evaluation
Context-light vs. context-heavy (types of implicature)
- Generalised: easy (context-light); Particularised: hard (context-heavy)
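
A minimal sketch of k-shot evaluation under this setting. The notes only say the model must pick yes or no; how the pick is made is an assumption here: both completions are scored and the higher one wins. `score_fn` is a hypothetical stand-in for a model's score of a text (e.g. its log-likelihood), and `toy_score` is a dummy used only to make the example runnable.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str, str]  # (question, response, "yes" or "no")


def render(question: str, response: str, meaning: str = "") -> str:
    """Structured prompt line; the label is omitted for the test query."""
    text = f"Question: {question} response: {response} meaning:"
    return f"{text} {meaning}" if meaning else text


def few_shot_prompt(examples: Sequence[Example], query: Tuple[str, str], k: int) -> str:
    """Prepend the first k labelled examples to the unlabelled query."""
    shots = [render(q, r, m) for q, r, m in examples[:k]]
    return "\n".join(shots + [render(*query)])


def predict(prompt: str, score_fn: Callable[[str], float]) -> str:
    """Return the label whose completed prompt gets the higher score."""
    return max(("yes", "no"), key=lambda label: score_fn(f"{prompt} {label}"))


examples = [("Can you come to my party on Friday?", "I have to work", "no")]
query = ("Have you found him yet", "They're still looking")
prompt = few_shot_prompt(examples, query, k=1)


def toy_score(text: str) -> float:
    # Dummy scorer (not a real model): always prefers texts ending in "no".
    return 1.0 if text.endswith("no") else 0.0


prediction = predict(prompt, toy_score)
```

With k=0 the same function yields the zero-shot prompt, so one code path covers both settings.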

Evaluation

Zero-shot evaluation
- Comparing natural vs. structured prompts, the prompt format does not appear to change the relative trends across models.
- Prompt engineering may improve performance within a given model, but it does not change the model's underlying behavior.
- For InstructGPT, the difference between natural and structured prompts is small.
- In Fig. 2 (left), the largest model is also the best-performing one in only three model families: InstructGPT, Cohere, and T0.
- In the other families, a larger model sometimes performs worse than a smaller one.
- Models other than InstructGPT show limited gains from scaling, whereas InstructGPT improves meaningfully with scale.
- Earlier implicature evaluations tend to overestimate ability because they rely on relatively easy context-light implicatures.
- Regardless of model scale or architecture, models score higher on generalised implicatures, which need little context, than on particularised ones, which need a lot.
  - Cohere-52B: avg 58.5% (gen: 73.9%, par: 51.5%)
  - InstructGPT-3-175B: avg 72.3% (gen: 79.3%, par: 59.7%)
- The gains from scaling LLMs appear to come from having learned many easy examples that need no context resolution.
- The number of in-context examples k improves performance only up to a point (roughly k=5).
- Increasing k did reduce prompt sensitivity.
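
The generalised vs. particularised breakdown can be sketched as a per-type accuracy, followed by the gap computed from the two reported rows above. The record layout and the function name are assumptions; the percentages are the ones quoted in these notes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_type(records: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """records: (implicature_type, gold_label, predicted_label) triples."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for kind, gold, pred in records:
        totals[kind] += 1
        hits[kind] += int(gold == pred)
    return {kind: hits[kind] / totals[kind] for kind in totals}


# Accuracies (%) as reported in the notes above.
reported = {
    "Cohere-52B": {"gen": 73.9, "par": 51.5},
    "InstructGPT-3-175B": {"gen": 79.3, "par": 59.7},
}

# Gap in percentage points between generalised and particularised examples.
gaps = {name: round(s["gen"] - s["par"], 1) for name, s in reported.items()}
# gaps == {"Cohere-52B": 22.4, "InstructGPT-3-175B": 19.6}
```

Both models lose roughly 20 points on particularised examples, which is what the overestimation point above rests on.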

Conclusion

There is no large difference between architectures; multitask fine-tuned models and OpenAI's InstructGPT perform significantly better (though still well below the human average).
- Scaling up and adding more in-context examples does not change this trend much.
- Implicature resolution seems to need instruction fine-tuning rather than plain task-specific fine-tuning.

Limitations

Current LLMs are expected to be weak at non-binary implicature resolution, but no method for evaluating this has been proposed yet.

Reference and Implementation:

paper: Large language models are not zero-shot communicators, Oct 2022

