[DPR] Dense Passage Retrieval for Open-Domain Question Answering

Problem statement

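Find a training scheme that achieves strong retrieval performance from only a small number of (question, passage) pairs.
Embed questions and passages into a low-dimensional, continuous space in which their similarity can be compared with an inner product.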

Baseline

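TF-IDF / BM25:
- The most feasible option in terms of speed, and it needs no training
- Matching is keyword-based, so passages that express the same meaning in different words (synonyms, paraphrases) are missed
ORQA:
- Dense retrieval that outperforms BM25
- ICT pretraining is computationally expensive
- The passage (context) encoder is not fine-tuned on question-answer pairs, so its representations may be sub-optimal.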

Data details

| name | abbr | type | source | task |
|---|---|---|---|---|
| Natural Questions | NQ | text | Wikipedia | open-domain question answering |
| TriviaQA | | text | Web (scraped) | question answering |
| WebQuestions | WQ | text | Freebase | question answering |
| CuratedTREC | TREC | text | Web | open-domain question answering |
| SQuAD v1.1 | | text | Wikipedia | question answering |

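Passage preprocessing
- Wikipedia
  - Only the text portion of the Wikipedia dump is kept, using DrQA's preprocessing
  - Articles are split into disjoint 100-word blocks, each treated as one passage, yielding 21,015,324 passages
- When no passage is mapped to a question (CuratedTREC, WebQuestions, TriviaQA) or the passage unit differs from the candidate pool (SQuAD, Natural Questions)
  - The highest-scoring BM25 passage that contains the answer is assigned as the positive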

Approach

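The similarity function has to be decomposable, like cosine or Mahalanobis distance, so that passage representations can be computed ahead of time. Among these, the simple inner product is used so that the effort goes into learning better encoders.
- $sim(q, p) = E_Q(q)^T E_P(p)$
- Training the encoders under dot-product similarity is a metric learning problem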
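A short worked relation may help show why the inner product is a reasonable representative of the decomposable family. Writing $u = E_Q(q)$ and $v = E_P(p)$ (shorthand introduced here, not notation from the paper):

$$
\cos(u, v) = \frac{u^T v}{\lVert u \rVert \, \lVert v \rVert}, \qquad \lVert u - v \rVert^2 = \lVert u \rVert^2 + \lVert v \rVert^2 - 2\, u^T v
$$

For unit-norm vectors, cosine equals the inner product and the squared L2 distance becomes $2 - 2\, u^T v$, so all three produce the same ranking. In every case $v$ depends only on the passage, which is what allows it to be precomputed.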
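Each question is trained with 1 positive passage + 1 negative picked by BM25 + $n-2$ Gold negatives taken from the same batch ($n$ passages per question in total).
- In metric learning, how the negative samples are chosen strongly affects the quality of the learned representation.
- Candidate negative sampling strategies
  - Random: any passage from the corpus
  - BM25: top BM25 passages that overlap most with the question tokens but do not contain the answer
  - Gold: positive passages paired with other questions in the training set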
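The in-batch negative scheme described above can be sketched as a softmax over a question-by-passage score matrix. This is a minimal NumPy illustration, not the authors' training code: the function name `in_batch_nll` and the convention that any extra rows in `p_emb` are hard negatives (for example BM25 negatives) are assumptions made for this example.

```python
import numpy as np

def in_batch_nll(q_emb: np.ndarray, p_emb: np.ndarray) -> float:
    """Mean negative log-likelihood of the positive passage.

    q_emb: (B, d) question embeddings.
    p_emb: (M, d) passage embeddings with M >= B. Row i (for i < B) is the
           positive for question i; every other row is treated as a negative.
           Rows B..M-1 are assumed to be extra hard negatives.
    """
    batch = q_emb.shape[0]
    scores = q_emb @ p_emb.T                              # (B, M) dot products
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    idx = np.arange(batch)
    return float(-log_probs[idx, idx].mean())             # positives on the diagonal
```

With a batch of B questions, each question gets B-1 negatives for free because the passage embeddings are shared across the batch, which is why larger batches give more negatives at little extra cost.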
The encoders are two BERT models, and inference uses FAISS (an open-source library for similarity search and clustering of dense vectors). ~Not a Siamese structure that shares encoder weights~
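At inference time the passage vectors are fixed, so retrieval reduces to maximum inner product search. The sketch below does this by brute force in NumPy to show what the index computes. It is a stand-in for FAISS, not its API, and `top_k_inner_product` is a hypothetical helper name.

```python
import numpy as np

def top_k_inner_product(q_vec: np.ndarray, passage_matrix: np.ndarray, k: int):
    """Return (indices, scores) of the k passages with the largest dot product.

    q_vec: (d,) question embedding.
    passage_matrix: (N, d) precomputed passage embeddings.
    """
    scores = passage_matrix @ q_vec                 # (N,) one score per passage
    k = min(k, scores.shape[0])
    top = np.argpartition(-scores, k - 1)[:k]       # unordered top-k candidates
    top = top[np.argsort(-scores[top])]             # sort them by score, descending
    return top, scores[top]

passages = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]])
ids, vals = top_k_inner_product(np.array([1.0, 0.0]), passages, k=2)
```

Brute force costs O(N·d) per query, which is the cost an index library is meant to reduce once N reaches tens of millions of passages.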
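Training
- Retrievers were trained both per dataset and on all datasets combined; the multi-dataset retriever is the one used.
- For the BM25 + DPR combination, the top 2000 passages and their scores are retrieved with BM25 and with DPR separately, then re-ranked by $BM25(q,p) + \lambda \cdot sim(q,p)$.
- When open-domain QA adopts a retrieve-and-refine structure, not only the recall of retrieval but also its precision affects the final E2E QA accuracy.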
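The BM25 + DPR re-ranking step above can be sketched as follows. Two things here are assumptions rather than details taken from these notes: a passage retrieved by only one system gets the `missing` score (0.0) for the other, and the default `lam=1.1` is the value I recall the paper reporting, so it should be checked before being relied on.

```python
def hybrid_rerank(bm25_scores: dict, dpr_scores: dict,
                  lam: float = 1.1, missing: float = 0.0):
    """Re-rank the union of two candidate sets by BM25(q,p) + lam * sim(q,p).

    bm25_scores, dpr_scores: passage id -> score for one question.
    missing: score assumed for a passage that one retriever did not return.
    """
    candidates = set(bm25_scores) | set(dpr_scores)
    combined = {
        pid: bm25_scores.get(pid, missing) + lam * dpr_scores.get(pid, missing)
        for pid in candidates
    }
    # Highest combined score first.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

ranked = hybrid_rerank({"a": 10.0, "b": 2.0}, {"b": 8.0, "c": 5.0})
```

Because BM25 and dot-product scores live on different scales, $\lambda$ effectively sets how much the dense score is allowed to override the sparse one.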
The re-ranker in the E2E QA system (Retriever + Re-ranker)
- The retriever returns a handful of passages out of millions of documents, and the re-ranker re-scores them by extracting answer spans from those passages. It uses a cross-encoder, which is slower than the dual-encoder structure but more accurate.
- $score = avg(P_{start,i}(s) \times P_{end,i}(t), P_{selected}(i))$
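The span scoring above can be illustrated with a small sketch. It uses $P_{selected}(i)$ only to pick the passage and $P_{start,i}(s) \times P_{end,i}(t)$ to pick the span inside it; how the two are merged into one final score is deliberately left out. The helper names and the `max_len` cap on span length are assumptions made for this example.

```python
import numpy as np

def best_span(p_start, p_end, max_len: int = 10):
    """Return (s, t, score) maximizing p_start[s] * p_end[t] with s <= t."""
    best = (0, 0, -1.0)
    n = len(p_start)
    for s in range(n):
        for t in range(s, min(s + max_len, n)):   # span is at most max_len tokens
            score = p_start[s] * p_end[t]
            if score > best[2]:
                best = (s, t, score)
    return best

def extract_answer(p_selected, starts, ends, max_len: int = 10):
    """Pick the most likely passage, then the best span inside it."""
    i = int(np.argmax(p_selected))
    s, t, score = best_span(starts[i], ends[i], max_len)
    return i, s, t, score

result = extract_answer(
    [0.2, 0.8],
    starts=[[0.9, 0.1, 0.0], [0.1, 0.7, 0.2]],
    ends=[[0.1, 0.9, 0.0], [0.1, 0.2, 0.7]],
)
```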

Evaluation

The gap between BM25 and DPR is larger when k is small => $DPR_{r@20} - BM25_{r@20} > DPR_{r@100} - BM25_{r@100}$
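For reference, the $r@k$ values compared here are top-k retrieval accuracy: the fraction of questions for which at least one of the top k retrieved passages contains an answer. The sketch below assumes plain case-insensitive substring matching, which is a simplification chosen for this example rather than the exact matching rule used in the paper.

```python
def top_k_accuracy(retrieved, answers, k: int) -> float:
    """Fraction of questions whose top-k passages contain any gold answer.

    retrieved: one list of passage strings per question, best first.
    answers:   one list of acceptable answer strings per question.
    """
    hits = 0
    for passages, golds in zip(retrieved, answers):
        top = passages[:k]
        if any(g.lower() in p.lower() for p in top for g in golds):
            hits += 1
    return hits / len(retrieved)

retrieved = [
    ["Paris is the capital of France.", "Berlin is in Germany."],
    ["The Nile is a river.", "Mount Everest is in Nepal."],
]
answers = [["Paris"], ["Everest"]]
```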
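With the multi-dataset retriever, TREC, the smallest dataset, gains the most, whereas NQ and WebQuestions improve only modestly.
Conjectures on why performance is low on SQuAD
- Annotators wrote the questions and answers while looking at the passage, so word overlap between question and passage is likely high.
- The data comes from only about 500 Wikipedia articles, so the distribution of the training data may be biased.
Unlike earlier dense retrieval, which needed large amounts of data, DPR surpasses BM25 with 1K examples and improves consistently as the number of examples grows.
In-batch negative sampling works better with Gold negatives and with larger batch sizes, and adding a BM25 passage as a hard negative improves performance further.
(As expected) in the qualitative analysis, BM25 is sensitive to keywords, while DPR captures lexical variations and semantic relationships better.
DPR generalizes well: without additional fine-tuning it still performs decently on other datasets (a 3-5 point drop)
- WebQuestions: 75.0(ft) vs. 69.9(DPR), TREC: 89.1(ft) vs. 86.3(DPR)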
E2E QA results: scored by exact match against the reference answer after minor normalization
- minor normalization: see "Latent retrieval for weakly supervised open domain question answering", In Association for Computational Linguistics
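Exact match after normalization can be sketched as below. The normalization steps (lowercasing, stripping punctuation, dropping English articles, collapsing whitespace) are the common SQuAD-style recipe and are assumed here; the cited paper may differ in detail.

```python
import re
import string

def normalize_answer(text: str) -> str:
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, references) -> bool:
    """True if the normalized prediction equals any normalized reference."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(ref) for ref in references)
```

For example, `exact_match("The Eiffel Tower!", ["eiffel tower"])` counts as a match because both sides normalize to the same string.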
(As expected) the higher the retriever accuracy, the better the final QA results
Multi-dataset training helps most when evaluating on low-volume datasets such as WQ and TREC
The number of passages $k$ returned by the retriever differs by dataset.

Ablation & Tricks

Comparison of decomposable similarity functions: dot product $\approx$ L2 $>$ cosine $\approx$ triplet

Limitations

Weaknesses of the proposed architecture

References & personal thoughts

In performance comparisons, the usual ordering is Cross-encoder > (hybrids such as Poly-encoder) > Bi-encoder.
This paper came out in 2020, yet dialogue retrieval work used only cross-encoders as late as 2021. Is that because a dialogue's input context consists of several utterances and attention between them and each candidate response matters, so a bi-encoder cannot learn enough?
Using dense rather than sparse vectors for retrieval, or using a Bi-encoder (a.k.a. Dual-encoder or 2-Tower model), does not seem novel in itself. The contribution is probably that the paper combines these into a framework for open-domain QA and shows that, contrary to the common belief that dense vector representations need large amounts of data, training on just 1000 examples already beats the previous BM25-based approach.
