CEDR: Contextualized Embeddings for Document Ranking

Abstract

The goal: a better representation of the input fed to ad-hoc document ranking models!!
- Previously this was usually w2v (word2vec)
HOW > the now-popular BERT or ELMo
- w2v > BERT...
Applying these embedding techniques to existing neural ranking architectures achieves SOTA.
- As of 2019-12-18: SOTA

METHODOLOGY

Notation

Q, D > a query and a document, composed of query terms and document terms respectively
similarity matrix
- This means that DRMM, KNRM, etc. are used as-is
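
As a minimal sketch of the similarity matrix above, assuming cosine similarity between term embeddings: the toy vectors and the function names `cosine` and `similarity_matrix` are made up for illustration and are not taken from the paper's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors (an assumption of this sketch)
    return dot / (norm_u * norm_v)

def similarity_matrix(query_vecs, doc_vecs):
    """Return a |Q| x |D| matrix; cell [i][j] compares query term i with document term j."""
    return [[cosine(q, d) for d in doc_vecs] for q in query_vecs]

# Toy 2-dimensional embeddings (hypothetical values).
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_vecs = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
S = similarity_matrix(query_vecs, doc_vecs)
```

Rankers such as DRMM or KNRM consume a matrix of this shape, which is why they can stay unchanged when only the embeddings are swapped.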

Contextualized similarity tensors

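This is the input representation mentioned above: instead of the usual w2v, a "context sensitive" embedding technique is used.
That means BERT & ELMo. *Earlier methods such as w2v or GloVe produce a single representation per word.
For example, the word "bank" in "bank deposit" and "river bank" should be represented differently, but the earlier methods give it the same vector.
contextualized language model
- Built from representations at multiple stacked layers (= recurrent or transformer outputs)
With this idea, the similarity between Q & D is computed at every layer, giving a similarity tensor with one |Q| x |D| matrix per layer
- L is the number of layers in the model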
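
A minimal sketch of a per-layer similarity tensor, assuming cosine similarity and one vector per term per layer. The names `similarity_tensor`, `query_layers`, and `doc_layers` are hypothetical, and the toy vectors stand in for real ELMo or BERT layer outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors (an assumption of this sketch)
    return dot / (norm_u * norm_v)

def similarity_tensor(query_layers, doc_layers):
    """query_layers[l][i] is the layer-l vector of query term i (same layout for doc_layers).

    Returns S with S[l][i][j] = cosine(query_layers[l][i], doc_layers[l][j]),
    i.e. one |Q| x |D| similarity matrix per layer.
    """
    if len(query_layers) != len(doc_layers):
        raise ValueError("query and document must have the same number of layers")
    return [
        [[cosine(q, d) for d in d_vecs] for q in q_vecs]
        for q_vecs, d_vecs in zip(query_layers, doc_layers)
    ]

# Toy input: L = 2 layers, |Q| = 1 query term, |D| = 2 document terms.
query_layers = [[[1.0, 0.0]], [[0.0, 1.0]]]
doc_layers = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
S = similarity_tensor(query_layers, doc_layers)
```

Each `S[l]` has the same shape as a classic similarity matrix, so the layers can be treated like channels by the downstream ranker.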

Joint BERT approach

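Unlike ELMo, BERT can encode multiple text segments (e.g. several sentences) together.
- Can it make judgments about text pairs? > Yes!!
- This works through two meta-tokens, [SEP] & [CLS], and the text segment embeddings (Segment A & Segment B) > see the figure below
- [SEP] separates the segments
- [CLS] is used to make judgments about the text pair (classification??) > e.g. deciding whether the two sentences are sequential (or whether the pair expresses the same meaning??)
- Because of these properties, a fine-tuned BERT can bring some extra benefit to neural ranking models.
- I don't fully follow this part... > the [CLS] vector is included
The w2v vectors are replaced with BERT-based vectors, and using a fine-tuned BERT here should improve results further (since the contextualized property becomes stronger).
- The explanation is somewhat convoluted and doesn't quite click yet; needs a closer read.
To be more precise (after looking at the source code): update-2020-01-15
- In practice, fine-tuning is done with an algorithm like BERT-IR (example: link)
- Given a (query, doc) pair as input, BERT-IR outputs the vectors for [CLS][query][SEP][doc][SEP].
- From these vectors, only the [query] & [doc] parts are extracted and fed to the existing neural ranking algorithms (DRMM, KNRM), which are then trained again!!!
  - Looked at closely, a (query, doc) pair must always be provided. This is my concern!!! (Doesn't that make this form rather impractical??!!)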
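
The [CLS][query][SEP][doc][SEP] layout can be sketched as follows. This is only an illustration of the input format and the slicing step; `build_joint_input` and `split_outputs` are hypothetical helper names, whitespace tokens stand in for WordPiece tokens, and dummy vectors stand in for real BERT outputs.

```python
def build_joint_input(query_tokens, doc_tokens):
    """Build [CLS] query [SEP] doc [SEP] plus segment ids (0 = Segment A, 1 = Segment B)."""
    tokens = ["[CLS]"] + query_tokens + ["[SEP]"] + doc_tokens + ["[SEP]"]
    # [CLS], the query and the first [SEP] are Segment A; the doc and final [SEP] are Segment B.
    segment_ids = [0] * (len(query_tokens) + 2) + [1] * (len(doc_tokens) + 1)
    query_span = (1, 1 + len(query_tokens))          # half-open [start, end)
    doc_start = len(query_tokens) + 2
    doc_span = (doc_start, doc_start + len(doc_tokens))
    return tokens, segment_ids, query_span, doc_span

def split_outputs(vectors, query_span, doc_span):
    """Slice per-token output vectors into the [CLS], query and doc parts."""
    cls_vec = vectors[0]
    query_vecs = vectors[query_span[0]:query_span[1]]
    doc_vecs = vectors[doc_span[0]:doc_span[1]]
    return cls_vec, query_vecs, doc_vecs

query_tokens = ["river", "bank"]
doc_tokens = ["the", "bank", "of", "the", "river"]
tokens, segment_ids, query_span, doc_span = build_joint_input(query_tokens, doc_tokens)

# Dummy one-dimensional "outputs", one per input token (placeholder for a real encoder).
vectors = [[float(i)] for i in range(len(tokens))]
cls_vec, query_vecs, doc_vecs = split_outputs(vectors, query_span, doc_span)
```

Because the query and document are encoded in one pass, the document vectors depend on the query, which is why a (query, doc) pair is needed for every scoring call.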

EXPERIMENT

Datasets
- TREC Robust 2004 and WebTrack 2012–14
Models
- PACRR, KNRM, DRMM are used > since the paper says nothing special about them, the existing algorithms seem to be used as-is, with only the input produced by BERT > this is really the main point!!
Contextualized language models
- pretrained
  - ELMo (Original, 5.5B)
  - BERT (BERT-Base, Uncased) language models
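    - we encode the query (Segment A) and document (Segment B) simultaneously. This sentence seems to be the key. > In fact, I had thought of this idea myself while reading the earlier sentences..^^;
      - It seems the model is fine-tuned this way and then applied to the ranking model.
    - If training on (query, doc) pairs is possible, I wonder whether BERT could be used on its own, without the existing ranking model..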
    - limited to 512 tokens
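
One simple way to respect the 512-token limit when query and document share one input is to truncate the document. The sketch below assumes plain truncation and three special tokens ([CLS] and two [SEP]); `truncate_doc` is a hypothetical helper, and other strategies for long documents (for example splitting them into chunks) are possible.

```python
MAX_LEN = 512      # input limit noted above
NUM_SPECIAL = 3    # [CLS], [SEP], [SEP]

def truncate_doc(query_tokens, doc_tokens, max_len=MAX_LEN):
    """Keep as many leading document tokens as fit next to the query."""
    budget = max_len - NUM_SPECIAL - len(query_tokens)
    if budget < 0:
        raise ValueError("query alone exceeds the input limit")
    return doc_tokens[:budget]

query_tokens = ["q"] * 10
long_doc = ["d"] * 1000
kept = truncate_doc(query_tokens, long_doc)
total_len = len(query_tokens) + len(kept) + NUM_SPECIAL
```

With a 10-token query, the budget left for the document is 512 - 3 - 10 = 499 tokens.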
Experimental results
