facebookresearch / DPR

Dense Passage Retriever is a set of tools and models for the open-domain Q&A task.

How do the passage embeddings use the 'title' of the passage #224

Open jigsaw2212 opened 2 years ago

jigsaw2212 commented 2 years ago

Hi, I want to better understand how the 'title' of the passage is used by the codebase when generating the passage embeddings.

xhluca commented 2 years ago

You can see how it's ingested here:

https://github.com/facebookresearch/DPR/blob/d9f3e41bb0087687fa182a4d580711188fd82df9/dpr/models/hf_models.py#L293-L300

Hugging Face tokenizers accept a pair of sequences (e.g. for question answering, NLI, etc.) and typically join them with a separator token, i.e. `{text} [SEP] {text_pair}`. In this case `text=title` and `text_pair=passage`, so the encoded input should look like `{title} [SEP] {passage}`, although the exact special tokens ultimately depend on how the tokenizer implements pair encoding.
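
As a rough illustration (this is not the DPR code itself, and the model name and example strings are just placeholders), a standard BERT tokenizer joins the two sequences like this when given a title/passage pair:

```python
# Minimal sketch: how a Hugging Face tokenizer combines a title and a
# passage when they are passed as (text, text_pair).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

title = "Aaron"                                   # placeholder title
passage = "Aaron is the elder brother of Moses."  # placeholder passage text

# Passing text and text_pair yields: [CLS] {title} [SEP] {passage} [SEP]
encoded = tokenizer(text=title, text_pair=passage)
print(tokenizer.decode(encoded["input_ids"]))
# -> "[CLS] aaron [SEP] aaron is the elder brother of moses. [SEP]"
```

So the title is not embedded separately; it is simply prepended to the passage text (separated by the tokenizer's pair separator) before the whole sequence is fed to the passage encoder.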