fani-lab / RePair

Extensible and Configurable Toolkit for Query Refinement Gold Standard Generation Using Transformers

2022 - SIGIR - Another Look at Information Retrieval as Statistical Translation
 #28

Closed yogeswarl closed 1 year ago

yogeswarl commented 1 year ago

Title: Another Look at Information Retrieval as Statistical Translation
 Year: 2022
 Venue: SIGIR (Reproducibility paper track)
 Link to paper

Main problem:
The authors argue that many retrieval problems can be solved given a large amount of training data. When the noisy channel model was introduced over two decades ago, in 1999, it was trained on synthetic data.


Output: 
 They successfully reproduce IRST (Information Retrieval as Statistical Translation) as a reranker over a first-stage BM25 ranking.


Contribution and Motivation:
 The motivation of this paper is to show that models proposed decades ago can be highly effective if trained on a larger dataset.

The contributions:
1. IRST (sum of translation probabilities) reranking after BM25 first-stage retrieval.
2. IRST (MaxSim instead of the sum of translation probabilities) reranking after BM25 first-stage retrieval.
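To make the difference between the two contributions concrete, here is a minimal sketch (my own illustration, not the paper's implementation) of the two aggregation strategies over a toy IBM Model 1 translation-probability table. The table values, function names, and tokens are all invented for the example:

```python
# Toy translation table p(query_term | doc_term), of the kind IBM Model 1 learns.
# All probabilities here are made up for illustration.
P_TRANS = {
    ("car", "automobile"): 0.6,
    ("car", "vehicle"): 0.3,
    ("repair", "fix"): 0.7,
}

def irst_sum(query, doc):
    """Variant 1: sum translation probabilities over all (query, doc) term pairs."""
    return sum(P_TRANS.get((q, d), 0.0) for q in query for d in doc)

def irst_maxsim(query, doc):
    """Variant 2: for each query term, keep only its best-translating doc term."""
    return sum(max((P_TRANS.get((q, d), 0.0) for d in doc), default=0.0)
               for q in query)

query = ["car", "repair"]
doc = ["automobile", "vehicle", "fix"]
print(irst_sum(query, doc))     # 0.6 + 0.3 + 0.7 = 1.6
print(irst_maxsim(query, doc))  # 0.6 + 0.7 = 1.3
```

Either score would then be used to rerank the candidate documents returned by the BM25 first stage.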

Proposed method: 
The 1999 paper by Berger and Lafferty proposed using statistical translation for information retrieval. The authors of this paper reproduce that method.
How?

  1. Noisy channel model: The noisy channel model is a method of ad hoc retrieval proposed by Berger and Lafferty in their paper "Information Retrieval as Statistical Translation." It draws an analogy between machine translation and information retrieval, using IBM Model 1 to learn translation probabilities that relate query terms and document terms based on a hidden alignment between the words of the query and the words of the document. The model is based on the concept of a noisy channel: a document is viewed as passing through a noisy channel that corrupts it into the user's query, and retrieval amounts to recovering the document most likely to have generated that query.
  2. MaxSim of ColBERT: The MaxSim operator is part of ColBERT's late-interaction architecture for scoring the similarity between two pieces of text. Given a query and a candidate document, each represented as a bag of token embeddings (numerical representations of the text), MaxSim takes, for each query token, the maximum similarity between that token's embedding and any of the document's token embeddings, then sums these maxima to produce the document's score.
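The MaxSim step described above can be sketched as follows. This is a toy illustration with hand-made three-dimensional "embeddings"; in ColBERT the embeddings come from BERT, and the token vectors, names, and values below are all assumptions for the example:

```python
import numpy as np

# Toy token embeddings standing in for ColBERT's BERT-derived ones (invented values).
EMB = {
    "auto":   np.array([0.9, 0.1, 0.0]),
    "fix":    np.array([0.0, 0.9, 0.1]),
    "car":    np.array([1.0, 0.0, 0.0]),
    "repair": np.array([0.0, 1.0, 0.0]),
    "tax":    np.array([0.0, 0.0, 1.0]),
}

def maxsim_score(query_tokens, doc_tokens):
    """For each query token, take its best-matching doc token's similarity,
    then sum those per-token maxima over the whole query."""
    Q = np.stack([EMB[t] for t in query_tokens])
    D = np.stack([EMB[t] for t in doc_tokens])
    sims = Q @ D.T                      # pairwise dot-product similarities
    return float(sims.max(axis=1).sum())  # max over doc tokens, sum over query

# A document about car repair outscores one about tax for the query "auto fix".
print(maxsim_score(["auto", "fix"], ["car", "repair"]))  # 0.9 + 0.9 = 1.8
print(maxsim_score(["auto", "fix"], ["tax", "repair"]))  # 0.1 + 0.9 = 1.0
```

The key design point is that no query token is forced to match a single "best" document globally; each query token independently picks its strongest document-token match.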

Gaps in the work: 
The paper is reproducible, and the authors identify no gaps. In fact, while neural models (particularly pretrained transformers) have indeed led to great advances in retrieval effectiveness, the IRST model proposed decades ago is quite effective when provided with sufficient training data.

Code:
Pyserini implementation: https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-irst.md