ECIR2021.SQE-GAN: A Supervised Query Expansion Scheme via GAN

Why did I choose this paper? Because it applies a GAN to query expansion, an IR task that is closely related to my research.

Main problem:

This paper aims to make existing query expansion methods faster and more accurate. Query Expansion (QE) adds new terms to a user's input query to make it more precise and to better satisfy the user's information need.

Existing work:

Existing work on QE can be divided into two categories:

Unsupervised Query Expansion (UQE)
- Example: Many classical algorithms
- Probability models
- Relevance-based language models
- Disadvantage: the expanded terms can be noisy or even harmful
Supervised Query Expansion (SQE)
- Example: state-of-the-art in the QE literature
- Random walk
- Term dependency-based approach
- Boosting approach
- Learning-based approaches
- Disadvantage: higher response time (because of the feature extraction phase)

Inputs:

A set of original queries {q1, ..., qn}
A set of expanded terms {t1, ..., tM}

Outputs:

The top k terms most relevant to query q, selected from the candidate expanded terms.
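
The input/output contract above can be sketched in a few lines. This is only an illustration of "score every candidate, keep the best k"; the function name `top_k_terms`, the `score` callable, and the toy scores are my own assumptions and do not come from the paper.

```python
from typing import Callable, Dict, List


def top_k_terms(query: str, candidates: List[str],
                score: Callable[[str, str], float], k: int) -> List[str]:
    """Return the k candidate terms with the highest score for the query."""
    ranked = sorted(candidates, key=lambda term: score(query, term), reverse=True)
    return ranked[:k]


# Made-up scores standing in for a learned relevance model (assumption).
toy_scores: Dict[str, float] = {"airline": 0.9, "flight": 0.7, "banana": 0.1}

selected = top_k_terms("airport security", list(toy_scores),
                       lambda q, t: toy_scores[t], k=2)
print(selected)  # ['airline', 'flight']
```

In SQE-GAN the score would come from the trained model rather than a lookup table.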

Method:

Idea:

Response time: Word embeddings can replace the feature extraction phase, which is the time-consuming part of SQE
Performance: Deep learning can encode the correlation between an arbitrary pair of query and expanded term

Steps:

Use UQE to obtain the candidate expanded terms
Transform the terms into vectors with a word embedding technique
Train a GAN in which the generative and discriminative models iteratively optimize each other
The generator re-ranks the expanded terms
The discriminator calculates a score for the new ranking
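
To make the generator and discriminator roles in the steps above concrete, here is a minimal forward-pass sketch in plain Python. Everything in it is an assumption made for illustration: the softmax-over-dot-products generator, the linear-plus-sigmoid discriminator, the 2-dimensional toy vectors, and all function names. These notes do not record the paper's actual architectures or losses, and the adversarial training loop is omitted.

```python
import math
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors: List[Vector]) -> Vector:
    """Represent a query as the average of its word vectors (assumption)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def generator_probs(query_vec: Vector, term_vecs: Dict[str, Vector]) -> Dict[str, float]:
    """Softmax over query-term dot products: a distribution over candidates."""
    logits = {t: dot(query_vec, v) for t, v in term_vecs.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {t: math.exp(s - top) for t, s in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


def discriminator_score(query_vec: Vector, term_vec: Vector,
                        weights: Vector, bias: float) -> float:
    """Sigmoid of a linear function of the element-wise product (assumption)."""
    features = [q * t for q, t in zip(query_vec, term_vec)]
    return 1.0 / (1.0 + math.exp(-(dot(weights, features) + bias)))


# Toy 2-dimensional embeddings (made up; the paper uses d=100 Word2Vec vectors).
query_vec = mean_vector([[1.0, 0.0], [0.8, 0.2]])
term_vecs = {"airline": [1.0, 0.1], "banana": [-0.5, 1.0]}

probs = generator_probs(query_vec, term_vecs)
ranking = sorted(probs, key=probs.get, reverse=True)  # generator's re-ranking
rewards = {t: discriminator_score(query_vec, term_vecs[t], [1.0, 1.0], 0.0)
           for t in ranking}                          # discriminator's feedback
print(ranking, rewards)
```

In a full GAN setup the discriminator's scores would be used as a training signal for the generator, and the discriminator would in turn be trained to separate truly useful terms from generated ones; neither update is shown here.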

Experimental Setup:

Dataset:

TREC Robust 2004
- 528,000 high-quality documents
- 250 queries for experiments

Preprocessing:

Stemming: Porter
Stopword removal: standard InQuery stopword list
Word embedding: Word2Vec’s Continuous Bag-of-Words (CBOW) (d=100)
Initial expanded terms: 100 terms generated by UQE
Final selected terms: 20 terms (of 100 initial terms)
Data split: 40% train, 10% validation, 50% test
Basic retrieval model: TF-IDF
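
As a quick sanity check on the 40/10/50 split, the sketch below partitions 250 query IDs into the three sets. The random shuffle, the seed, and the function name `split_queries` are assumptions; the notes do not say how the queries were actually assigned to each set.

```python
import random
from typing import List, Sequence, Tuple


def split_queries(query_ids: Sequence[int],
                  seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle query IDs and split them 40% / 10% / 50%."""
    ids = list(query_ids)
    random.Random(seed).shuffle(ids)   # seeded for reproducibility
    n_train = len(ids) * 40 // 100     # integer arithmetic avoids float rounding
    n_val = len(ids) * 10 // 100
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]       # the remaining 50%
    return train, val, test


train, val, test = split_queries(range(1, 251))
print(len(train), len(val), len(test))  # 100 25 125
```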

Metrics:

MAP (Mean Average Precision)
Precision@k (k = 5 and k = 10)
NDCG
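
For reference, the three metrics can be written compactly using their standard textbook definitions. The helper names and the toy ranking are my own, and the NDCG variant here computes the ideal ordering from the gains of the ranked list itself, which is a simplification that may differ from the evaluation tool used in the paper.

```python
import math
from typing import List, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """MAP: average precision averaged over all queries."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(gains: Sequence[float], k: int) -> float:
    """NDCG@k for graded gains listed in ranked order."""
    def dcg(values: Sequence[float]) -> float:
        return sum(g / math.log2(i + 1) for i, g in enumerate(values[:k], start=1))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(precision_at_k(ranked, relevant, 2))            # 0.5
print(round(average_precision(ranked, relevant), 3))  # 0.833
print(round(ndcg_at_k([1, 0, 1, 0], 4), 3))           # 0.92
```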

Baselines:

SQE-TFS, the state-of-the-art SQE scheme (compared on response time)
traditional UQE, KL divergence E2
UQE (compared on retrieval effect)
RankSVM (compared on retrieval effect)
Deep NN (compared on retrieval effect)
Sequence to sequence learning (compared on retrieval effect)
BiLSTM
Query-to-Term Attention

Results:

The paper's main contribution is a fast, high-performance model for the QE problem. The results show that the largest gain is in response time (a 37% improvement), obtained by replacing the feature extraction phase with a word-embedding module. In addition, SQE-GAN improves retrieval results compared with the latest deep learning-based QE solutions.

Code:

The code of this paper is unavailable.

Presentation:

There is no available presentation for this paper.
