RedSmallPanda / SSQR

Resources for "Self-Supervised Query Reformulation for Code Search"
Self-Supervised Query Reformulation for Code Search

This repository hosts the resources for "Self-Supervised Query Reformulation for Code Search". It is divided into two parts:

Our SSQR Approach

We propose SSQR, a self-supervised query reformulation method that does not rely on any parallel query corpus. Inspired by pre-trained models, SSQR treats query reformulation as a masked language modeling task over a large-scale unlabelled corpus of queries. SSQR extends T5 (a sequence-to-sequence model based on Transformer) with a new pre-training objective named corrupted query completion (CQC), which randomly masks words from a complete query and asks T5 to predict the masked content. Then, for a given query to be reformulated, SSQR enumerates candidate positions to be expanded and employs the pre-trained T5 model to generate the content to fill the spans. Finally, SSQR selects expansions that have the most information gain.
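The two data-side steps described above, corrupting a complete query for CQC pre-training and enumerating candidate expansion positions at inference time, can be sketched in plain Python. This is only an illustrative sketch, not the code from this repository: the function names `corrupt_query` and `enumerate_candidates`, the single-span masking scheme, and the use of the T5-style sentinel `<extra_id_0>` as the mask are all assumptions. The T5 model itself and the information-gain selection step are not shown.

```python
import random

# Assumption: a T5-style sentinel token marks the masked or to-be-expanded span.
MASK = "<extra_id_0>"


def corrupt_query(query, span_len=1, rng=None):
    """Mask one random contiguous span of words in a complete query.

    Returns (corrupted_query, target), where target is the masked text
    the model would be trained to predict.
    """
    rng = rng or random.Random()
    words = query.split()
    if len(words) <= span_len:
        raise ValueError("query is too short to mask a span of this length")
    start = rng.randrange(0, len(words) - span_len + 1)
    target = " ".join(words[start:start + span_len])
    corrupted = words[:start] + [MASK] + words[start + span_len:]
    return " ".join(corrupted), target


def enumerate_candidates(query):
    """Insert the mask at every gap in the query, including both ends.

    Each returned string is one candidate input whose mask a pre-trained
    model could fill with an expansion.
    """
    words = query.split()
    return [
        " ".join(words[:i] + [MASK] + words[i:])
        for i in range(len(words) + 1)
    ]


if __name__ == "__main__":
    print(corrupt_query("read a text file line by line", rng=random.Random(0)))
    for candidate in enumerate_candidates("read file"):
        print(candidate)
```

In this sketch, a pre-training pair is the corrupted query as input and the masked words as the target. At inference time each candidate string would be passed to the pre-trained model, and the generated expansions would then be ranked by information gain.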

Results

We pre-train T5 using code comments from the large-scale CODEnn dataset and run code search experiments on the code search dataset of CodeXGLUE. Our evaluation shows that SSQR significantly outperforms unsupervised baselines and achieves performance competitive with supervised methods.