Paper Review: On the importance of building high-quality training datasets for neural code search

Publisher

ICSE'22

Link to The Paper

https://dl.acm.org/doi/abs/10.1145/3510003.3510160

Name of The Authors

Zhensu Sun, Li Li, Yan Liu, Xiaoning Du, and Li Li

Year of Publication

2022

Summary

This paper addresses the importance of high-quality training datasets for neural code search models. The authors find that many widely used code search datasets, which are often constructed from code comments and code snippet pairs, contain significant noise and unnatural language in the queries that deviate from real user queries. Over one-third of the queries in a popular GitHub dataset called CodeSearchNet contain noisy features rarely seen in natural queries. This degrades the performance of code search models trained on such data when applied to real-world queries.

To improve dataset quality, the authors propose a two-stage data cleaning framework. First, a rule-based syntactic filter removes queries with invalid features like HTML tags, URLs, non-English text, etc., based on predefined rules. Second, a model-based semantic filter refines the dataset using a Variational Autoencoder (VAE) trained on a small bootstrap corpus of known high-quality natural language queries. The VAE helps identify and retain only semantically valid queries. Experiments show that code search models trained on the filtered datasets have significantly improved performance.

3 RQs: Technique Effectiveness, ablation study on rule-based and semantic filters, and validity/effectiveness of EM-GMM to determine the cut-off of data retention.

Contributions of The Paper

A two-step data cleaning framework for code search datasets, which bridges the gap between code comments and natural user queries, both syntactically and semantically.
Implementation of the framework as a Python library (NLQF) for the code search task in academia and industry.
A comprehensive evaluation of our framework’s effectiveness, which demonstrates significant model performance improvement on three manually annotated validation benchmarks.
The first systematically distilled Github dataset (COFIC) for neural code search, containing over one million comment-code pairs.

Comments

The poor quality of comments in CodeSearchNet provides strong motivation for our proposed fault localization tool.
Using a VAE model may help capture the code data's distribution and identify code snippets that are semantically or syntactically different from others.
Implemented a Bidirectional GRU and an ELBO loss function (cross-entropy loss + KL divergence) for the VAE.
Using EM-GMM to retain data points based on the ELBO loss is an interesting alternative to manually setting the percentage of data to retain.
We reduced the search space of code search manually while maintaining a generalizable approach (Pg. 6).
Advantage demonstrated: Performance still improved even when using only 50% of the data for training and 50% less training time.
The real-world applicability is shown by building a library around the method and releasing a high-quality dataset.

RAISEDAL / RAISEReadingList