CoCoSoDa: Effective Contrastive Learning for Code Search

Our approach adopts the pre-trained model as the base code/query encoder and optimizes it using multimodal contrastive learning and soft data augmentation.

CoCoSoDa consists of the following four components:

The pre-trained code/query encoder captures the semantic information of a code snippet or a natural language query and maps it into a high-dimensional embedding space.
The momentum code/query encoder encodes the samples (code snippets or queries) of the current and previous mini-batches to enrich the negative samples.
Soft data augmentation dynamically masks or replaces some tokens in a sample (code or query) to generate a similar sample.
The multimodal contrastive learning loss is the optimization objective and consists of an inter-modal and an intra-modal contrastive loss. Together they minimize the distance between the representations of similar samples and maximize the distance between the representations of different samples in the embedding space.
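
The components above can be illustrated with a toy, dependency-free sketch. It is not the code in this repository: the function names, the "<mask>" token, the 0.15 mask rate, the 0.999 momentum coefficient and the 0.07 temperature are all assumptions chosen for illustration, and plain Python lists stand in for tensors.

```python
import math
import random

MASK = "<mask>"  # hypothetical mask token, for illustration only


def soft_augment(tokens, rate=0.15, vocab=None, rng=random):
    """Return a copy of `tokens` in which each token is, with probability
    `rate`, masked or (if `vocab` is given) replaced by a random token.
    The positions are redrawn on every call, so the masking is dynamic."""
    out = []
    for tok in tokens:
        if rng.random() < rate:
            out.append(rng.choice(vocab) if vocab and rng.random() < 0.5 else MASK)
        else:
            out.append(tok)
    return out


def momentum_update(online, momentum, m=0.999):
    """Exponential moving average: momentum <- m * momentum + (1 - m) * online."""
    return [m * k + (1.0 - m) * q for q, k in zip(online, momentum)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def info_nce(anchor, positive, negatives, temperature=0.07):
    """InfoNCE-style loss for one anchor: small when the anchor is close to
    its positive and far from every negative."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[0]


# Toy 2-d "embeddings": a query, its paired code, and two unrelated samples.
query, code = [1.0, 0.0], [0.9, 0.1]
others = [[0.0, 1.0], [-1.0, 0.2]]
loss_matched = info_nce(query, code, others)
loss_mismatched = info_nce(query, others[0], [code, others[1]])
```

In this sketch an inter-modal term would use a query as the anchor and its paired code as the positive, while an intra-modal term would use a sample and its soft_augment view; in both cases the negatives would come from the vectors produced by the momentum encoder.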

Source code

Environment

conda create -n CoCoSoDa python=3.6 -y
conda activate CoCoSoDa
pip install torch==1.10 transformers==4.12.5 seaborn==0.11.2 fast-histogram nltk==3.6.5 networkx==2.5.1 tree_sitter tqdm prettytable gdown more-itertools tensorboardX scikit-learn

Data

cd dataset
bash get_data.sh

Data statistics are shown in the table below.

PL	Training	Validation	Test	Candidate Codes
Ruby	24,927	1,400	1,261	4,360
JavaScript	58,025	3,885	3,291	13,981
Java	164,923	5,183	10,955	40,347
Go	167,288	7,325	8,122	28,120
PHP	241,241	12,982	14,014	52,660
Python	251,820	13,914	14,918	43,827

Downloading the data takes about 10 minutes.

Training and Evaluation

We have uploaded the pre-trained model to Hugging Face. You can download DeepSoftwareAnalytics/CoCoSoDa directly and fine-tune it.
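
As a rough sketch of using the uploaded checkpoint outside the provided scripts, the snippet below loads it with the transformers Auto classes and embeds text. It assumes that the checkpoint is loadable through AutoTokenizer and AutoModel, and that mean pooling over non-padding tokens followed by L2 normalization is a suitable way to get one vector per input. The helper names load_encoder and embed are hypothetical and are not part of this repository.

```python
MODEL_ID = "DeepSoftwareAnalytics/CoCoSoDa"  # model name as given above


def load_encoder(model_id=MODEL_ID):
    """Download the tokenizer and encoder weights from the Hugging Face Hub."""
    # Imported lazily so that importing this file does not require transformers.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def embed(texts, tokenizer, model, max_length=128):
    """Encode a list of strings into L2-normalized vectors.
    Mean pooling is an assumption of this sketch."""
    import torch

    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state           # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()    # (batch, seq, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # average over real tokens
    return torch.nn.functional.normalize(pooled, dim=-1)
```

Calling tokenizer, model = load_encoder() and then embed([query, code], tokenizer, model) returns one unit-length vector per input, so the dot product of a query vector and a code vector is their cosine similarity.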

Pre-training (Optional)

bash run_cocosoda.sh $lang

The optimized model is saved in ./saved_models/cocosoda/. You can upload it to Hugging Face.

Pre-training takes about 3 days.

Fine-tuning

lang=java
bash run_fine_tune.sh $lang

Zero-shot running

lang=python
bash run_zero-shot.sh $lang

Results

The Model Evaluated with MRR

Model	Ruby	JavaScript	Go	Python	Java	PHP	Avg.
CoCoSoDa	0.818	0.764	0.921	0.757	0.763	0.703	0.788
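
For readers unfamiliar with the metric, MRR (mean reciprocal rank) averages 1/rank of the correct code snippet over all queries. The sketch below uses made-up similarity scores, and the helper names are hypothetical rather than taken from the evaluation scripts. It also checks that the Avg. column is the unweighted mean of the six per-language scores.

```python
def rank_of_match(scores, correct_index):
    """1-based rank of the correct candidate when candidates are sorted by
    descending score (ties are resolved in favour of the correct one)."""
    target = scores[correct_index]
    return 1 + sum(1 for s in scores if s > target)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Made-up similarity scores: one row per query, one column per candidate code.
scores = [
    [0.9, 0.1, 0.3, 0.2],  # correct candidate 0 is ranked 1st
    [0.2, 0.4, 0.8, 0.1],  # correct candidate 1 is ranked 2nd
    [0.3, 0.8, 0.6, 0.7],  # correct candidate 2 is ranked 3rd
]
correct = [0, 1, 2]
ranks = [rank_of_match(row, idx) for row, idx in zip(scores, correct)]
mrr = mean_reciprocal_rank(ranks)  # (1 + 1/2 + 1/3) / 3

# Per-language MRR values from the table above.
per_language = {"Ruby": 0.818, "JavaScript": 0.764, "Go": 0.921,
                "Python": 0.757, "Java": 0.763, "PHP": 0.703}
average = round(sum(per_language.values()) / len(per_language), 3)
```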

Appendix

The description of the baselines, additional experimental results, and discussion are in Appendix/Appendix.pdf.

Contact

Feel free to contact Ensheng Shi (enshengshi@qq.com) if you have any further questions or if a GitHub issue receives no response for more than 1 day.
