The data in the CoSQA dataset

skye95git commented 3 years ago

Hi, I have a few questions about the CoSQA dataset and how to use it:

1. Table 4 in the paper shows that there are 20604 queries and 6276 codes. Why do the numbers of codes and queries differ? Is it because one code can answer multiple queries?

2. The paper says "We fix a code database with 6,267 different codes in CoSQA." How should I understand this? Do you just mean that all 6267 codes are different?

3. The "CoCLR on Code Search" section says "Step 1: download the checkpoint trained on CodeSearchNet." Does this checkpoint belong to CodeBERT or CoCLR?

4. The "Model Checkpoint" section says "You can also use the data in CodeXGLUE code search (WebQueryTest) to train the models by your self." Which models does this refer to? In CodeXGLUE/Text-Code/NL-code-search-WebQuery/data/ I only see test_webquery.json, which is a test set. How can I use it to train a model?

skye95git commented 3 years ago

Regarding question 3, I found this description in the paper: "We initialize CoCLR with microsoft/codebert-base repretrained on CodeSearchNet Python Corpus." How did you re-pretrain CodeBERT? I could only find the fine-tuning method.

Jun-jie-Huang commented 3 years ago

1. One code may be paired with 2 or more queries. We discuss this issue in the data collection part of our paper.
2. They are indeed different. That means we use all the codes in our dataset as the codebase for retrieval.
3. It belongs to our method. To experiment, you can change the fine-tuning command by replacing the training and evaluation data.
4. You can process train.txt and valid.txt to train your model.
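To make points 1 and 2 concrete, here is a minimal sketch with made-up toy pairs (not real CoSQA data). It shows how several queries can map to one code, so the query count exceeds the code count, and how the retrieval codebase is simply the set of distinct codes. All variable names are hypothetical.

```python
from collections import defaultdict

# Toy (query, code) pairs; the contents are invented for illustration only.
pairs = [
    ("python check if file exists", "def exists(p): return os.path.exists(p)"),
    ("how to test whether a path exists", "def exists(p): return os.path.exists(p)"),
    ("reverse a list in python", "def rev(xs): return xs[::-1]"),
]

# Group queries by the code that answers them.
queries_by_code = defaultdict(list)
for query, code in pairs:
    queries_by_code[code].append(query)

num_queries = len({query for query, _ in pairs})
num_codes = len(queries_by_code)

# The retrieval codebase is every distinct code in the dataset.
codebase = list(queries_by_code)

print(num_queries, "queries,", num_codes, "codes")
```

With these toy pairs there are three queries but only two distinct codes, which is the same kind of mismatch as in the paper's statistics.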

skye95git commented 3 years ago

Thanks for your reply! After your explanation, my understanding is that there are three models in your method:

The first one: your model CoCLR trained on CodeSearchNet. I have a few questions:

1. I compared the train.txt and valid.txt in CodeXGLUE/Text-Code/NL-code-search-WebQuery/data/ with those in CodeBERT/GraphCodeBERT/codesearch/dataset/python. They are the same. So, if I want to train the models myself, I can use the CodeSearchNet Python corpus directly by processing it into the same format as the code search training data in CoSQA, right?
2. After replacing the training and evaluation data, is the training command for the step 1 checkpoint the same as the one in step 2?
3. Do the checkpoint trained on CodeSearchNet and the checkpoint with the best code search results differ only in their training data? Is the training itself the same?

The second one: vanilla model. I have a question:

You wrote in this answer: "And the model without CoCLR means training with original data to some extend." The vanilla model is the model without CoCLR. What does "original data" refer to? Does it refer to CoSQA without QRA and IBA augmentation?

The third one: the model with QRA and IBA augmentation. I have a question:

Does the third model differ from the second model in the use of QRA and IBA augmentation?

Is my understanding of the above three models correct?

Jun-jie-Huang commented 3 years ago

The first one:

1. Yes, that's right.
2. I think the training command for the checkpoint in step 1 is similar to the one for training a vanilla model. You can change the training files to the processed ones.
3. No, they are not completely the same, since we didn't apply CoCLR when training on CodeSearchNet.
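As a rough sketch of the processing step mentioned in point 1, the snippet below converts lines of a train.txt-style file into JSON-like records. It assumes each line has five `<CODESPLIT>`-separated fields (label, URL, function name, docstring, code), as in the CodeBERT code search data; please check this against your own files. The output keys (`idx`, `doc`, `code`, `label`) and the `csn-train-` prefix are hypothetical and should be adjusted to whatever the training script actually expects.

```python
import json

SEP = "<CODESPLIT>"  # assumed field separator; verify against your train.txt


def line_to_record(line, index):
    """Turn one separated line into a dict with hypothetical key names."""
    # maxsplit=4 keeps any later separator occurrences inside the code field.
    label, url, func_name, doc, code = line.rstrip("\n").split(SEP, 4)
    return {
        "idx": f"csn-train-{index}",  # hypothetical id scheme
        "doc": doc,
        "code": code,
        "label": int(label),
    }


def convert(lines):
    """Convert an iterable of lines into a list of records."""
    return [line_to_record(line, i) for i, line in enumerate(lines)]


# Invented sample line, not taken from the real corpus.
sample = [
    "1<CODESPLIT>https://example.com/f.py<CODESPLIT>add"
    "<CODESPLIT>Add two numbers.<CODESPLIT>def add(a, b): return a + b\n"
]
records = convert(sample)
print(json.dumps(records[0], indent=2))
```

For a real file, the lines could be read with `open(path, encoding="utf-8")` and the resulting records written out with `json.dump`.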

The second one:

Yes

The third one:

Yes
