facebookresearch / Neural-Code-Search-Evaluation-Dataset

evaluation dataset consisting of natural language query and code snippet pairs

which are the training corpus for supervised method? #2

Open ryderling opened 4 years ago

ryderling commented 4 years ago

Hi, I have downloaded the search corpus you provided, which contains about 4,679,758 methods. When we parse these methods, however, most of them do not have docstrings (i.e., natural language descriptions); only 436,450 methods (less than 10%) have docstrings. Since your paper (https://arxiv.org/pdf/1908.09804.pdf) does not report how many methods have docstrings, do you think this is reasonable, or could you provide more information about it?

Other questions we would like to figure out are:

  1. Which dataset do you use to train your supervised baseline methods (like UNIF_android and UNIF_stackoverflow in the paper), and how many samples does it contain?
  2. I also observed that many docstrings are non-English (e.g., Chinese). How do you process these docstrings?
  3. Are the training samples included in the search corpus we downloaded? If so, does that mean the training dataset shares samples with the testing/evaluation dataset?
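On question 2, one simple heuristic for flagging non-English docstrings (a sketch of what one could do, not anything the authors describe in the paper) is to check what fraction of the characters are ASCII:

```python
# Hypothetical helper, not part of the dataset's tooling: flag docstrings
# that are mostly non-ASCII (e.g., Chinese) so they can be filtered out
# or handled separately.
def is_mostly_ascii(docstring: str, threshold: float = 0.8) -> bool:
    """Return True if at least `threshold` of the characters are ASCII."""
    if not docstring:
        return False
    ascii_count = sum(1 for ch in docstring if ord(ch) < 128)
    return ascii_count / len(docstring) >= threshold

print(is_mostly_ascii("Returns the index of the first match."))  # True
print(is_mostly_ascii("返回第一个匹配项的索引"))  # False
```

This is crude (it would misclassify English docstrings full of symbols, and accept any Latin-script language), so a proper language-identification library would be better for real filtering.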

Sorry to bother you, but I hope you can help me. Thank you very much!

xiaokongkong commented 4 years ago

I also have a question: where are the queries used for training? I hope the authors can help me. Thank you very much!

celsofranssa commented 4 years ago

A script would be useful to generate the dataset in the format of pairs (code, docstring).
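In case it helps anyone, here is a minimal sketch of such a script. It assumes each corpus entry carries the raw Java method source with an optional leading Javadoc comment; the function name and the exact corpus schema are assumptions, not the dataset's actual format:

```python
import re

# Assumption: the docstring, when present, is a /** ... */ Javadoc block
# immediately preceding the method declaration.
JAVADOC_RE = re.compile(r'^\s*/\*\*(.*?)\*/\s*', re.DOTALL)

def split_docstring(method_src: str):
    """Split a Java method string into (docstring, code).

    Returns (None, method_src) when no leading Javadoc comment is found,
    which matches the observation that most corpus methods lack docstrings.
    """
    match = JAVADOC_RE.match(method_src)
    if not match:
        return None, method_src
    # Strip the leading '*' gutter from each Javadoc line.
    doc = "\n".join(
        line.strip().lstrip("*").strip()
        for line in match.group(1).splitlines()
    ).strip()
    return doc, method_src[match.end():]

method = """/**
 * Returns the index of the first match.
 */
public int indexOfFirst(List<String> items, String target) {
    return items.indexOf(target);
}"""

doc, code = split_docstring(method)
print(doc)  # Returns the index of the first match.
```

Running this over every method in the corpus and keeping only entries where the docstring is non-None would yield the (code, docstring) pairs; Javadoc tags like @param/@return would still need to be stripped or handled separately.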

YujiaoWu-111 commented 2 years ago

A script would be useful to generate the dataset in the format of pairs (code, docstring).

@celsofranssa Hi! Where can I find the script? Please give me some suggestions. Thank you.