Models / Tokenizers - Githubissues

jose commented 1 year ago

Hi @NougatCA,

I'm trying to understand from where did you get the models/tokenizers and here is the breakdown of all models evaluated in the empirical study and listed in the pre-print.

Model	HuggingFace	Repository
PTM-NL
RoBERTa [7]	roberta-base	facebookresearch/fairseq
GPT-2 [9]	gpt2	openai/gpt-2
BART [11]	facebook/bart-base	facebookresearch/bart
T5 [10]	t5-base	google-research/text-to-text-transfer-transformer

PTM-C
CuBERT [12]	(available elsewhere)	google-research/cubert
GPT-C [14]	(available elsewhere)	---
C-BERT [13]	(available elsewhere)	---
JavaBERT [15]	CAUKiel/JavaBERT	cau-se/gh-archive-code-retrieval
CodeGPT-adapted [35]	microsoft/CodeGPT-small-java-adaptedGPT2	microsoft/CodeXGLUE
DeepDebug [16]	niuca/DeepDebug (not official)	---

CodePTM
CodeBERT [53]	microsoft/codebert-base	microsoft/CodeBERT
GraphCodeBERT [54]	microsoft/graphcodebert-base	microsoft/GraphCodeBERT
CugLM [40]	(available elsewhere)	LiuFang816/CugLM
DOBF [55]	(available elsewhere)	facebookresearch/CodeGen
T5-learning [56]	niuca/T5-learning (not official) (or available elsewhere)	antonio-mastropaolo/T5-learning-ICSE_2021
PLBART [57]	uclanlp/plbart-base	wasiahmad/PLBART
ProphetNet-Code [58]	(available elsewhere)	microsoft/ProphetNet_Code
CoTexT [59]	razent/cotext-2-cc	justinphan3110/CoTexT
TreeBERT [28]	(available elsewhere)	17385/TreeBERT
CodeT5 [62]	Salesforce/codet5-base	salesforce/CodeT5
SynCoBERT [63]	???	---
SPT-Code [29]	(available elsewhere)	NougatCA/SPT-Code
UniXcoder [64]	microsoft/unixcoder-base	microsoft/UniXcoder

(Note: SCELMo [52], OSCAR [60], and CodeDisen [61] have been excluded for several reasons. See the pre-print for more details.)

Questions/Comments regarding the table above:

Regarding the GPT-2 [9] model, why did you use the distilgpt2 model instead of the gpt2?
According to the pre-print, there was no pre-trained model neither source code of GPT-C [14], C-BERT [13], and DeepDebug [16]. Thus, you re-implemented and pre-trained all of them according to the settings (e.g., tokenizer, hyperparameters, and dataset) described in the original papers. Those are kindly provided by you in here, thanks for that.
According to the pre-print, there was no pre-trained model neither source code of SynCoBERT [63] and therefore you re-implemented and pre-trained it as described in the original paper. Did you by any chance forgot to include SynCoBERT in the zip file you kindly provided here?

-- Best, Jose

NougatCA commented 1 year ago

Hi @jose,

We did some exploratory experiments using gpt2 and distilgpt2 and found that the performance of both was similar, so we used the smaller size latter for efficiency reasons.
As for SynCoBERT, I am very sorry if it is not included in the zip file I provided, it is an oversight on our part. Unfortunately, since I'm visiting abroad right now and I saved the model on my desktop in China, I can't get the model at the moment. I will update SynCoBERT as soon as it is available and upload the model to HuggingFace.
Here are the links to CugLM, and DOBF.

jose commented 1 year ago

Thanks @NougatCA,

We did some exploratory experiments using gpt2 and distilgpt2 and found that the performance of both was similar, so we used the smaller size latter for efficiency reasons.

Which one is the smallest one, distilgpt2 or gpt2?

As for SynCoBERT, I am very sorry if it is not included in the zip file I provided, it is an oversight on our part. Unfortunately, since I'm visiting abroad right now and I saved the model on my desktop in China, I can't get the model at the moment. I will update SynCoBERT as soon as it is available and upload the model to HuggingFace.

Any chance you could provide a date of when would the model/code be available? Thanks in advance.

Here are the links to CugLM, and DOBF.

Thanks, I've just updated the table.

NougatCA / FineTuner

Models / Tokenizers #9