agemagician / CodeTrans

Pretrained Language Models for Source code
MIT License

Cannot reproduce Code Documentation Generation performance #4

Closed yuewang-cuhk closed 3 years ago

yuewang-cuhk commented 3 years ago

Hi, I've been trying to reproduce the results for Code Documentation Generation but have failed to do so. Could you please explain how you process the input (do you directly use the provided tokenized data, or manually tokenize it with tree_sitter?) and how you calculate the smoothed BLEU-4 scores? See below for the details:

Take JavaScript as an example: the results for CodeTrans-TF-Small/Base/Large reported in the paper are 17.23, 18.25, and 18.98, respectively. I first directly employed the tokenized data provided by CodeBERT (or CodeXGLUE), where my reproduced results are 15.8, 16.96, and 17.67. I also tokenized the source code using tree_sitter following your provided pipeline (i.e., CodeTrans/prediction/multitask/fine-tuning/function documentation generation/javascript/small_model.ipynb), and the obtained results are 15.28, 16.91, and 17.61.

Other facts: I calculate the smoothed BLEU-4 score following CodeXGLUE (https://github.com/microsoft/CodeXGLUE/blob/main/Code-Text/code-to-text/evaluator/evaluator.py), and I truncate the source and target sequences to 512 tokens before feeding them to the model.

We also cannot reproduce the results for other languages on the Code Documentation Generation task. Please help us resolve this. Thanks in advance!

agemagician commented 3 years ago

Hi @yuewang-cuhk ,

Thanks for your interest in our work.

In our research, we used the original T5 inference function to make the predictions: https://github.com/google-research/text-to-text-transfer-transformer#decode
Afterward, we used the CodeBERT smoothed-BLEU score function to calculate the results: https://github.com/microsoft/CodeBERT/tree/master/CodeBERT/code2nl
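(For anyone reproducing the numbers: the metric is a smoothed BLEU-4. The snippet below is only a rough NLTK stand-in to illustrate the computation, not the exact CodeBERT/CodeXGLUE script linked above, and its smoothing may differ slightly, so use the linked script for the official numbers.)

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def smoothed_bleu4(references, hypotheses):
    # references / hypotheses: lists of whitespace-tokenized strings
    refs = [[r.split()] for r in references]
    hyps = [h.split() for h in hypotheses]
    return 100 * corpus_bleu(refs, hyps, smoothing_function=SmoothingFunction().method4)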

Due to the complexity of T5 and its slow inference speed, we decided to convert all our models to the Hugging Face library, which is much faster and easier for researchers to use.

The difference in the smoothed BLEU results is due to the beam search configuration used in the T5 library compared to the Hugging Face library.

In T5, they used a beam size of 4 and a decode alpha (length penalty) of 0.6: https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/models/gin/beam_search.gin

To approximately match the same configuration in Hugging Face, you have to adjust the beam search configuration as follows:

preds = pipeline(tokenized_input,
                 min_length=1,
                 max_length=1024,
                 num_beams=4,         # beam size matching T5
                 temperature=0,
                 length_penalty=0.6)  # T5's decode alpha
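For reference, the provided Colab notebooks construct this pipeline roughly as follows (a minimal sketch; the model ID below is only an example, so check the CodeTrans model cards on the Hugging Face hub for the exact ID per language and model size):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, SummarizationPipeline

# Example model ID; pick the card matching your language and model size.
model_name = "SEBIS/code_trans_t5_small_code_documentation_generation_javascript_multitask_finetune"

pipeline = SummarizationPipeline(
    model=AutoModelForSeq2SeqLM.from_pretrained(model_name),
    tokenizer=AutoTokenizer.from_pretrained(model_name, skip_special_tokens=True),
    device=0,  # use -1 for CPU
)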
Here are the expected results of T5 vs. Hugging Face for JavaScript Code Documentation Generation:

| Library/Model | Small | Base | Large |
| --- | --- | --- | --- |
| T5 with beam search | 17.23 | 18.25 | 18.98 |
| HuggingFace without beam search | 15.8 | 16.96 | 17.67 |
| HuggingFace with beam search | 17.1 | 18.13 | 18.94 |
| T5 - HuggingFace difference (using beam search) | 0.13 | 0.12 | 0.04 |

As you can see, using the correct beam search configuration, you can approximately match the T5 results with Hugging Face. The small, insignificant difference between the T5 and Hugging Face results is due to the different implementations of beam search. For example, Hugging Face calculates the length penalty differently than T5 (which is based on Mesh TensorFlow):
https://github.com/huggingface/transformers/blob/996a315e76f6c972c854990e6114226a91bc0a90/src/transformers/generation_beam_search.py#L368
https://github.com/tensorflow/mesh/blob/985151bc4e787be3c99174d0d0eee743a4cb8561/mesh_tensorflow/beam_search.py#L261
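For reference, the two scoring rules differ roughly as follows (a simplified sketch of the linked lines; the function names are mine, not the libraries'):

def hf_beam_score(sum_logprobs, length, length_penalty=0.6):
    # Hugging Face: divide the summed log-probs by length ** length_penalty
    return sum_logprobs / (length ** length_penalty)

def mesh_tf_beam_score(sum_logprobs, length, alpha=0.6):
    # Mesh TensorFlow / T5: GNMT-style penalty ((5 + length) / 6) ** alpha
    return sum_logprobs / (((5.0 + length) / 6.0) ** alpha)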

I have created three Colab examples that should replicate and reproduce the above results:
https://colab.research.google.com/drive/10PwFRsY8P2uMc3SGr7WRgqQXFxjzbj83?usp=sharing
https://colab.research.google.com/drive/1vc84NthgeLNLxOH6eUqbh_5UIuD-Mh4s?usp=sharing
https://colab.research.google.com/drive/1YvXt5vYL6HJDPW37tWv9f_r-p-TJfqLs?usp=sharing

By simply following the above examples for the rest of the languages/models, you should be able to reproduce our results.

Regarding preprocessing: you don't need to tokenize the source code with tree_sitter for the CodeBERT dataset, because it is already preprocessed. You only need to do so if you have a new example that you want to predict on, as sketched below.
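If you do need to tokenize a new example, a rough sketch with py-tree-sitter looks like the following (assuming an older py-tree-sitter release, as used in the notebooks, and a locally cloned tree-sitter-javascript grammar; the leaf-node traversal is my own simplification of the notebooks' preprocessing):

from tree_sitter import Language, Parser

# Build the grammar once from a local clone of tree-sitter-javascript.
Language.build_library("build/languages.so", ["tree-sitter-javascript"])
JS_LANGUAGE = Language("build/languages.so", "javascript")

parser = Parser()
parser.set_language(JS_LANGUAGE)

def tokenize(code):
    # Walk the syntax tree and collect the text of its leaf nodes as tokens.
    source = bytes(code, "utf8")
    tree = parser.parse(source)
    tokens, stack = [], [tree.root_node]
    while stack:
        node = stack.pop()
        if node.child_count == 0:
            tokens.append(source[node.start_byte:node.end_byte].decode("utf8"))
        else:
            stack.extend(reversed(node.children))
    return " ".join(tokens)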

I hope the above explanation answers your questions.

Out of curiosity, why are you reproducing our results? Are you planning to use them internally at Salesforce, preparing a new publication, or something else?

yuewang-cuhk commented 3 years ago

Hi @agemagician, many thanks for your quick and detailed response! We have been able to reproduce the code documentation generation results following your instructions. We are planning to compare against CodeTrans in our new publication.

By the way, we can only find the provided training sets, not the dev and test sets. Could you also kindly share the tokenized dev and test datasets to facilitate easy comparison with CodeTrans on all downstream tasks? Thanks in advance!

agemagician commented 3 years ago

You are welcome 😃 Sure, we have updated the readme with the dataset links: https://www.dropbox.com/sh/mzxa2dq30gnot29/AABIf7wPxH5Oe0PZHJ5jPV22a?dl=0

Feel free to send me an email or a LinkedIn message if you want to discuss the new publication. My co-author @matchlesswei and I will be happy to discuss it.

yuewang-cuhk commented 3 years ago

Hi @agemagician, thanks for sharing these datasets. I've checked them and confirmed that most of them match the data statistics in the paper, except for the "SourceSum" task, where only the training set (the files ending with "_silvia.tsv") has the matching size. Could you help check this? I print the data sizes for all files in the "SourceSum" folder below:

6252 testC#
6629 testCS_silvia.tsv
2662 testPython
2659 testPython.txt
2783 testPython_silvia.tsv
2932 testSQL
3340 testSQL_silvia.tsv
49801 trainC#
52943 trainCS_silvia.tsv
11461 trainPython
11458 trainPython.txt
12004 trainPython_silvia.tsv
22492 trainSQL
25671 trainSQL_silvia.tsv
6241 valC#
2647 valPython
2651 valPython.txt
2858 valSQL

matchlesswei commented 3 years ago

@yuewang-cuhk The original dataset for SourceSum is from https://github.com/sriniiyer/codenn/tree/master/data/stackoverflow
For the training data, the "_silvia.tsv" files are the correct ones after our preprocessing.

For the test data, the CodeNN group provided human annotations for around 100 records each for C# and SQL. For example, you can find the SQL ones here: https://github.com/sriniiyer/codenn/tree/master/data/stackoverflow/sql/eval. As described in their paper, the BLEU score is calculated only for the records that have human-annotated summaries in addition to the original StackOverflow titles. We followed their procedure and evaluated only these roughly 100 records, all of which are contained in our test tsv files. This is mentioned in our paper on page 4:

Iyer et al. (2016) asked human annotators to provide two additional titles for 200 randomly chosen code snippets from the validation and test set for SQL and CSharp code. We followed their preprocessing methods and evaluation using the test dataset annotated by human annotators.