danielzuegner / code-transformer

Implementation of the paper "Language-agnostic representation learning of source code from structure and context".
https://www.in.tum.de/daml/code-transformer/
MIT License

Number of training steps needed? #18

Closed ywen666 closed 3 years ago

ywen666 commented 3 years ago

Thanks for releasing this amazing repo! The documentation is thorough and extremely helpful!

I didn't find the number of training steps or epochs needed in Appendix A.6 of the paper. I have been running python -m scripts.run-experiment code_transformer/experiments/code_transformer/code_summarization.yaml (I changed the number of layers in the yaml file from 1 to 3 according to the appendix of the paper) for over 2 days on a single GPU.
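For reference, the only edit I made to the yaml is roughly the following (I am writing the key name from memory, so it may not match the file exactly):

```yaml
num_layers: 3  # was 1 in the sample config; changed to 3 following Appendix A.6 (key name approximate)
```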

I have run for 600k steps, and the F1 score in TensorBoard (I guess this is the average F1 score over the 4 programming languages?) is around 0.27 (the micro F1 is 0.33). The number is still a bit off from Table 2. I wonder whether I should just train longer or whether something is wrong with my training.

tobias-kirschstein commented 3 years ago

Hi Yeming,

Thank you for your interest in the Code Transformer.

Which experiment are you trying to reproduce? The code_transformer/experiments/code_transformer/code_summarization.yaml is just a sample file to show the hyperparameters. If you didn't change anything there, it just trains on the Python subset ( filter_language: python ) of the multi-language dataset ( language: 'python,javascript,ruby,go' ). If you get 0.33 micro-F1, that is actually pretty close to the 34.97 we reported in Table 2 / Python / Ours / F1, so that makes sense to me. You should be able to reproduce the numbers using the hyperparameter files in code_transformer/experiments/paper, where we put the hyperparameters for all experiments reported in the paper. You can find an overview of these in the README under section 5.
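To make the distinction concrete, the two settings I am referring to look roughly like this in the sample yaml (an excerpt sketched from memory; only the two key names quoted above are taken from the file, the comments are mine):

```yaml
language: 'python,javascript,ruby,go'  # which preprocessed dataset to load (the multi-language one)
filter_language: python                # restrict training/evaluation to the Python subset of it
```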

Regarding the number of training steps: this of course depends on the dataset you are using. Usually, though, the models reached their best validation performance after having seen around 1.5 - 2 million samples (for the smaller datasets, Ruby and JavaScript) or up to 4 million samples (for the bigger datasets, multi-language and java-small). As we were using gradient accumulation with 128 samples, this would correspond to 150k or 300k gradient updates.
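In case it helps to interpret the counters: by gradient accumulation I mean the standard pattern sketched below (a minimal PyTorch sketch, not our actual training loop; MICRO_BATCH and the model/loss interface are placeholders I made up for illustration, only the 128-sample effective batch is taken from above):

```python
import torch

EFFECTIVE_BATCH = 128  # samples per gradient update (the number quoted above)
MICRO_BATCH = 8        # hypothetical per-iteration batch size; pick what fits on your GPU
ACCUM_STEPS = EFFECTIVE_BATCH // MICRO_BATCH

def train_epoch(model: torch.nn.Module,
                loader,                        # yields (inputs, targets) micro-batches of size MICRO_BATCH
                optimizer: torch.optim.Optimizer):
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader, start=1):
        loss = model(inputs, targets)          # assumption: the forward pass returns the loss
        (loss / ACCUM_STEPS).backward()        # scale so the accumulated gradient matches one 128-sample batch
        if step % ACCUM_STEPS == 0:
            optimizer.step()                   # one "gradient update" = EFFECTIVE_BATCH samples seen
            optimizer.zero_grad()
```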

Hope this helps.

ywen666 commented 3 years ago

Oh I see, there is a filter_language option in the yaml file. Thanks for the detailed explanation!