danielzuegner / code-transformer

Implementation of the paper "Language-agnostic representation learning of source code from structure and context".
https://www.in.tum.de/daml/code-transformer/
MIT License

finetune to predict docstring instead of func_name #22

Closed: who-m-i closed this issue 2 years ago

who-m-i commented 2 years ago

I want to finetune the model to predict docstrings instead of func_name. Can you please help?

tobias-kirschstein commented 2 years ago

Hi,

Thanks for your interest in the Code Transformer. We haven't explored predicting the docstrings, but with the framework from the repository you could try it as follows:

  1. The stage 2 dataset files still contain the docstrings, but they are not vocabularized. You could either write a script that encodes the words of the stage 2 samples' docstrings as IDs afterwards, or adapt the stage 2 preprocessing to also vocabularize the docstring, pretty much following the way it is done for the tokens in the body. See preprocess-2.py for how vocabularies are built. To do it properly you would also need a word count for every word that appears in the docstrings (which requires a full dataset pass), but that should be doable (see the first sketch after this list).
  2. You can adapt the CTCodeSummarizationDataset class to not return the function name as the label, but instead the vocabularized docstring from the stage 2 sample (see the second sketch after this list). This should actually simplify the code there a bit, because you no longer need to care about stripping the label from the method body tokens (as has to be done when predicting the function name).
  3. The Code Transformer architecture should in principle be capable of predicting longer sequences (such as docstrings). In our case, we restricted the output length to 6 subtokens (see NUM_SUB_TOKENS_METHOD_NAME in constants.py). When predicting docstrings it technically doesn't make much sense to talk about "subtokens" anymore, because the output domain consists of natural-language words instead of programming identifiers (such as my_variable_name). So here it might be necessary to slightly change the decoder part of the model to not use any subtokens.
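For step 1, here is a minimal sketch of how a docstring vocabulary could be built and applied in two passes. The sample iteration, the `docstring` attribute, and the vocabulary size are assumptions for illustration, not the repository's actual API; the real stage 2 preprocessing in preprocess-2.py would be the reference for how the existing vocabularies are constructed.

```python
from collections import Counter

# Hypothetical sketch: build a word-level vocabulary over all stage 2 docstrings.
# The `docstring` attribute and the constants below are assumed placeholders,
# not part of the actual code-transformer API.

SPECIAL_TOKENS = ["<PAD>", "<UNK>", "<SOS>", "<EOS>"]
VOCAB_SIZE = 10_000  # assumed limit, analogous to the token vocabularies


def build_docstring_vocabulary(samples):
    """First dataset pass: count every word that appears in a docstring."""
    counts = Counter()
    for sample in samples:
        if sample.docstring:
            counts.update(sample.docstring.lower().split())
    # Most frequent words get the lowest IDs, after the special tokens
    words = [w for w, _ in counts.most_common(VOCAB_SIZE - len(SPECIAL_TOKENS))]
    return {word: idx for idx, word in enumerate(SPECIAL_TOKENS + words)}


def encode_docstring(docstring, vocab):
    """Second pass: map each docstring word to its vocabulary ID (UNK if unseen)."""
    unk_id = vocab["<UNK>"]
    return [vocab.get(w, unk_id) for w in docstring.lower().split()]
```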
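For step 2, a rough sketch of what returning the encoded docstring as the label could look like. The class shown here is hypothetical and reuses `encode_docstring` from the sketch above; the field names and constructor are assumptions, so the actual CTCodeSummarizationDataset would need to be inspected for the real attribute names and base-class behaviour.

```python
import torch


class CTDocstringDataset:
    """Hypothetical adaptation of CTCodeSummarizationDataset:
    yield (sample, docstring label) pairs instead of (sample, function name)."""

    def __init__(self, samples, docstring_vocab, max_label_len=32):
        self.samples = samples              # stage 2 samples, assumed to expose .docstring
        self.vocab = docstring_vocab        # built with build_docstring_vocabulary above
        self.max_label_len = max_label_len  # assumed cap on docstring length

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        # Unlike the function name, the docstring is not part of the method body,
        # so nothing has to be stripped from the body tokens here.
        label_ids = encode_docstring(sample.docstring, self.vocab)[: self.max_label_len - 1]
        label_ids.append(self.vocab["<EOS>"])
        return sample, torch.tensor(label_ids, dtype=torch.long)
```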

Hope this helps. At least, it should point you in the right direction. Let me know if you have more questions.

Best, Tobias