agemagician / CodeTrans

Pretrained Language Models for Source code
MIT License
248 stars 32 forks

Any chance of actual pretraining/finetuning code? #7

Closed. StrangeTcy closed this issue 2 years ago.

StrangeTcy commented 2 years ago

Having a lot of Jupyter notebooks with pipelines for the tasks your model can perform is great, but it would also be nice to have a fine-tuning script. Ideally it would be a slight modification of the transformers run_mlm.py script, but a custom script should suffice.

rdurelli commented 2 years ago

Were you able to get the pretraining/fine-tuning code?

Thanks

diegocolombodias commented 2 years ago

It would be great!

agemagician commented 2 years ago

Hi all,

Thanks for your interest in our work.

For pretraining, we used the Text-To-Text Transfer Transformer (t5) library. The training script can be found in our shared Dropbox folder, which is linked from the main README file: https://www.dropbox.com/sh/488bq2of10r4wvw/AABrmE2V8lc8tRqV-qtLu4pUa?dl=0

For example, you can download the training script for the self-supervised pretraining of the small model, "t5_code_tasks_colab.py", here: https://www.dropbox.com/sh/488bq2of10r4wvw/AAAom4_8kfd3NgFD1SV0HqnBa/transfer_learning/pre_training/small?dl=0&subfolder_nav_tracking=1

However, the text-to-text library has changed considerably since then, and the script will need to be updated to work with the current version. Unfortunately, we don't have the capacity to do that ourselves.
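
For orientation, the call pattern of such a pretraining script with the t5 library looks roughly like the sketch below. This is only an illustration under assumptions, not a drop-in replacement: the task name "codetrans_pretrain" and the GCS paths are placeholders, and the actual task/mixture registration lives inside "t5_code_tasks_colab.py" in the Dropbox folder.

```python
# Minimal sketch of launching self-supervised pretraining with the t5 library
# (text-to-text-transfer-transformer). The task name and paths are placeholders;
# the real task registration is done in t5_code_tasks_colab.py.
import t5.models

model = t5.models.MtfModel(
    model_dir="gs://your-bucket/codetrans/small",    # placeholder checkpoint/output dir
    tpu=None,                                        # or the "grpc://..." TPU address on Colab
    model_parallelism=1,
    batch_size=256,
    sequence_length={"inputs": 512, "targets": 512},
    learning_rate_schedule=0.001,
    save_checkpoints_steps=5000,
    iterations_per_loop=100,
)

# "codetrans_pretrain" is a hypothetical name standing in for the
# self-supervised task registered by the original script.
model.train(mixture_or_task_name="codetrans_pretrain", steps=1_000_000)
```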

For fine-tuning, it should be straightforward with the Hugging Face Seq2SeqTrainer. I have created an example of fine-tuning our base model on the code translation dataset between C# and Java: https://colab.research.google.com/drive/1krt-x0LmdkuoTeHDzZkjLwLN2Vk_rNmF?usp=sharing
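
If you cannot open the Colab, the rough shape of that setup is sketched below. This is a minimal sketch under assumptions: the checkpoint name, the JSONL file paths, and the "java"/"cs" column names are placeholders for whatever model and parallel code-translation data you actually use.

```python
# Hedged sketch of fine-tuning a CodeTrans T5 checkpoint with Seq2SeqTrainer.
# Checkpoint name, file paths, and column names are assumptions; adapt them.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "SEBIS/code_trans_t5_base_transfer_learning_pretrain"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Assumed: JSONL files with parallel "java" and "cs" fields (e.g. from CodeXGLUE).
raw = load_dataset("json", data_files={"train": "train.jsonl", "validation": "valid.jsonl"})

def preprocess(batch):
    # Translate Java -> C#; swap the columns for the other direction.
    inputs = tokenizer(batch["java"], max_length=512, truncation=True)
    labels = tokenizer(batch["cs"], max_length=512, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=raw["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="codetrans-java-cs",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    predict_with_generate=True,
    save_total_limit=2,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```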

The above is only a basic example; to get the best results you will need to experiment with it, including but not limited to the following (points 1 and 2 are sketched after the list):

  1. Replacing the base model with the large one.
  2. Hyperparameter tuning: learning rate, number of training epochs, etc.
  3. Dataset cleaning and preprocessing.
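
As a concrete illustration of points 1 and 2, these map onto a couple of lines in a Seq2SeqTrainer setup like the sketch above. The large-checkpoint name and the specific hyperparameter values below are assumptions, not recommended settings:

```python
from transformers import Seq2SeqTrainingArguments

# Point 1: swap the base checkpoint for the large one (name assumed).
checkpoint = "SEBIS/code_trans_t5_large_transfer_learning_pretrain"

# Point 2: hyperparameters worth sweeping; the values here are illustrative only.
args = Seq2SeqTrainingArguments(
    output_dir="codetrans-java-cs-large",
    learning_rate=1e-4,             # try several values, e.g. 1e-5 to 3e-4
    num_train_epochs=10,            # more epochs may help on a small fine-tuning set
    per_device_train_batch_size=4,  # the large model usually needs a smaller batch
    gradient_accumulation_steps=4,  # keep the effective batch size up
    predict_with_generate=True,
)
```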