XiangLi1999 / PrefixTuning

Prefix-Tuning: Optimizing Continuous Prompts for Generation

GPT 2 prefix tuning. Input data format. #24

Open ManasiPat opened 2 years ago

ManasiPat commented 2 years ago

Hi Lisa,

I saw your video and have read your paper. Great work. I want to try prefix-tuning GPT-2 for a code summarization task and want to bring my data into the right format so it can be fed to the code as input. My data has pairs of code snippets and their corresponding summaries. Can you please guide me on how to bring it into the right format?

Thank you, Regards, Manasi

XiangLi1999 commented 2 years ago

Hello, Manasi,

Thanks for your interest. I think you could refer to the e2e data as a formatting example: https://raw.githubusercontent.com/XiangLi1999/PrefixTuning/cleaned/data/e2e_data/src1_valid.txt. It's roughly a {source} || {target} format, one example per line, used with --mode data2text. In your case a line could be print("hello world") || write hello world to standard out
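For example, a quick sketch of dumping your (code, summary) pairs into that line format could look like this (the file name and example pairs below are just placeholders):

```python
# Minimal sketch (not the repo's code): write (code, summary) pairs into the
# "{source} || {target}" one-example-per-line format of the e2e data files.
pairs = [  # placeholder examples
    ('print("hello world")', "write hello world to standard out"),
    ("x = x + 1", "increment x by one"),
]

with open("code_summarization_train.txt", "w", encoding="utf-8") as f:  # hypothetical path
    for source, target in pairs:
        # Each example must fit on a single line, so flatten multi-line snippets first.
        flat_source = source.replace("\n", " ")
        f.write(f"{flat_source} || {target}\n")
```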

Alternatively, you could customize your own data format by modifying DataCollatorForLanguageModeling and LineByLineTextDataset, and import your custom versions, as I did here: https://github.com/XiangLi1999/PrefixTuning/blob/6519d30e69b15a180f23e2cd41b766d3f62b8e82/gpt2/run_language_modeling.py#L50