ManasiPat opened this issue 2 years ago
Hello, Manasi,
Thanks for your interest. I think you could refer to the e2e data as a formatting example: https://raw.githubusercontent.com/XiangLi1999/PrefixTuning/cleaned/data/e2e_data/src1_valid.tx. The format is roughly {source} || {target}, used with --mode data2text. In your case a line could look like: print("hello world") || write hello world to standard out
Alternatively, you could customize your own data format by modifying DataCollatorForLanguageModeling and LineByLineTextDataset, and import your custom versions, as I did here: https://github.com/XiangLi1999/PrefixTuning/blob/6519d30e69b15a180f23e2cd41b766d3f62b8e82/gpt2/run_language_modeling.py#L50
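To make the first option concrete, here is a minimal sketch (not code from the repo) that converts (code, summary) pairs into the rough {source} || {target} line format described above. The pair list, output file name, and helper function are all hypothetical:

```python
def format_pair(source: str, target: str) -> str:
    """Join a source/target pair with the ' || ' separator,
    collapsing whitespace in the code so each example stays on one line."""
    one_line = " ".join(source.split())
    return f"{one_line} || {target.strip()}"

# Hypothetical (code snippet, summary) pairs for illustration.
pairs = [
    ('print("hello world")', "write hello world to standard out"),
    ("x = x + 1", "increment x by one"),
]

# Write one example per line, as in the e2e data files.
with open("code_summary_train.txt", "w", encoding="utf-8") as f:
    for src, tgt in pairs:
        f.write(format_pair(src, tgt) + "\n")
```

Each line of the resulting file then holds one training example in the {source} || {target} shape that --mode data2text expects.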
Hi Lisa,
I saw your video and have read your paper. Great work. I want to try prefix-tuning GPT2 for a code summarization task, and I need to bring my data into the right format to feed to the code as input. My data has pairs of code snippets and their corresponding summaries. Could you please guide me on bringing it into the right format?
Thank you, Regards, Manasi