XiangLi1999 / PrefixTuning

Prefix-Tuning: Optimizing Continuous Prompts for Generation

GPT 2 prefix tuning. Input data format. #24

Open ManasiPat opened 2 years ago

ManasiPat commented 2 years ago

Hi Lisa,

I saw your video and have read your paper. Great work. I want to try prefix-tuning GPT-2 for a code summarization task and want to bring my data into the right format so it can be fed to the code as input. My data has pairs of code snippets and their corresponding summaries. Can you please guide me on how to bring it into the right format?

Thank you, Regards, Manasi

XiangLi1999 commented 2 years ago

Hello, Manasi,

Thanks for your interest. I think you could refer to the e2e data as a formatting example: https://raw.githubusercontent.com/XiangLi1999/PrefixTuning/cleaned/data/e2e_data/src1_valid.txt. It's roughly a {source} || {target} format, one example per line, used with --mode data2text. In your case a line could be print("hello world") || write hello world to standard out
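For example, a quick sketch of dumping your (code, summary) pairs into that line format could look like this (the file name and example pairs below are just placeholders):

```python
# Minimal sketch (not the repo's code): write (code, summary) pairs into the
# "{source} || {target}" one-example-per-line format of the e2e data files.
pairs = [  # placeholder examples
    ('print("hello world")', "write hello world to standard out"),
    ("x = x + 1", "increment x by one"),
]

with open("code_summarization_train.txt", "w", encoding="utf-8") as f:  # hypothetical path
    for source, target in pairs:
        # Each example must fit on a single line, so flatten multi-line snippets first.
        flat_source = source.replace("\n", " ")
        f.write(f"{flat_source} || {target}\n")
```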

Alternatively, you could customize your own data format by modifying DataCollatorForLanguageModeling and LineByLineTextDataset, and import your custom versions, as I did here: https://github.com/XiangLi1999/PrefixTuning/blob/6519d30e69b15a180f23e2cd41b766d3f62b8e82/gpt2/run_language_modeling.py#L50