Closed twotwoiscute closed 1 year ago
Hi. Not all pretraining data needs to be converted into instruction-tuning form, since an SFT'ed model can generalize across instructions. If you want to target specific instructions, one option is to write a "seed prompt" (e.g., "Summarize the following content") and fetch the output from an OpenAI model,
or from the tigerbot API, e.g., "Formulate five questions from the above content, and answer them."
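A minimal sketch of that seed-prompt approach, assuming the record keys from tigerbot-wiki-plugin; the `model_response` would in practice come from an OpenAI or tigerbot API call (the function and field names here are illustrative, not the repository's actual code):

```python
# Hypothetical helper: wrap a raw wiki entry into an instruction-tuning record.
# Assumption: records look like tigerbot-wiki-plugin entries with
# ["content", "wiki_id", "url"] keys; only "content" carries training signal.

SEED_PROMPT = "Summarize the following content:"

def to_instruction_record(wiki_record: dict, model_response: str) -> dict:
    """Build an (instruction, input, output) triple from a wiki entry.

    `model_response` stands in for the completion you would fetch from a
    teacher model (e.g., an OpenAI or tigerbot endpoint) given the seed
    prompt plus the content.
    """
    return {
        "instruction": SEED_PROMPT,
        "input": wiki_record["content"],  # drop "wiki_id" / "url" metadata
        "output": model_response,
    }

# Toy example with a made-up entry and a made-up teacher response.
raw = {
    "content": "Tigers are the largest living cat species.",
    "wiki_id": "12345",
    "url": "https://example.org/wiki/Tiger",
}
record = to_instruction_record(raw, model_response="Tigers are the biggest cats.")
print(record["instruction"])
```

Swapping `SEED_PROMPT` for something like "Formulate five questions from the above content, and answer them" yields Q&A-style records from the same pipeline.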
Hi, thanks for the great work. While browsing the website I found the dataset
tigerbot-wiki-plugin, whose keys are ["content", "wiki_id", "url"]; I believe the only part worth learning from is the "content" field.
I understand that instruction tuning depends on how you clean and collect the data, so I would like to ask: how do you transform this data into an instruction-tuning-like format? Thanks.