How to convert data like tigerbot-wiki-plugin into an instruction-tuning-like format?

TigerResearch / TigerBot

TigerBot: A multi-language multi-task LLM

https://www.tigerbot.com

Apache License 2.0


How to convert data like tigerbot-wiki-plugin into an instruction-tuning-like format? #52

Closed twotwoiscute closed 1 year ago

twotwoiscute commented 1 year ago

Hi, thanks for the great work. While browsing the website I found the tigerbot-wiki-plugin dataset. Its keys are ["content", "wiki_id", "url"], and I believe the only part worth learning from is "content".

I assume the instruction-tuning format comes from how you clean and collect the data. How do you convert this data into an "instruction-tuning-like format"? Thanks.

i4never commented 1 year ago

Hi. Not all pretraining data needs to be converted into instruction-tuning form, since an SFT-ed model can generalize to instructions. If you care about specific instructions, one possible way is to write your own "seed prompt" (like "Summarize the following content") and fetch the output from an OpenAI model.

chentigerye commented 1 year ago

Or from the TigerBot API, e.g., "Formulate five questions from the above content, and answer them."
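
The seed-prompt approach suggested in the two replies above could be sketched roughly as follows. This is only an illustration, not the TigerBot team's actual pipeline: the `instruction` / `input` / `output` record layout is an assumed Alpaca-style convention, `build_sft_records` and `fake_generate` are hypothetical names, and the sample row is made up. The `generate` callable is a placeholder for whichever model API you use (OpenAI, TigerBot, or another); no real API call is shown here.

```python
import json
from typing import Callable, Dict, Iterable, List, Tuple

# (instruction text, prompt template) pairs; the two seed prompts are the
# ones quoted in this thread, the templates around them are assumptions.
SEED_PROMPTS: List[Tuple[str, str]] = [
    ("Summarize the following content.",
     "Summarize the following content.\n\n{content}"),
    ("Formulate five questions from the above content, and answer them.",
     "{content}\n\nFormulate five questions from the above content, and answer them."),
]


def build_sft_records(
    rows: Iterable[Dict[str, str]],
    generate: Callable[[str], str],
    seed_prompts: List[Tuple[str, str]] = SEED_PROMPTS,
) -> List[Dict[str, str]]:
    """Turn rows with a "content" key into instruction-tuning records.

    `generate` is any function that sends a prompt to a model and returns
    its text reply (hypothetical; plug in your own API client).
    """
    records = []
    for row in rows:
        content = row.get("content", "").strip()
        if not content:
            continue  # skip rows with nothing to learn from
        for instruction, template in seed_prompts:
            prompt = template.format(content=content)
            records.append({
                "instruction": instruction,
                "input": content,
                "output": generate(prompt),
            })
    return records


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return f"[model output for a {len(prompt)}-character prompt]"


# Made-up sample row using the dataset's keys: content, wiki_id, url.
rows = [
    {"content": "The tiger is a large cat.", "wiki_id": "1",
     "url": "https://example.org/wiki/1"},
    {"content": "   ", "wiki_id": "2", "url": "https://example.org/wiki/2"},
]

records = build_sft_records(rows, fake_generate)
for record in records:
    print(json.dumps(record, ensure_ascii=False))
```

Each wiki row yields one record per seed prompt, so the set of seed prompts controls which instructions the resulting data covers. Replacing `fake_generate` with a real client call is the only change needed to produce actual outputs.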