prompteus commented 1 year ago

A large part of making the assistant is to teach it to follow instructions. While training using RLHF seems like the main ingredient, there are already prepared supervised instruction-following datasets that might help with data/feedback scarcity. As far as I know, there are two large-scale projects in the area: Promptsource and Natural Instructions. These projects crowdsourced templates and turned supervised NLP datasets into instruction-following datasets.

Promptsource

GitHub: https://github.com/bigscience-workshop/promptsource
project for preparing templates and working with them
they generated a dataset using the templates:
- https://huggingface.co/datasets/bigscience/P3
- https://huggingface.co/datasets/bigscience/xP3 (with multilingual data but English prompt)
- https://huggingface.co/datasets/bigscience/xP3mt (with multilingual data and machine-translated prompt)
they trained zero-shot models (= models for following instructions in the input)
- based on T5 architecture (encoder-decoder) called T0 family (and MT0 for multilingual)
- and based on GPT architecture (decoder-only) called BloomZ family
- Huggingface demo: T0, MT0, BloomZ,
- GitHub repo for T0: https://github.com/bigscience-workshop/t-zero
- GitHub repo for BloomZ and MT0: https://github.com/bigscience-workshop/xmtf
- paper: Multitask Prompted Training Enables Zero-Shot Task Generalization

Natural instructions

GitHub: https://github.com/allenai/natural-instructions
they crowdsource directly the data prepared for instruction following (and learning from a few examples)
the GitHub repo = the dataset. It contains jsons
they trained zero-shot and in-context few-shot models (in multiple sizes):
- mT5 architecture (encoder-decoder, multilingual pretraining)
- Huggingface demo few-shot: https://huggingface.co/allenai/tk-instruct-3b-def-pos
- Huggingface demo zero-shot: https://huggingface.co/allenai/tk-instruct-3b-def
- paper: Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks

Are there any plans to train or pretrain on supervised data?

Alternative use of the projects would be to use the trained models as a starting checkpoint. Maybe there are some other ways to benefit from the projects as well, but pretraining or using the models as an initial checkpoint are the most obvious.

justheuristic commented 1 year ago

include std_i_am_not_a_data_expert.h

Blended skill talk

a dataset of conversations that demonstrate knowledge, empathy or personality
available here: https://huggingface.co/datasets/blended_skill_talk
facebook trained several BlenderBot models on that
- example: https://huggingface.co/facebook/blenderbot_small-90M

huu4ontocord commented 1 year ago

Great ideas! And thank you for the resources.

We can certainly create more like these and reuse some of these. the issue here is that the p3 and natural instructions one are a bit "academic" - sometimes just rephrasing NLI for example. which is not realistic covnersation. so we would need to augment and fix these where we can. otherwise we keep in.

@justheuristic great to see you! yes - the blnderbot and the different datasets alot of chatbots use are close but not exactly what we need. the issue with these is that the answers are sometimes very short and "chit-chat". We want more detailed answers. So we would need to augment some of these answers with generated and/or retreived text to fill in the answers.

If either of you @markcheeky or @justheuristic have ideas on how we can augment these datasets to make them sound more like chatgpt type responses, that would be really cool to.

justheuristic commented 1 year ago

Here's a zero-brain-cell suggestion: once we have a better-than-nothing model (e.g. after supervised finetuning), we could use "knowledge demonstration" as contexts for human annotators (RLHF-style).

So, you take a dialogue up to a certain point, then let the model generate several responses and ask humans to rate which of the responses is more helpful. Still, it looks like not the first priority r/n.

yk commented 1 year ago

@justheuristic nice idea. InstructGPT does this fully with their model, but adding more diverse data to the mix might certainly help.

@markcheeky @justheuristic could you make something like a datasets.md inside docs/ to record these datasets and briefly describe them (essentially copy-paste the information you wrote here)?

christophschuhmann commented 1 year ago

https://github.com/skywalker023/sodaverse https://arxiv.org/abs/2212.10465 We present SODA: the first publicly available, million-scale high-quality social dialogue dataset. Using SODA, we train COSMO: a generalizable conversation agent outperforming previous best-performing agents on both in- and out-of-domain datasets. In contrast to most existing crowdsourced, small-scale dialogue corpora, we distill 1.5M socially-grounded dialogues from a pre-trained language model (InstructGPT; Ouyang et al., 2022). Dialogues are distilled by contextualizing social commonsense knowledge from a knowledge graph (Atomic10x; West et al., 2022). Human evaluation shows that dialogues in SODA are more consistent, specific, and (surprisingly) natural than prior human-authored datasets - e.g., DailyDialog (Li et al., 2017), BlendedSkillTalk (Smith et al., 2020). In addition, extensive evaluations show that COSMO is significantly more natural and consistent on unseen datasets than best-performing dialogue models - e.g., GODEL (Peng et al., 2022), BlenderBot (Roller et al., 2021), DialoGPT (Zhang et al., 2020). Furthermore, it is sometimes even preferred to the original human-written gold responses. We make our data, models, and code public.

christophschuhmann commented 1 year ago

We should train on summary - fultext - pairs to perform 1.) Reverse Summarization, writing a full text given a summary, and 2.) writing the summary given the ful text

We made a pretty big dataset by merging several others: https://docs.google.com/spreadsheets/d/1DEKeF1kdF3O7e5xQn3O85ibnvvtjxODQb3KXhRybMYk/edit?usp=sharing

Need to upload it.

prompteus commented 1 year ago

@ontocord Yes, I agree, part of the tasks inside promptsource / natural instructions are academic, but it would still teach the model to 1) understand instructions and 2) pick up a lot of abilities. Rephrasing ANLI is definitely not a realistic conversation, but pretraining on it should boost the performance in dialogs that require reasoning.

A potential issue I can see is that it would bias the model to write overly brief responses, but I don't think it's a big deal because the tasks in P3 & NI include generative "reverse" tasks as well - e.g. in the case of the entailment you mentioned, there is also a task: "Given a sentence, write another sentence that is a likely result of it." So I guess the generative tasks (longest-ouput) should compensate classification tasks (shortest output). And in any case, it should be easy to just filter out most of short-output instances from the dataset.

Training on supervised dialog data, as @christophschuhmann suggested, seems like a good idea to me. I think pretraining on a mixture of an instruction-following dataset and a dialog dataset should give a model that can perform many things but still "talk naturally" in a dialog.

@christophschuhmann ad summarization - promptsource and natural instructions also contain tasks for summarization (and its reverse task - text generation given a summary). Is there a reason to focus on this one task in particular? I might be wrong, but It seems to me that focusing on a single task would cause the model to lose generality.

huu4ontocord commented 1 year ago

@markcheeky would you like to help augment some of the p3/xp3 and natural instruction data to improve it, esp. with multi-step dailog? I think that would be useful to get the model to understand instructions. Create a notebook and experiment? I have the p3 data linearlized so you don't have to download from HF. One thing i was hoping we could do is get an LLM to infer explanations from the p3/ni data.

prompteus commented 1 year ago

@ontocord I'm currently having an exam period, so I'm unavailable for the next 3 weeks, but I'd be happy to help with it afterward.

In the meantime, I will think about the best way how to make the data more dialog-like and hope someone comes up with something as well.

Regarding the explanations you mentioned. Nat-Inst. dataset even has some. The data was designed so that each task contains:

instructions
input-output pairs (sampled from the underlying dataset for the task, typically several hundred or a few thousand)
several correct and incorrect input+output pairs called positive/negative demonstrations of the task, which they put into the prompt after instructions during training of tk-Instruct models. Demonstrations also have an attribute with a human-written explanation, but there are only a few demonstrations per task. I would estimate that there are several thousand explanations in the whole Nat-Inst. dataset. In my experience, for many tasks, it does not make sense to explain the output because it is self-explanatory - for example in extractive QA or text generation. Nevertheless, all tasks in Nat. Inst. have a few demonstrations with explanations, so we can use them.

I also found out that promptsource has a hosted interactive data viewer, that might be handy: https://huggingface.co/spaces/bigscience/promptsource Afaik, P3 nor xP3 have any explanations but have other advantages - for example, multiple instructions for the same task (T0 paper authors argue that it reduces over-sensitivity to the wording of the instructions).

prompteus commented 1 year ago

Just found this doc by Yao Fu at Allen AI about the evolution of OpenAI models from GPT-3 to now: How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources

It seems likely that ChatGPT used supervised instruction finetuning before training with RL-HF, suggesting that the supervised data might be useful even without dialog form. Maybe just mixing it with a dialog dataset would give us a good starting checkpoint for RL-HF, even if we don't come up with a scalable way of making the data more dialog-like.

huu4ontocord commented 1 year ago

Yes - supervised instruction finetuning is our first step.. We are doing that right now with the instruction data we are creating. However, to get it to understand longer context, i think we may need to do dialog chains. this is my hypothesis, b.c our models are not large models (maximum 20B). I think supervised dialog instruction finetuning would be helpful. Also, we don't know the "extra" step openai used to turn instructgpt->chatgpt.

huu4ontocord commented 1 year ago

pinging this issue again. Would be good if we could convert some of the p3/xp3 data into instructions. this is a good complement for #560

Also do some algoritmic cleaning up. There are many questions that are just very simple, which we might want to filter out. Maybe also filter out very short answers. And for longer answers, increase the length further by using JT to do completion.

i think we will have good diversity (even if very academic dataset) by the unfiiedqa + p3/xp3.

huu4ontocord commented 1 year ago

@markcheeky - ping

sxthunder commented 1 year ago

Hello, I recently read a collections of instruction tuning papers like Flan, T0, NIV2. And I agree with your idea that a pre-trained model should first instruction tuning with open source dataset, to make it understand how to follow instruction and enhance its zero-shot ability on unseen task. This is the latest flan paper:The Flan Collection: Designing Data and Methods for Effective Instruction Tuning , and I think its the biggest open-source instruction-dataset which contain other datasets like P3 and NIV2, up to 1800+ tasks. here is my zero-brain-cell thought:

it is necessary to finetune a naive pre-training model by dataset-instruct-tuning(using open source dataset with instructions, like Flan and T0). To preliminary learning how to follow a instruction, and improves its zero-shot ability on unseen task.
using some hand-write instruction dataset(like SFT step of instruct-gpt), because instructions in step 1 has some difference in open-ended scenes. People will directly ask chatgpt "how to learning deep learning" without the instruction "ask the following question"
finally using RLFH to align the model output to human

andreaskoepf commented 1 year ago

We followed a 2-stage approach for several of our SFTs: 1st stage: instruction tuning, 2nd stage: high-quality human demonstrations (e.g. OASST dataset).

LAION-AI / Open-Assistant

Supervised data #186

Promptsource

Natural instructions

include std_i_am_not_a_data_expert.h

Blended skill talk