argilla-io / distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
https://distilabel.argilla.io
Apache License 2.0

[FEATURE] Introduce FollowUp generator #157

Closed: dvsrepo closed this 6 months ago

dvsrepo commented 10 months ago

Description

A high-impact task for distilabel is one that generates follow-up turns or multi-turn dialogues (which can then be critiqued/ranked).

Given a conversation (or at least a prompt+response pair), the generator will produce a follow-up message (from the user role).

Ideally the input would be a standard conversation/list-of-messages format (like the one we use in uf, zephyr, etc.). This format can be used to build the generator prompt and ask it to generate a follow-up message.
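For concreteness, this is the kind of list-of-messages input meant here (OpenAI-style chat format; the example conversation is made up):

# The generator would take a conversation like this and produce the next
# message from the user role.
conversation = [
    {"role": "user", "content": "How do I read a CSV file in Python?"},
    {"role": "assistant", "content": "Use the csv module, or pandas.read_csv for dataframes."},
]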

This can be developed in parallel or before #130

Open questions:

  1. Should we also build/test a generator that produces not only the follow-up message but also the response? My concern is that this is less composable than Instructions -> MultiTurner -> and then our current pipelines: Response generator -> Labeler (see the sketch after the image below).
  2. If we do the above, can we define the number of turns to generate? My concern is that asking too much of this component will end up producing Alpaca-level quality (bad) turns.
  3. Is this a good image for the MultiTurner task?

[image: multiturner]
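For question 1, a rough sketch of the composable chain (all pipeline names are hypothetical; only the generate call mirrors the API used below):

# Sketch only: each stage is its own pipeline, so any stage can be re-run
# or swapped independently and quality issues stay isolated per stage.
with_followups = multiturner_pipe.generate(dataset=instructions_ds)  # MultiTurner
with_responses = response_pipe.generate(dataset=with_followups)      # Response generator
preference_ds = labeler_pipe.generate(dataset=with_responses)        # Labeler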

dvsrepo commented 10 months ago

I have prototyped a very basic/dirty code approach. Ideally we'd like to get inputs in a conversation format [{"content": ..., "role": "user"}, ...] and loop through that in a Jinja template to fill in the conversation.
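A minimal sketch of that idea, assuming the OpenAI-style message list above and using jinja2 directly (illustration only, not part of the prototype):

from jinja2 import Template

# Render an arbitrary-length conversation into the [USER]/[AI ASSISTANT]
# layout used by the prompt below, leaving a trailing [USER] tag to fill in.
conversation_template = Template(
    "{% for m in messages %}"
    "[{{ 'USER' if m.role == 'user' else 'AI ASSISTANT' }}]\n{{ m.content }}\n"
    "{% endfor %}"
    "[USER]\n"
)

The prototype below sticks to a plain format string for a single instruction/response pair: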

import os
from dataclasses import dataclass
from typing import Dict

from datasets import load_dataset
from distilabel.llm import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.tasks import TextGenerationTask
from distilabel.tasks.prompt import Prompt

multiturner_prompt = """Please read the following conversation between a USER and an AI ASSISTANT and write a follow up message question from the USER.
The follow up question from the user should be highly related to the previous interaction, direct, concise, logically sound, and sometimes challenging for the Assistant.
Avoid superfluous text praising the response, giving thanks, and remember users don't waste words giving thanks but are rather very direct with AI assistants.

[USER]
{instruction}
[AI ASSISTANT]
{generation}
[USER]
"""

@dataclass
class MultiTurner(TextGenerationTask):
    system_prompt: str = "You are exceptionally skilled at crafting highly interesting conversations and sometimes challenging conversations between a user and AI assistants"

    def generate_prompt(self, input: Dict[str, str]) -> Prompt:
        # Fill the template with the last user instruction and the assistant's
        # response, asking the model for the next user turn.
        formatted_prompt = multiturner_prompt.format(
            instruction=input["instruction"], generation=input["generation"]
        )
        return Prompt(
            system_prompt=self.system_prompt,
            formatted_prompt=formatted_prompt,
        )

    def parse_output(self, output: str) -> Dict[str, str]:
        # The raw completion is the follow-up question itself.
        return {"generations": output}

generator = OpenAILLM(
    model="gpt-3.5-turbo",
    task=MultiTurner(),
    max_new_tokens=1024,
    num_threads=4,
    openai_api_key=os.getenv("OPENAI_API_KEY", None),
    temperature=0.7
)
pipe = Pipeline(generator=generator)

dataset = load_dataset("argilla/ultrafeedback-binarized-preferences-cleaned", split="train")

def generate_input(r):
    # "chosen" is a list of {"role", "content"} messages; take the first
    # user instruction and the first assistant response as the seed pair.
    return {
        "input": {
            "instruction": r["chosen"][0]["content"],
            "generation": r["chosen"][1]["content"],
        }
    }

dataset = dataset.select(range(10)).map(generate_input)
generated_ds = pipe.generate(dataset=dataset)

This generates relatively good follow-up questions (looking at a small sample).

alvarobartt commented 10 months ago

Hi @dvsrepo! Thanks for the detailed issue 🤗 I have one doubt w.r.t. the naming of the task: when you say multi-turn, do you mean that the task receives a list of assistant-human interactions and fills in the next one in the sequence, or that it chains those to generate N turns from the one provided? The first seems feasible and could be easily integrated anytime, but if the second implies re-using the generated content to generate more and chaining that sequentially N times, that may be more complex with the current approach. We can talk about it!

dvsrepo commented 10 months ago

Hi @alvarobartt, this is discussed in open question 2.

I'd like to start with generating one more user message, but the purpose of this component is to build multi-turn datasets, even if that means running this pipeline several times or chaining it with a response generation task (a rough sketch of that chaining follows).
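# Sketch only (pipeline names hypothetical, not the current distilabel API):
# alternate follow-up generation and response generation to grow each
# conversation by one full turn per iteration.
ds = seed_dataset
for _ in range(num_turns):
    ds = followup_pipe.generate(dataset=ds)  # adds the next user message
    ds = response_pipe.generate(dataset=ds)  # adds the assistant reply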

The current approach just visually shows what I have in mind; it is not intended to cover the open questions. As mentioned in question 2, generating full multi-turn dialogues will have an impact on quality and will be more complex, as you highlight.

I don't care much about the name; for me multi-turn expresses the final utility of what can be achieved with this component, but we can change it to FollowUp or something like that.

dvsrepo commented 10 months ago

This is my working hacky example with UltraFeedback:

1. Generate dataset with follow-up questions

If this task returns conversations in the OpenAI conversation format, we could chain it more easily with the response generation pipeline, and potentially with more rounds of follow-up generation (to generate several turns). Something like:

import os
from dataclasses import dataclass
from typing import Dict

from datasets import load_dataset
from distilabel.llm import LLMPool, OpenAILLM, ProcessLLM
from distilabel.pipeline import Pipeline
from distilabel.tasks import TextGenerationTask
from distilabel.tasks.prompt import Prompt

# I think we could define this as a jinja template
# and use the OpenAI chat format to render the conversation (containing arbitrary turns)
multiturner_prompt = """Please read the following conversation between a USER and an AI ASSISTANT and write a follow up message question from the USER.
The follow up question from the user should be highly related to the previous interaction, direct, concise, logically sound, and sometimes challenging for the Assistant.
Avoid superfluous text praising the response, giving thanks, and remember users don't waste words giving thanks but are rather very direct with AI assistants.

[user]
{instruction}
[assistant]
{generation}
[user]
"""

@dataclass
class MultiTurner(TextGenerationTask):
    system_prompt: str = "You are exceptionally skilled at crafting highly interesting conversations and sometimes challenging conversations between a user and AI assistants"

    # if this accepted the OpenAI chat format it would be awesome,
    # so it's more chainable
    def generate_prompt(self, input: Dict[str, str]) -> Prompt:
        formatted_prompt = multiturner_prompt.format(
            instruction=input["instruction"], generation=input["generation"]
        )
        return Prompt(
            system_prompt=self.system_prompt,
            formatted_prompt=formatted_prompt,
        )

    # should this return the OpenAI format too,
    # with the follow-up message as a new message?
    def parse_output(self, output: str) -> Dict[str, str]:
        return {"generations": output}

dataset = load_dataset("argilla/ultrafeedback-binarized-preferences-cleaned", split="train")

generator = OpenAILLM(
    model="gpt-3.5-turbo",
    task=MultiTurner(),
    max_new_tokens=1024,
    num_threads=8,
    openai_api_key=os.getenv("OPENAI_API_KEY", None),
    temperature=0.7
)
pipe = Pipeline(generator=generator)

def generate_input(r):
    # "chosen" holds {"role", "content"} messages; seed with the first pair.
    return {
        "input": {
            "instruction": r["chosen"][0]["content"],
            "generation": r["chosen"][1]["content"],
        }
    }
dataset = dataset.shuffle().select(range(6000)).map(generate_input)

generated_ds = pipe.generate(dataset=dataset)
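On the inline questions about the chat format: a minimal sketch of a chainable parse_output, assuming the output only needs to carry the new turn (hypothetical, not the current API):

from typing import Dict, List

# Hypothetical variant: return the follow-up as an OpenAI-style message so
# the next pipeline can append it to the conversation without string parsing.
def parse_output(self, output: str) -> Dict[str, List[Dict[str, str]]]:
    return {"generations": [{"role": "user", "content": output.strip()}]}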

2. Generate dataset with responses to follow-up questions using LLMPool


def make_input(r):
    # Flatten the original conversation plus the generated follow-up into a
    # single prompt string, leaving an open [assistant] tag to complete.
    input = []
    for message in r["chosen"]:
        input.append(f"[{message['role']}]")
        input.append(f"{message['content']}")
    input.append(f"[user]\n{r['followup'][0]}")
    input.append("[assistant]\n")
    return {"input": "\n".join(input)}

# this is pretty hacky, I don't know why I did it like this;
# if we could leverage a standard format as output of the previous pipeline that would be cool
ds = generated_ds.filter(lambda r: r["generations"] is not None).rename_columns({"generations": "followup"}).map(make_input)

def load_gpt3(task):
    from distilabel.llm import OpenAILLM

    return OpenAILLM(
        model="gpt-3.5-turbo",
        task=task,
        openai_api_key=os.getenv("OPENAI_API_KEY"),
        max_new_tokens=1024,
        num_threads=8,
        temperature=1.0
    )

def load_gpt4(task):
    from distilabel.llm import OpenAILLM

    return OpenAILLM(
        model="gpt-4",
        task=task,
        openai_api_key=os.getenv("OPENAI_API_KEY"),
        max_new_tokens=1024,
        num_threads=8,
        temperature=1.0
    )

generator = LLMPool(
    [
        ProcessLLM(task=TextGenerationTask(), load_llm_fn=load_gpt4),
        ProcessLLM(task=TextGenerationTask(), load_llm_fn=load_gpt3),
    ]
)

pipeline = Pipeline(generator=generator)
generated = pipeline.generate(dataset=ds.select(range(100)), num_generations=2, batch_size=1)

3. Generate preference dataset

We might want to add a tuned version of UltraFeedback that clearly indicates there's a chat history and that the labeler should focus on the last response in the interaction.

from distilabel.tasks.preference.ultrafeedback import UltraFeedbackTask
from distilabel.llm.openai import OpenAILLM

task = UltraFeedbackTask.for_text_quality(
    task_description="\n# General Response Quality and Accuracy Assessment\nEvaluate the assistant's outputs based on various criteria:\n1. **Correctness & Informativeness**: Does the output provide accurate and helpful information?\n2. **Honesty & Uncertainty**: How confidently does the assistant convey its information, and does it express uncertainty appropriately?\n3. **Truthfulness & Hallucination**: Does the assistant introduce misleading or fabricated details?\n4. **Instruction Following**: Does the assistant's output align with given instructions and the user's intent?\nYour role is to provide a holistic assessment considering all the above factors focusing only on the response to the last question of the [user]. Use the full conversation only for context but focus on rating what response is better and more appropriate. Even if they are both almost correct please highlight the differences in the rating and the rationale.\n\n**Scoring**: Rate outputs from 1 to 5 based on the overall quality, providing a single number not 5/5 or something similar, considering all aspects:\n"
)

labeller = OpenAILLM(
        model="gpt-4",
        task=task,
        openai_api_key=os.getenv("OPENAI_API_KEY"),
        max_new_tokens=1024,
        num_threads=8
)

pref_pipe = Pipeline(
    labeller=labeller
)

labelled2 = pref_pipe.generate(dataset=generated.select(range(5)), num_generations=2, batch_size=4)

dvsrepo commented 10 months ago

@alvarobartt and @plaguss I have included my full steps for the PoC; tons of improvements are possible, especially for defining inputs and outputs in a way that makes it easier to chain these pipelines.

alvarobartt commented 10 months ago

Hi here! I've discussed some potential improvements w.r.t. how the Prompt is defined, and also w.r.t. defining responsibilities across the different classes, in order to move some LLM-specific stuff to the LLMs while simplifying the Prompt dataclass and providing some formatting helpers. That said, most likely we'll end up with chat or instruct formats and functions to prepare those.
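For illustration, one shape such a helper could take (a sketch assuming Prompt keeps its two current fields; not a committed API):

from typing import Dict, List

# Hypothetical formatting helper: render a Prompt into the OpenAI chat
# format so chat-tuned LLMs can consume it without task-specific strings.
def to_chat_format(prompt: Prompt) -> List[Dict[str, str]]:
    return [
        {"role": "system", "content": prompt.system_prompt},
        {"role": "user", "content": prompt.formatted_prompt},
    ]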

Finally, regarding the variable naming and chaining, we should define what the pain points are as of now, and find a nice way to tackle them with minimal impact.

cc @gabrielmbmb