argilla-io / distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
https://distilabel.argilla.io
Apache License 2.0

Allow mixing different tasks in an `LLMPool` #207

Closed: plaguss closed this issue 10 months ago

plaguss commented 10 months ago

Is your feature request related to a problem? Please describe.

When generating data with an `LLMPool`, we could use different (but similar) tasks depending on the LLM:

```python
pool_generation_task = TextGenerationTask(system_prompt="...")

llm_pool = LLMPool(
    [
        ProcessLLM(task=pool_generation_task, load_llm_fn=load_notus),
        ProcessLLM(task=MagicoderTask(), load_llm_fn=load_magicoder),
        ProcessLLM(task=pool_generation_task, load_llm_fn=load_starling),
    ]
)
```

As of now, the following check in `llm/base.py` prevents the previous snippet from working.

[screenshot: the task equality check in `llm/base.py`]

Describe the solution you'd like

Describe alternatives you've considered

Avoid mixing tasks.

Additional context

As discussed with @gabrielmbmb internally, a solution could be the following: instead of checking that the tasks of the LLMs provided to the `LLMPool` are the same, we could check that their `output_args_names` are the same.
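A minimal sketch of that relaxed validation, using hypothetical stand-in classes (the real check would live in `llm/base.py` and operate on distilabel's actual task objects):

```python
# Hypothetical sketch of the proposed check: instead of requiring that all
# tasks in the pool are equal, only require that they share the same
# output_args_names. The classes below are stand-ins, not distilabel code.
class Task:
    @property
    def output_args_names(self):
        return ["generations"]

class CodingTask(Task):
    """A different task that still produces the same output columns."""

def validate_pool_tasks(tasks):
    # Collect the distinct output signatures; mixing task classes is fine
    # as long as there is exactly one signature across the whole pool.
    signatures = {tuple(t.output_args_names) for t in tasks}
    if len(signatures) != 1:
        raise ValueError(f"Tasks expose different output_args_names: {signatures}")

validate_pool_tasks([Task(), CodingTask(), Task()])  # passes: shared signature
```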

alvarobartt commented 10 months ago

Hmm, AFAIK that can already be done by changing the task args: since `TextGenerationTask` is abstract and we can freely add any instruction for the generation, the scenario where 3 LLMs generate for different contexts is already achievable with the minimal implementation of `TextGenerationTask`. So I'm not sure this is a needed feature. Could you please elaborate a bit on the scenarios and on the cons/limitations of the current approach? Thanks in advance 🤗

plaguss commented 10 months ago

> Hmm, AFAIK that can already be done by changing the task args: since `TextGenerationTask` is abstract and we can freely add any instruction for the generation, the scenario where 3 LLMs generate for different contexts is already achievable with the minimal implementation of `TextGenerationTask`. So I'm not sure this is a needed feature. Could you please elaborate a bit on the scenarios and on the cons/limitations of the current approach?

Sure, in the pipeline from the example, the `LLMPool` has 3 different tasks, and the following check done internally (the one in the screenshot) prevents that pool from working:

```python
>>> MagicoderTask() == TextGenerationTask(system_prompt=SYSTEM_PROMPT_MAGICODER)
False
```

Commenting out that piece of code, it works fine, but as discussed with @gabrielmbmb it makes sense to at least check that all the tasks return the same `output_args_names`.

Maybe I missed a more direct way of tackling that.

alvarobartt commented 10 months ago

> Sure, in the pipeline from the example, the `LLMPool` has 3 different tasks, and the following check done internally (the one in the screenshot) prevents that pool from working:
>
> ```python
> >>> MagicoderTask() == TextGenerationTask(system_prompt=SYSTEM_PROMPT_MAGICODER)
> False
> ```

Yes, but I mean that if `MagicoderTask` only changed the `system_prompt`, then providing `TextGenerationTask(system_prompt=SYSTEM_PROMPT_MAGICODER)` would do the work and the check would pass: `isinstance(task, TextGenerationTask)` would match for all the LLMs within the same `LLMPool`, since the task class is shared while the `system_prompt` or any other kwarg from the task may differ.
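A stand-in illustration of that point (the class below is hypothetical, not distilabel's actual implementation): when every LLM in the pool receives a `TextGenerationTask`, only the constructor kwargs differ, so a shared-class check is satisfied:

```python
# Stand-in class: if all LLMs in the pool share TextGenerationTask and only
# kwargs such as system_prompt differ, an isinstance-based check passes.
class TextGenerationTask:
    def __init__(self, system_prompt: str = ""):
        self.system_prompt = system_prompt

tasks = [
    TextGenerationTask(system_prompt="You are a general assistant."),
    TextGenerationTask(system_prompt="You are a coding assistant."),
]

# The prompts differ, but every task is an instance of the same class:
assert all(isinstance(t, TextGenerationTask) for t in tasks)
```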

> Commenting out that piece of code, it works fine, but as discussed with @gabrielmbmb it makes sense to at least check that all the tasks return the same `output_args_names`.
>
> Maybe I missed a more direct way of tackling that.

Aligning that based on the output args may not be right in some scenarios where the generations are unrelated, so I think it's low priority right now. If we want to generate completions for different tasks at the same time, it's probably better to just run separate processes, as running different tasks (even if the `output_args_names` match) within the same `LLMPool` may not have any real use case, right?

plaguss commented 10 months ago

It changes the prompt too, sorry.

```python
# Assumed import paths for distilabel at the time of this issue (may differ):
from dataclasses import dataclass

from distilabel.tasks import TextGenerationTask
from distilabel.tasks.prompt import Prompt

SYSTEM_PROMPT_MAGICODER = (
    "You are an exceptionally intelligent coding assistant that consistently "
    "delivers accurate and reliable responses to user instructions."
)

MAGICODER_PROMPT = SYSTEM_PROMPT_MAGICODER + """

@@ Instruction
{instruction}

@@ Response
"""

@dataclass
class MagicoderTask(TextGenerationTask):
    system_prompt: str = SYSTEM_PROMPT_MAGICODER

    def generate_prompt(self, input: str) -> Prompt:
        return Prompt(
            system_prompt=self.system_prompt,
            formatted_prompt=MAGICODER_PROMPT.format(instruction=input),
        )

# The suggested alternative only changes the system prompt, not the format:
TextGenerationTask(system_prompt=SYSTEM_PROMPT_MAGICODER)
```

Maybe we don't need to change anything in the codebase. I tried using `prompt_formatting_fn` in the `LLM` class but I'm not sure how to deal with it. Is there a simple way to get this done with the code as is?

alvarobartt commented 10 months ago

> It changes the prompt too, sorry.

Then maybe we can just check that the task is either an instance or a subclass, so that a subclass of a given task, e.g. `TextGenerationTask`, is also accepted.
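A sketch of that instance-or-subclass check, again with hypothetical stand-ins for the distilabel task classes:

```python
# Stand-ins for the distilabel tasks; the point is that MagicoderTask is a
# subclass of TextGenerationTask, so an isinstance check accepts it too.
class TextGenerationTask:
    pass

class MagicoderTask(TextGenerationTask):
    pass

def tasks_are_compatible(tasks):
    # Accept the pool when every task is an instance of the base task class,
    # which also covers subclasses such as MagicoderTask.
    return all(isinstance(t, TextGenerationTask) for t in tasks)

assert tasks_are_compatible([TextGenerationTask(), MagicoderTask()])
assert not tasks_are_compatible([TextGenerationTask(), object()])
```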

> Maybe we don't need to change anything in the codebase. I tried using `prompt_formatting_fn` in the `LLM` class but I'm not sure how to deal with it. Is there a simple way to get this done with the code as is?

Indeed, those are unrelated in the end: `prompt_formatting_fn` is there to format the `Prompt` class, i.e. when `Prompt.formatted_prompt` needs to suit a Zephyr-like model, so you can inject the prompt formatting either via a lambda or a custom function, or via `prompt_format`.
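To illustrate the idea (with a namedtuple standing in for distilabel's `Prompt` class, and a Zephyr-like chat template that is only an approximation here), such a formatting function could look like:

```python
from collections import namedtuple

# Stand-in for distilabel's Prompt; only system_prompt and formatted_prompt
# are modeled, matching the attributes mentioned in the thread.
Prompt = namedtuple("Prompt", ["system_prompt", "formatted_prompt"])

def zephyr_formatting_fn(prompt: Prompt) -> str:
    # Render the structured prompt into the chat template a Zephyr-like
    # model expects (the exact template tokens here are an assumption).
    return (
        f"<|system|>\n{prompt.system_prompt}</s>\n"
        f"<|user|>\n{prompt.formatted_prompt}</s>\n"
        f"<|assistant|>\n"
    )

prompt = Prompt(
    system_prompt="You are a helpful assistant.",
    formatted_prompt="Write a haiku about data.",
)
formatted = zephyr_formatting_fn(prompt)
```

Such a callable could then be passed where the LLM accepts a custom prompt formatting function.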

plaguss commented 10 months ago

Ok, I'll create a PR with the new instance check.