argilla-io / distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
https://distilabel.argilla.io
Apache License 2.0

[FEATURE] Make LLM runtime parameters available in LLM instantiation #803

Open dvsrepo opened 1 month ago

dvsrepo commented 1 month ago

I remember we've briefly discussed this in the past and there were some actions, but I think this can have a positive impact on usability.

Go from:

# Imports shown for completeness (distilabel 1.x paths):
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import CombineColumns
from distilabel.steps.tasks import TextGeneration

with Pipeline(
    name="llama3-text-generation-pipeline",
) as pipeline:
    # ... (load_dataset and generation_with3_1 defined here, omitted for brevity)
    generation_with3 = TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        ),
    )

    load_dataset >> [generation_with3, generation_with3_1] >> CombineColumns(columns=["generation", "model_name"])

if __name__ == "__main__":
    distiset = pipeline.run(
        parameters={
            generation_with3.name: {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 1.0,
                        "do_sample": True,
                        "frequency_penalty": 0.1,
                    }
                }
            }
        },
        use_cache=False,
    )

to:

with Pipeline(
    name="llama3-text-generation-pipeline",
) as pipeline:
    # ... (same setup as above)
    generation_with3 = TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            # proposed: generation params directly at init
            temperature=1.0,
            do_sample=True,
            frequency_penalty=0.1,
        ),
    )

    load_dataset >> [generation_with3, generation_with3_1] >> CombineColumns(columns=["generation", "model_name"])

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)

This has several benefits:

  1. I don't need to configure my LLM in two places and in two different ways (dict config vs. init).
  2. It helps users see which params each LLM accepts without digging into the internals.
  3. A less fragmented API.
  4. We wrap the generate/agenerate call anyway, so why not let users configure everything in one place (the LLM init)?

I understand that in this specific case (with the use_openai_client param) it might be challenging to expose all possible params, but then maybe there are other design decisions to make.

Note that if someone runs the pipeline with parameters, those will still override the values passed at init (I recall this was one of the arguments in favor of runtime params: reusability).
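To make the intended precedence concrete, here's a minimal sketch (the init-time temperature is the proposed API, not something that exists today; the generation_kwargs nesting is how runtime parameters are addressed currently):

# Sketch only: init-time generation params are the *proposed* API.
llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3-70B-Instruct",
    temperature=1.0,  # proposed default, set at init
)

# Runtime parameters passed to pipeline.run() would still win:
distiset = pipeline.run(
    parameters={
        generation_with3.name: {
            "llm": {"generation_kwargs": {"temperature": 0.7}}  # overrides 1.0
        }
    },
)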

WDYT @gabrielmbmb @plaguss ? I'm especially interested in potential issues for (1) advanced users, and (2) maintenance.

gabrielmbmb commented 1 month ago

Hi @dvsrepo!

The main reason for not having the generation parameters as attributes is that, in the initial design, we decided to put them in the generate/agenerate methods. That way we wouldn't shoot ourselves in the foot for some cases we imagined at the time, like different generation parameters per row (something we didn't end up implementing).
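For context, this is roughly the per-row case we had in mind (purely hypothetical; neither this generate signature nor per-row params were ever implemented):

# Hypothetical illustration only: generation params varying per row,
# which is what motivated keeping them as generate() arguments.
rows = [
    {"instruction": "Write a haiku", "temperature": 1.2},
    {"instruction": "Summarize this text", "temperature": 0.2},
]
for row in rows:
    llm.generate(
        inputs=[[{"role": "user", "content": row["instruction"]}]],
        temperature=row["temperature"],  # per-row value
    )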

That said, one can already define the generation parameters at initialisation using the generation_kwargs attribute of the LLM:

generation_with3 = TextGeneration(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        generation_kwargs={
            "temperature": 1.0,
            "do_sample": True,
            "frequency_penalty": 0.1,
        },
    ),
)

We don't have this documented, so a good start is probably to document this attribute. We should also check whether we foresee a use case that would benefit from defining the generation parameters per row, i.e. whether it's worth keeping them as arguments of the generate/agenerate methods.

Something else that could be done is to keep the generation parameters as arguments of the methods and also expose them as attributes, so we keep the best of both worlds at the cost of a bit more maintenance (easy to maintain, tbh). WDYT?
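To make the idea concrete, here's a rough sketch of the merging (names are illustrative, not actual distilabel code):

# Illustrative sketch only: init-time values act as defaults,
# per-call arguments override them.
class SketchLLM:
    def __init__(self, **default_generation_kwargs):
        self.default_generation_kwargs = default_generation_kwargs

    def generate(self, inputs, **overrides):
        # per-call kwargs take precedence over init-time defaults
        kwargs = {**self.default_generation_kwargs, **overrides}
        return self._call_endpoint(inputs, **kwargs)

    def _call_endpoint(self, inputs, **kwargs):
        ...  # stand-in for the actual request to the backend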

dvsrepo commented 1 month ago

Thanks @gabrielmbmb !

+1 to briefly documenting generation_kwargs as a quick step, but it still has some of the limitations stated above: two ways of configuring the same object, needing to know the internals (i.e. that there's a generate method being called), and a dict makes the params less intuitive/discoverable.

Something else that could be done is to keep the generation parameters as arguments of the methods and also expose them as attributes, so we keep the best of both worlds at the cost of a bit more maintenance (easy to maintain, tbh). WDYT?

Do you mean being able to define those both at init, e.g. InferenceEndpointsLLM(temperature=1.0), and also pass them as parameters to the pipeline.run call? If so, that's exactly what I was suggesting.