argilla-io / distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
https://distilabel.argilla.io
Apache License 2.0
1.7k stars, 132 forks

[FEATURE] Align API of standalone usage of `Step.process()` with `Pipeline.run()` `parameters` arguments #743

Open davidberenstein1957 opened 5 months ago

davidberenstein1957 commented 5 months ago

Is your feature request related to a problem? Please describe.
Within the docs, we do advertise that people can use Steps as standalone components, which proves useful for quick demos, prompt engineering, etc., but the standalone API doesn't accept parameters the way Pipeline.run() does through its parameters argument. Aligning the two would help with iteration and building demos.

Describe the solution you'd like
I would expect Step.process() to take a parameters argument, or a separate method for standalone usage, like Step.run(), to be defined that does (assuming Step.process() is used within Pipeline.run() and changes might be difficult).
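A minimal sketch of what such a standalone run() wrapper could look like. Everything here is hypothetical: ToyStep is a stand-in written for this illustration, not distilabel's Step class, and its set_runtime_parameters and process methods only imitate the behaviour described in this thread.

```python
from typing import Any, Dict, Iterator, List, Optional


class ToyStep:
    """Hypothetical stand-in for a Step; not the library's implementation."""

    def __init__(self, temperature: float = 1.0) -> None:
        self.temperature = temperature

    def set_runtime_parameters(self, parameters: Dict[str, Any]) -> None:
        # Toy behaviour: copy each known parameter onto the instance.
        for name, value in parameters.items():
            if not hasattr(self, name):
                raise ValueError(f"unknown runtime parameter: {name}")
            setattr(self, name, value)

    def process(self, inputs: List[Dict[str, Any]]) -> Iterator[List[Dict[str, Any]]]:
        # Yield one batch, tagging each row with the current temperature.
        yield [{**row, "temperature": self.temperature} for row in inputs]

    def run(
        self,
        inputs: List[Dict[str, Any]],
        parameters: Optional[Dict[str, Any]] = None,
    ) -> List[Dict[str, Any]]:
        # Assumed convenience wrapper: apply parameters, then return the first batch.
        if parameters:
            self.set_runtime_parameters(parameters)
        return next(self.process(inputs))


step = ToyStep()
batch = step.run([{"instruction": "hi"}], parameters={"temperature": 0.7})
print(batch)
```

With a wrapper like this, process() keeps its existing signature for use inside a Pipeline, and only the standalone entry point accepts parameters.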

Describe alternatives you've considered
Looking into the source code and defining/overwriting parameters through attribute assignment before calling Step.process().

Additional context
N.A.

gabrielmbmb commented 5 months ago

Hi, it's true that in the docs we may show that Steps and Tasks can be used as standalone components, but they are designed to work within a Pipeline. That's not bad, but we have to make clear that using a Step as a standalone component should be only for quick testing or for understanding how the step works.

Step runtime parameters can be set in two ways: via attributes or via the set_runtime_parameters method. I think the problem is specifically with Tasks, in which the parameters of the LLM.generate method are programmatically set in the library as runtime parameters of the task (this is because each LLM has different generation parameters) but they are not attributes of the Task. Therefore, the only option to set those parameters is the set_runtime_parameters method.

We can improve the docs around the Tasks to make clear that, to set the generation parameters, the set_runtime_parameters method has to be used:

task = MyTask(llm=MyLLM(...))
task.load()
task.set_runtime_parameters({"llm": {"temperature": 0.7}})

or allow passing parameters in the process method of the Tasks, but I'm not so convinced by this option because, as I mentioned before, the Steps and Tasks are meant to be used within a Pipeline, not as standalone components.
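To illustrate why generation parameters need set_runtime_parameters rather than attribute assignment, here is a toy model of the nesting. ToyLLM and ToyTask are hypothetical stand-ins written for this sketch; the real distilabel internals may differ, and the dictionary shape simply mirrors the one quoted above.

```python
from typing import Any, Dict


class ToyLLM:
    """Hypothetical LLM stand-in that only records its generation kwargs."""

    def __init__(self) -> None:
        self.generation_kwargs: Dict[str, Any] = {}


class ToyTask:
    """Hypothetical Task stand-in; it has no `temperature` attribute itself."""

    def __init__(self, llm: ToyLLM) -> None:
        self.llm = llm

    def set_runtime_parameters(self, parameters: Dict[str, Any]) -> None:
        # Route the nested "llm" entry to the LLM's generation kwargs.
        llm_parameters = parameters.get("llm", {})
        self.llm.generation_kwargs.update(llm_parameters)


task = ToyTask(llm=ToyLLM())
task.set_runtime_parameters({"llm": {"temperature": 0.7}})
print(task.llm.generation_kwargs)
```

In this toy model the task has no temperature attribute to assign, so the nested dictionary is the only way to reach the value.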

davidberenstein1957 commented 5 months ago

@gabrielmbmb, I think it would be clearer to unify the API usage and still emphasise that standalone use is intended for testing, instead of introducing something different like set_runtime_parameters.

I assumed that people who want to do something advanced and know what they are doing will use the Pipeline class, but unifying the API would make things a bit easier for people who want to do something simple, like a basic demo or test, IMO.

Not sure if there are technical reasons behind it or a more philosophical one, but for me the first flow is easier, assuming people are going to abuse Steps during testing anyhow.

task = MyTask(llm=MyLLM(...))
task.load()
task.process(
    parameters={"llm": {"temperature": 0.7}}
)

vs

task = MyTask(llm=MyLLM(...))
task.load()
task.set_runtime_parameters({"llm": {"temperature": 0.7}})
task.process()

gabrielmbmb commented 4 months ago

I just realised that someone can do:

task = MyTask(llm=MyLLM(generation_kwargs={"temperature": 0.7, "max_new_tokens": 2048}))
task.load()

I think that should be enough if someone wants to use the task as a standalone component. We can update the examples to include generation_kwargs.