argilla-io / distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
https://distilabel.argilla.io
Apache License 2.0

[FEATURE] Add defaults to Steps and Tasks so they can be more easily connected #802

Open dvsrepo opened 2 months ago

dvsrepo commented 2 months ago

[Updated: Added suggestion 2 for UltraFeedback] This is an issue to discuss the defaults for (some) Steps and Tasks.

The main idea is to think about the most frequent uses and paths for certain components. I'll start:

  1. `CombineColumns`: In my experience, this step is most likely used after several generation steps (which by default output `generation` and `model_name`). Would it make sense to set these as the default values for `columns`? This would let us go from:

    [text_generation, text_generation2] >> CombineColumns(columns=["generation", "model_name"])

to

    [text_generation, text_generation2] >> CombineColumns()



This might not seem like a very impactful change, but I think small changes like this will compound and make usage and onboarding more intuitive and approachable. The most important thing is not shorter code but fewer things to understand and memorize for repetitive actions.

  2. `UltraFeedback`: The most likely path would be after `CombineColumns`, so why not set the default to `merged_generation` so I can connect `[text_gen1, text_gen2] >> CombineColumns() >> UltraFeedback(llm=...)`?
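To illustrate the two suggestions above, here is a minimal sketch in plain Python of how default column names could let two steps connect without any arguments. It does not use distilabel: `combine_columns`, `downstream_step`, and `DEFAULT_COLUMNS` are hypothetical stand-ins, and the `merged_<column>` output naming is assumed to follow the `merged_generation` name mentioned above.

```python
from typing import Any, Dict, List, Optional, Sequence

# Hypothetical defaults proposed in this issue; not distilabel's actual signature.
DEFAULT_COLUMNS = ("generation", "model_name")


def combine_columns(
    *rows: Dict[str, Any],
    columns: Sequence[str] = DEFAULT_COLUMNS,
    output_columns: Optional[Sequence[str]] = None,
) -> Dict[str, Any]:
    """Merge the same columns from several upstream rows into lists."""
    if not rows:
        raise ValueError("at least one upstream row is required")
    if output_columns is None:
        # Assumed naming scheme: merged_<column>
        output_columns = [f"merged_{col}" for col in columns]
    if len(output_columns) != len(columns):
        raise ValueError("columns and output_columns must have the same length")
    # Keep the columns that are not merged (e.g. the instruction) from the first row.
    merged = {k: v for k, v in rows[0].items() if k not in columns}
    for col, out in zip(columns, output_columns):
        merged[out] = [row[col] for row in rows]
    return merged


def downstream_step(row: Dict[str, Any], input_column: str = "merged_generation") -> int:
    """Toy consumer standing in for a rating task: reads the merged column by default."""
    return len(row[input_column])


row_a = {"instruction": "Say hi", "generation": "Hi!", "model_name": "model-a"}
row_b = {"instruction": "Say hi", "generation": "Hello.", "model_name": "model-b"}

combined = combine_columns(row_a, row_b)  # common path: no arguments needed
n_candidates = downstream_step(combined)  # picks up "merged_generation" by default
print(combined)
print(n_candidates)
```

Because the default output of one step matches the default input of the next, the common path needs no column names at all, while the explicit `columns=` and `input_column=` arguments remain available for less common paths.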

I'd love to hear your thoughts, @gabrielmbmb @plaguss, as I might be biased about the most likely paths. I'm not trying to give the final answer on what the defaults should be, just the idea: thinking about the most likely connections across steps and defining the defaults accordingly can simplify usage.

gabrielmbmb commented 2 months ago

Yes, I think it makes sense to have these sensible defaults. Something related I was thinking about is to also have auxiliary functions that create several steps at once, as a kind of syntactic sugar:

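from typing import List

from distilabel.pipeline import Pipeline

# Hypothetical helper: would build one generation task per model name.
def generate_with(models: List[str]): ...

with Pipeline(name="my-pipe") as pipeline:
    load_data = ...

    load_data >> generate_with(["llama3", "mistral", "gpt4"])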

but I'm not 100% sure about it, because it can hide many details and confuse the user.
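To make the trade-off concrete, here is a rough sketch of what such a helper could expand to. It is plain Python rather than distilabel code: `GenerationStep` is a hypothetical stand-in for a text-generation task, and the `text_generation_<i>` naming is an assumption made only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GenerationStep:
    """Hypothetical stand-in for a text-generation task; not a distilabel class."""

    name: str
    model: str


def generate_with(models: List[str]) -> List[GenerationStep]:
    """Create one generation step per model name, with auto-generated step names."""
    return [
        GenerationStep(name=f"text_generation_{i}", model=model)
        for i, model in enumerate(models)
    ]


steps = generate_with(["llama3", "mistral", "gpt4"])
for step in steps:
    print(step.name, step.model)
```

The helper saves three near-identical step definitions, but everything else about each step (names, generation parameters, output columns) is decided inside the function, which is exactly the kind of hidden detail that could confuse users.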