argilla-io / distilabel

⚗️ distilabel is a framework for synthetic data and AI feedback for AI engineers that require high-quality outputs, full data ownership, and overall efficiency.
https://distilabel.argilla.io
Apache License 2.0
1.12k stars, 70 forks

Add new step `CombineKeys` #747

Closed. plaguss closed this 1 week ago

plaguss commented 1 week ago

Description

This PR adds a new step, `CombineKeys`, that merges the values of several keys in a dict into a single key. The main use case I found was the following:

In a Pipeline, I have a Task that generates instructions (say GenerateSentencePair), and I want more variations of the generated positives for diversity. A second task generates those extra examples, so afterwards I want to combine the original positive key with the new instructions (say, a list of instructions called positives). CombineKeys merges those keys into a single list under one output key:

from distilabel.steps import CombineKeys

combiner = CombineKeys(
    keys=["queries", "multiple_queries"],
    output_key="queries",
)
combiner.load()

result = next(
    combiner.process(
        [
            {
                "queries": "How are you?",
                "multiple_queries": ["What's up?", "Everything ok?"]
            }
        ],
    )
)
# >>> result
# [{'queries': ['How are you?', "What's up?", 'Everything ok?']}]
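
For readers who want to see the merging behaviour without installing distilabel, here is a minimal plain-Python sketch that reproduces the result shown above. It is not the step's actual implementation: the helper name `combine_keys` is hypothetical, and dropping the source keys from the output is an assumption inferred from the example result, where only `queries` remains.

```python
from typing import Any, Dict, List


def combine_keys(row: Dict[str, Any], keys: List[str], output_key: str) -> Dict[str, Any]:
    """Hypothetical helper: merge the values under `keys` into one flat list."""
    result = dict(row)  # shallow copy so the input row is left untouched
    combined: List[Any] = []
    for key in keys:
        # Assumption: source keys are removed from the output row,
        # matching the example result in this PR description.
        value = result.pop(key)
        # Wrap non-list values so strings and lists concatenate uniformly.
        combined.extend(value if isinstance(value, list) else [value])
    result[output_key] = combined
    return result


row = {
    "queries": "How are you?",
    "multiple_queries": ["What's up?", "Everything ok?"],
}
merged = combine_keys(row, keys=["queries", "multiple_queries"], output_key="queries")
print(merged)
```

In this sketch a missing key raises `KeyError`; whether the real step behaves the same way is not stated in this PR description.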

codspeed-hq[bot] commented 1 week ago

CodSpeed Performance Report

Merging #747 will not alter performance

Comparing combine-keys (fe14ffd) with combine-keys (6c5fde3)

Summary

✅ 1 untouched benchmarks