argilla-io / distilabel

⚗️ distilabel is a framework for synthetic data and AI feedback for AI engineers that require high-quality outputs, full data ownership, and overall efficiency.
https://distilabel.argilla.io
Apache License 2.0

Add `requirements` list for `Pipeline` #720

Closed plaguss closed 6 days ago

plaguss commented 3 weeks ago

Description

This PR adds a new `requirements` attribute to `BasePipeline` to keep track of the dependencies needed to run a `Pipeline`. The output of `pipeline.dump()` now includes a new key with the requirements, if any.

Requirements can be declared at the `Pipeline` level. For custom steps, the ideal approach is to declare them via the `@requirements` decorator, which avoids making the step definition more verbose.
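To make the idea concrete, here is a minimal sketch of how a class decorator and a pipeline-level list could be combined and surfaced in a dump. It is not the code from this PR: `ToyPipeline`, `ToyStep`, `add_step`, `collect_requirements`, and the `"requirements"` key name are all hypothetical stand-ins chosen for illustration.

```python
from typing import Any, Callable, Dict, List, Optional, Type, TypeVar

T = TypeVar("T")


def requirements(reqs: List[str]) -> Callable[[Type[T]], Type[T]]:
    """Hypothetical class decorator that records `reqs` on the decorated class."""

    def decorator(cls: Type[T]) -> Type[T]:
        # Merge with anything already on the class so inherited entries are kept.
        inherited = getattr(cls, "requirements", [])
        setattr(cls, "requirements", sorted(set(inherited) | set(reqs)))
        return cls

    return decorator


class ToyPipeline:
    """Stand-in for a pipeline; this is NOT distilabel's BasePipeline."""

    def __init__(self, name: str, requirements: Optional[List[str]] = None) -> None:
        self.name = name
        self.requirements = list(requirements or [])
        self.steps: List[Any] = []

    def add_step(self, step: Any) -> None:
        self.steps.append(step)

    def collect_requirements(self) -> List[str]:
        # Union of pipeline-level requirements and those declared on each step.
        collected = set(self.requirements)
        for step in self.steps:
            collected.update(getattr(step, "requirements", []))
        return sorted(collected)

    def dump(self) -> Dict[str, Any]:
        data: Dict[str, Any] = {
            "name": self.name,
            "steps": [type(step).__name__ for step in self.steps],
        }
        reqs = self.collect_requirements()
        if reqs:  # the key is only written when there is something to record
            data["requirements"] = reqs
        return data


@requirements(["distilabel>=0.0.1"])
class ToyStep:
    pass


pipeline = ToyPipeline("unit-test-pipeline", requirements=["random_requirement"])
pipeline.add_step(ToyStep())
dumped = pipeline.dump()
print(dumped["requirements"])
```

Keeping the declaration in a decorator means the step class body stays focused on `inputs`, `outputs`, and `process`, which is the verbosity concern mentioned above.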

Before running, the pipeline raises a `ValueError` that lists the dependencies not already installed in your environment.

@requirements(["distilabel>=0.0.1"])
class CustomStep(Step):
    @property
    def inputs(self) -> List[str]:
        return ["instruction"]

    @property
    def outputs(self) -> List[str]:
        return ["response"]

    def process(self, inputs: StepInput) -> StepOutput:  # type: ignore
        for input in inputs:
            input["response"] = "unit test"
        yield inputs

with BasePipeline(
    name="unit-test-pipeline", requirements=["random_requirement"]
) as pipeline:
    gen_step = DummyGeneratorStep()
    step1_0 = DummyStep()
    step2 = CustomStep()

    gen_step >> step1_0 >> step2
pipeline.run()
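
For readers curious what such a pre-run check might look like, the following is a rough sketch using only the standard library. It is an assumption about the general technique, not the implementation in this PR: `missing_requirements` and `ensure_requirements` are hypothetical names, and the sketch only checks whether a distribution is installed, ignoring version specifiers and extras.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Iterable, List


def missing_requirements(reqs: Iterable[str]) -> List[str]:
    """Return the requirement strings whose distribution is not installed."""
    missing = []
    for req in reqs:
        # Keep only the distribution name; anything after a specifier,
        # extras bracket, or environment marker is ignored in this sketch.
        name = re.split(r"[\s<>=!~;\[]", req.strip(), maxsplit=1)[0]
        try:
            version(name)
        except PackageNotFoundError:
            missing.append(req)
    return missing


def ensure_requirements(reqs: Iterable[str]) -> None:
    """Raise ValueError naming every requirement that is not installed."""
    missing = missing_requirements(reqs)
    if missing:
        raise ValueError(
            "Missing dependencies: " + ", ".join(missing)
            + ". Install them before running the pipeline."
        )


try:
    ensure_requirements(["random-requirement-that-is-not-installed-xyz>=1.0"])
except ValueError as err:
    print(err)
```

Failing before any step executes keeps the error close to its cause, instead of surfacing as an `ImportError` partway through a run.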

codspeed-hq[bot] commented 3 weeks ago

CodSpeed Performance Report

Merging #720 will not alter performance

Comparing pipeline-requirements (77d8211) with develop (63ee8c5)

Summary

✅ 1 untouched benchmarks