argilla-io / distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
https://distilabel.argilla.io
Apache License 2.0
1.45k stars 111 forks

Add cache at `Step` level #766

Open plaguss opened 3 months ago

plaguss commented 3 months ago

Description

This PR implements caching at the step level.

Previously, we computed a single signature for the whole pipeline, and when that signature changed, we recomputed everything. Now the idea is to compute a signature per step and, when something changes, recompute only the steps whose own signature has changed or that follow a step whose signature has changed. So for a pipeline A -> B -> C, if step B changes, we recompute only B and C, starting from the data we already had from A.
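The per-step invalidation described above can be sketched as follows. This is a minimal illustration for a linear pipeline, not the code in this PR: `step_signature` and `steps_to_recompute` are hypothetical helpers, and hashing the step name plus its parameters is an assumption about what goes into a signature.

```python
import hashlib
import json


def step_signature(name: str, params: dict) -> str:
    """Hash a step's name and parameters into a stable signature (assumed inputs)."""
    payload = json.dumps({"name": name, "params": params}, sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def steps_to_recompute(order: list[str], old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Return the steps of a linear pipeline that must run again.

    A step is recomputed if its own signature changed, or if any step
    before it in the pipeline is being recomputed.
    """
    stale = False
    result = []
    for name in order:
        if old.get(name) != new[name]:
            stale = True  # everything from here on is invalidated
        if stale:
            result.append(name)
    return result


# Pipeline A -> B -> C where only B's parameters change.
order = ["A", "B", "C"]
old = {n: step_signature(n, {"input_batch_size": 50}) for n in order}
new = dict(old)
new["B"] = step_signature("B", {"input_batch_size": 10})

print(steps_to_recompute(order, old, new))  # ['B', 'C']
```

With unchanged signatures the function returns an empty list, which corresponds to reusing all cached outputs.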

New case we control with this change: a step can opt out of caching through `use_cache`:

step_b = MyStep(
    name="step_b",
    input_batch_size=10,
    use_cache=False,  # do not reuse cached outputs for this step
)

Note: This affects how we read previously serialized parquet files. If any step has `use_cache` set to False, then even for a pipeline that hasn't changed, we won't read the previously serialized content.
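One possible reading of the note, sketched under stated assumptions: in an otherwise unchanged linear pipeline, cached outputs are reused only up to the first step with `use_cache=False`, and that step plus everything after it runs again. `StepConfig` and `reusable_steps` are hypothetical names for illustration and are not part of distilabel.

```python
from dataclasses import dataclass


@dataclass
class StepConfig:
    """Hypothetical stand-in for a step's cache-related settings."""
    name: str
    use_cache: bool = True


def reusable_steps(steps: list[StepConfig]) -> list[str]:
    """Names of leading steps whose cached outputs could still be loaded.

    Assumes a linear pipeline with unchanged signatures: reuse stops at the
    first step that sets use_cache=False, since later steps depend on it.
    """
    reusable = []
    for step in steps:
        if not step.use_cache:
            break
        reusable.append(step.name)
    return reusable


pipeline = [
    StepConfig("step_a"),
    StepConfig("step_b", use_cache=False),
    StepConfig("step_c"),
]

print(reusable_steps(pipeline))  # ['step_a']
```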

Closes #651

codspeed-hq[bot] commented 3 months ago

CodSpeed Performance Report

Merging #766 will improve performances by 75.94%

Comparing cache-per-step (bb28b0b) with develop (a178109)

Summary

⚡ 1 improvements

Benchmarks breakdown

| Benchmark | develop | cache-per-step | Change |
|---|---|---|---|
| test_cache_time | 394.7 ms | 224.3 ms | +75.94% |

github-actions[bot] commented 2 months ago

Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-766/