argilla-io / distilabel

⚗️ distilabel is a framework for synthetic data and AI feedback for AI engineers that require high-quality outputs, full data ownership, and overall efficiency.
https://distilabel.argilla.io
Apache License 2.0

Add load stages #760

Closed gabrielmbmb closed 8 hours ago

gabrielmbmb commented 6 days ago

Description

This PR adds a new feature in which the steps of a pipeline are divided into several load stages, marked by the position of the GlobalSteps in the pipeline. A GlobalStep receives all the data at once (in one batch), so all of its previous steps must have finished before it can process the data. It is therefore not necessary to load a GlobalStep until its previous steps have finished their execution, which saves some resources in the meantime. Likewise, it is not necessary to load the successor steps of a GlobalStep until it has finished its execution. The load stages are thus defined by the position of the GlobalSteps in a pipeline:

  1. Previous steps of a GlobalStep will be grouped into a stage.
  2. Each GlobalStep will have its own stage.
  3. Successors of a GlobalStep will be grouped into a stage.
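
To illustrate the three rules above, here is a minimal sketch of the grouping for a simple linear pipeline. It is not the implementation in this PR: the function `split_into_load_stages` and the step names are hypothetical, and it assumes the steps are already given in execution order rather than as a full DAG.

```python
from typing import Iterable, List, Set


def split_into_load_stages(
    ordered_steps: Iterable[str], global_steps: Set[str]
) -> List[List[str]]:
    """Group step names into load stages, splitting at every global step.

    Hypothetical helper for illustration only. `ordered_steps` is assumed
    to be in execution (topological) order already.
    """
    stages: List[List[str]] = []
    current: List[str] = []
    for step in ordered_steps:
        if step in global_steps:
            # Rule 1: close the stage holding the steps before the global step.
            if current:
                stages.append(current)
                current = []
            # Rule 2: the global step gets a stage of its own.
            stages.append([step])
        else:
            current.append(step)
    # Rule 3: the steps after the last global step form the final stage
    # (or the only stage, if the pipeline has no global step).
    if current:
        stages.append(current)
    return stages


# Made-up step names; "deduplicate" plays the role of the GlobalStep.
pipeline = ["load_data", "generate", "deduplicate", "score", "push"]
stages = split_into_load_stages(pipeline, global_steps={"deduplicate"})
print(stages)
# [['load_data', 'generate'], ['deduplicate'], ['score', 'push']]
```

With this grouping, only the steps of the current stage need to be loaded at any time, so `deduplicate` is loaded only after `load_data` and `generate` have finished, and `score` and `push` only after `deduplicate` has finished.
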
codspeed-hq[bot] commented 6 days ago

CodSpeed Performance Report

Merging #760 will not alter performance

Comparing steps-load-stages (b5605fb) with develop (91bc0fa)

Summary

✅ 1 untouched benchmarks