argilla-io / distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
https://distilabel.argilla.io
Apache License 2.0
1.45k stars 111 forks

Optionally include the pipeline script in the hub when pushing your distiset #762

Closed plaguss closed 2 months ago

plaguss commented 3 months ago

Description

This PR adds the option to push the script of the pipeline being run to the Hugging Face Hub (the option defaults to False, to avoid potential errors):

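from distilabel.pipeline import Pipeline

with Pipeline() as pipe:
    ...
# Run the pipeline through the same name it was bound to in the `with` block.
distiset = pipe.run(use_cache=False)
# include_script=True also uploads the script that defined the pipeline.
distiset.push_to_hub("plaguss/pipe_nothing_test", include_script=True)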

This simplifies sharing the code that created the pipeline, as well as custom steps.

Example script.

If the script was uploaded to the Hub, an entry is written to the repo's README.md to show it:

image
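
As a rough illustration of where an uploaded script can be found afterwards, the sketch below builds the raw URL of a script stored in a dataset repo, following the URL pattern of the remote example in this PR. The helper name `script_raw_url` is hypothetical and not part of distilabel, and it assumes the script sits at the repo root on the `main` branch.

```python
from pathlib import PurePosixPath

HUB_DATASETS_URL = "https://huggingface.co/datasets"


def script_raw_url(repo_id: str, script_path: str, revision: str = "main") -> str:
    """Hypothetical helper: raw URL of a script uploaded to a dataset repo.

    Assumes the script is stored at the repo root under its own file name.
    """
    # Only the file name is kept; any local directories are dropped.
    filename = PurePosixPath(script_path).name
    return f"{HUB_DATASETS_URL}/{repo_id}/raw/{revision}/{filename}"


print(script_raw_url("plaguss/pipe_nothing_test", "path/to/pipe_nothing.py"))
```

A URL of this shape is what the remote variant of the CLI command in this PR takes as input.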

The CLI has also been updated to allow running remote (or local) scripts, just as it does with pipelines defined in a pipeline.yaml config file:

distilabel pipeline run --trust-code "https://huggingface.co/datasets/plaguss/pipe_nothing_test/raw/main/pipe_nothing.py"
distilabel pipeline run --trust-code "path/to/pipe_nothing.py"
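
The two commands differ only in whether the argument is a URL or a local path. A minimal sketch of how such an argument could be told apart is shown below; `is_remote_script` is a hypothetical name, and this is not necessarily how the distilabel CLI makes the decision.

```python
from urllib.parse import urlparse


def is_remote_script(script: str) -> bool:
    """Return True if `script` looks like an http(s) URL rather than a local path."""
    parsed = urlparse(script)
    # A remote script needs both an http(s) scheme and a host name.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


remote = "https://huggingface.co/datasets/plaguss/pipe_nothing_test/raw/main/pipe_nothing.py"
local = "path/to/pipe_nothing.py"
print(is_remote_script(remote), is_remote_script(local))
```

Checking the scheme explicitly keeps inputs such as Windows drive paths (`C:\...`) from being mistaken for URLs.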

codspeed-hq[bot] commented 3 months ago

CodSpeed Performance Report

Merging #762 will not alter performance

Comparing push-pipeline-py (cbcf0bf) with develop (647d040)

Summary

✅ 1 untouched benchmarks