argilla-io / distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
https://distilabel.argilla.io
Apache License 2.0
1.46k stars 113 forks source link

[FEATURE] Naming of Pipeline and pipeline #142

Closed ignacioct closed 9 months ago

ignacioct commented 10 months ago

Hey there guys! While creating the tutorial for the end2end example using Notus, I've encountered the difference between Pipeline and pipeline from distilabel.pipeline. Pipeline is the initializer of the Pipeline class, and the pipeline() function creates a Pipeline instance with the provided LLMs for a given task, which is useful whenever you want to use a pre-defined Pipeline for a given task, or if you want to create a custom Pipeline for a given task.

However, I find these two names kind of colliding, especially when there are a lot of pipeline = Pipeline(...) that would create a pipeline variable, the same name as the pipeline method. What do you guys think about changing the name to something like predefined_pipeline()? Not the best name IMO but we can come up with a few suggestions. We would need to take care with the refactoring, as we do not have testing coverage yet, to not break anything.

@plaguss @dvsrepo @alvarobartt @gabrielmbmb

plaguss commented 10 months ago

Hi @ignacioct! I wasn't there with the initial definition, but I think it gets inspiration from the transformers definitions: pipeline and Pipeline. With transformers we don't really instantiate the Pipeline directly as we do here, it's a base class, but anyway their approach is to call the result from pipeline pipe, and maybe that would be enough (for example, pipe = Pipeline(...)/pipe = pipeline(...). I do get that it could lead to naming collision too, but just to see a different point

alvarobartt commented 10 months ago

Hi here @ignacioct thanks for opening this issue!

We wanted to add a pipeline method to be aligned with HuggingFace's approach on supporting a wide variety of pre-defined tasks via the same method, while I understand it can be a bit confusing, IMHO it's nice to leave it as is also thinking on what's next, so we can come back to this once we have more tasks included within the pipeline to actually justify its worth keeping, other than that the method could be removed completely as of now.

So feel free to directly use Pipeline class for most of the examples, and we can try to mock what's next for the pipeline method and how it can be useful (more than now).

Thanks again! Feel free to close the issue if you agree and we can re-open it on a later stage, or just continue the discussion if you don't 😀