Why?
This feature is needed to streamline the creation of machine learning pipelines based on configuration files for different operations. By implementing this PipelineBuilder class, we can dynamically load configurations and create pipelines specific to tasks such as text-to-embedding, embedding-to-text, text segmentation, and metric analysis. This makes pipeline creation more modular, maintainable, and extensible, catering to different datasets and operations in a standardized way.
Use Case:
To facilitate the automated creation of different types of ML pipelines based on configuration files.
To allow easy extension and customization for new operations and datasets.
To improve maintainability by centralizing the pipeline creation logic.
How?
Technical Decisions:
Directory Structure:
The config_dir defaults to huggingface_pipelines/datacards. This directory is assumed to contain subdirectories for each dataset, where operation-specific YAML configuration files are stored.
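A minimal sketch of the assumed layout (dataset names are illustrative, and the per-operation file naming is an assumption based on the description above):

```
huggingface_pipelines/datacards/
├── sample_dataset/
│   ├── text_to_embedding.yaml
│   ├── embedding_to_text.yaml
│   └── text_segmentation.yaml
└── another_dataset/
    └── metric_analysis.yaml
```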
Factory Pattern:
A factory pattern is used to create pipelines. Each operation has a corresponding factory (e.g., TextToEmbeddingPipelineFactory) that is responsible for creating the pipeline based on the configuration.
This pattern allows for easy extension; new operations can be supported by simply adding a new factory class and updating the pipeline_factories dictionary.
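A minimal sketch of the pattern, assuming a simple common factory interface. Only TextToEmbeddingPipelineFactory and the pipeline_factories mapping come from this change; the base class and the construction details are illustrative:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class PipelineFactory(ABC):
    """Illustrative base class: one concrete factory per supported operation."""

    @abstractmethod
    def create_pipeline(self, config: Dict[str, Any]) -> Any:
        """Build and return a pipeline from a loaded YAML configuration."""


class TextToEmbeddingPipelineFactory(PipelineFactory):
    def create_pipeline(self, config: Dict[str, Any]) -> Any:
        # The real factory would construct the text-to-embedding pipeline
        # from the configuration values; elided here.
        ...


# Maps operation names to factories. Supporting a new operation means
# adding a factory class and one entry in this dictionary.
pipeline_factories: Dict[str, PipelineFactory] = {
    "text_to_embedding": TextToEmbeddingPipelineFactory(),
    # "embedding_to_text": EmbeddingToTextPipelineFactory(),
    # "text_segmentation": TextSegmentationPipelineFactory(),
    # "metric_analysis": MetricAnalysisPipelineFactory(),
}
```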
Configuration Loading:
Configuration files are loaded based on the dataset name and operation type. If a configuration file does not exist for a given dataset and operation, a FileNotFoundError is raised with an appropriate error message logged.
YAML is used for configuration files for readability and ease of editing.
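A hedged sketch of the loading step, assuming PyYAML and a config_dir/dataset/operation.yaml path convention (the helper name and message wording are illustrative, not the exact implementation):

```python
import logging
from pathlib import Path
from typing import Any, Dict

import yaml

logger = logging.getLogger(__name__)


def load_config(config_dir: str, dataset_name: str, operation: str) -> Dict[str, Any]:
    """Load the YAML configuration for one dataset/operation pair."""
    config_path = Path(config_dir) / dataset_name / f"{operation}.yaml"
    if not config_path.is_file():
        logger.error(
            "No %s config found for dataset %s at %s", operation, dataset_name, config_path
        )
        raise FileNotFoundError(f"Configuration file not found: {config_path}")
    with config_path.open("r", encoding="utf-8") as f:
        return yaml.safe_load(f)
```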
Pipeline Creation:
The create_pipeline method dynamically loads the configuration and uses the appropriate factory to create the pipeline.
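Putting the pieces together, a sketch of how create_pipeline might look, reusing the hypothetical pipeline_factories and load_config from the sketches above (the error type for unsupported operations is an assumption):

```python
class PipelineBuilder:
    def __init__(self, config_dir: str = "huggingface_pipelines/datacards"):
        self.config_dir = config_dir

    def create_pipeline(self, dataset_name: str, operation: str):
        # Reject operations that have no registered factory.
        if operation not in pipeline_factories:
            raise ValueError(f"Unsupported operation: {operation}")
        # Load the dataset/operation-specific YAML and delegate to the factory.
        config = load_config(self.config_dir, dataset_name, operation)
        return pipeline_factories[operation].create_pipeline(config)
```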
Work In Progress:
If additional operations are needed in the future (e.g., audio_preprocessing), corresponding factory classes and YAML configurations must be created (see the sketch below).
Additional error handling might be necessary for more robust operation (e.g., validating configuration content).
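To illustrate the first point, a hypothetical audio_preprocessing operation would only need a new factory class, one new dictionary entry, and matching YAML files per dataset (all names below are assumed):

```python
class AudioPreprocessingPipelineFactory(PipelineFactory):
    def create_pipeline(self, config):
        # Build the audio preprocessing pipeline from the YAML config; elided.
        ...


# Register the new operation so PipelineBuilder can find it.
pipeline_factories["audio_preprocessing"] = AudioPreprocessingPipelineFactory()
```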
Test Plan
Unit Testing:
Created unit tests for the PipelineBuilder class to verify:
Successful loading of configuration files.
Correct creation of pipelines for each supported operation.
Proper error handling when configuration files are missing or operations are unsupported.
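A hedged sketch of tests along these lines, assuming pytest and the hypothetical sketches above (dataset names, config contents, and paths are illustrative):

```python
import pytest


def test_creates_text_to_embedding_pipeline(tmp_path):
    # Write a minimal YAML config into the expected location.
    dataset_dir = tmp_path / "sample_dataset"
    dataset_dir.mkdir()
    (dataset_dir / "text_to_embedding.yaml").write_text("batch_size: 32\n")

    builder = PipelineBuilder(config_dir=str(tmp_path))
    pipeline = builder.create_pipeline("sample_dataset", "text_to_embedding")
    assert pipeline is not None


def test_missing_config_raises_file_not_found(tmp_path):
    builder = PipelineBuilder(config_dir=str(tmp_path))
    with pytest.raises(FileNotFoundError):
        builder.create_pipeline("unknown_dataset", "text_to_embedding")
```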
Command Line Testing:
Example snippet to test pipeline creation for a text-to-embedding operation:
builder = PipelineBuilder(config_dir="path/to/configs")
pipeline = builder.create_pipeline(dataset_name="sample_dataset", operation="text_to_embedding")
assert pipeline is not None, "Pipeline creation failed"
Tested edge cases like missing configuration files and unsupported operations to ensure the class behaves as expected.
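Similarly, the unsupported-operation case can be exercised directly (the exact exception type depends on the implementation; ValueError is assumed here, as in the sketches above):

```python
import pytest

builder = PipelineBuilder(config_dir="path/to/configs")
with pytest.raises(ValueError):
    builder.create_pipeline(dataset_name="sample_dataset", operation="unsupported_operation")
```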
Integration Testing:
Integrated the PipelineBuilder into the main application workflow and verified that pipelines are correctly built and executed for real datasets.
Verified logging output to ensure errors and important information are correctly logged.
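One hedged way to check this in tests is pytest's caplog fixture (sketch; assumes the error-logging behaviour described above and the hypothetical PipelineBuilder from the earlier sketches):

```python
import logging

import pytest


def test_missing_config_error_is_logged(tmp_path, caplog):
    builder = PipelineBuilder(config_dir=str(tmp_path))
    with caplog.at_level(logging.ERROR):
        with pytest.raises(FileNotFoundError):
            builder.create_pipeline("unknown_dataset", "text_to_embedding")
    # The logged error should mention the dataset that had no configuration.
    assert any("unknown_dataset" in record.getMessage() for record in caplog.records)
```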