facebookresearch / SONAR

SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders.

#5 Feat: Implement Pipeline Builder for ease of pipeline creation #37

Open botirk38 opened 3 months ago

botirk38 commented 3 months ago

Why?

This feature streamlines the creation of machine learning pipelines from per-operation configuration files. With the PipelineBuilder class, we can load configurations dynamically and create pipelines for specific tasks such as text-to-embedding, embedding-to-text, text segmentation, and metric analysis. This makes pipeline creation more modular, maintainable, and extensible, and gives different datasets and operations a standardized setup.

Use Case:

How?

Technical Decisions:

  1. Directory Structure:

    • The config_dir defaults to huggingface_pipelines/datacards. This directory is assumed to contain subdirectories for each dataset, where operation-specific YAML configuration files are stored.
    • Example structure:
      huggingface_pipelines/datacards/
      ├── dataset_name1/
      │   ├── text_to_embedding.yaml
      │   └── embedding_to_text.yaml
      └── dataset_name2/
          ├── text_segmentation.yaml
          └── analyze_metric.yaml
  2. Factory Pattern:

    • A factory pattern is used to create pipelines. Each operation has a corresponding factory (e.g., TextToEmbeddingPipelineFactory) that is responsible for creating the pipeline based on the configuration.
    • This pattern allows for easy extension; new operations can be supported by simply adding a new factory class and updating the pipeline_factories dictionary.
  3. Configuration Loading:

    • Configuration files are loaded based on the dataset name and operation type. If no configuration file exists for a given dataset and operation, an error message is logged and a FileNotFoundError is raised.
    • YAML is used for configuration files for readability and ease of editing.
  4. Pipeline Creation:

    • The create_pipeline method dynamically loads the configuration and uses the appropriate factory to create the pipeline.
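
The four decisions above can be sketched together as follows. This is a minimal illustration under stated assumptions, not the code in this PR: SketchPipeline, PipelineFactory, EmbeddingToTextPipelineFactory, load_yaml, load_config, and the injectable loader parameter are hypothetical names introduced here, and the ValueError for an unregistered operation is an assumed behavior. Only PipelineBuilder, TextToEmbeddingPipelineFactory, pipeline_factories, config_dir, create_pipeline, and the FileNotFoundError come from the description.

```python
import logging
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable, Dict

logger = logging.getLogger(__name__)

Config = Dict[str, Any]


def load_yaml(text: str) -> Config:
    """Parse YAML text. PyYAML is imported lazily and assumed to be installed."""
    import yaml  # third-party dependency: PyYAML

    return yaml.safe_load(text) or {}


@dataclass
class SketchPipeline:
    """Hypothetical stand-in for a real pipeline object."""

    operation: str
    config: Config


class PipelineFactory:
    """Hypothetical base factory; one subclass per supported operation."""

    operation = "base"

    def create_pipeline(self, config: Config) -> SketchPipeline:
        return SketchPipeline(self.operation, config)


class TextToEmbeddingPipelineFactory(PipelineFactory):
    operation = "text_to_embedding"


class EmbeddingToTextPipelineFactory(PipelineFactory):
    operation = "embedding_to_text"


class PipelineBuilder:
    def __init__(
        self,
        config_dir: str = "huggingface_pipelines/datacards",
        loader: Callable[[str], Config] = load_yaml,  # assumption: injectable parser
    ) -> None:
        self.config_dir = Path(config_dir)
        self.loader = loader
        # Supporting a new operation means adding a factory class and an entry here.
        self.pipeline_factories: Dict[str, PipelineFactory] = {
            "text_to_embedding": TextToEmbeddingPipelineFactory(),
            "embedding_to_text": EmbeddingToTextPipelineFactory(),
        }

    def load_config(self, dataset_name: str, operation: str) -> Config:
        """Read <config_dir>/<dataset_name>/<operation>.yaml."""
        path = self.config_dir / dataset_name / f"{operation}.yaml"
        if not path.is_file():
            logger.error("No configuration file found at %s", path)
            raise FileNotFoundError(f"No configuration file found at {path}")
        return self.loader(path.read_text(encoding="utf-8"))

    def create_pipeline(self, dataset_name: str, operation: str) -> SketchPipeline:
        """Load the configuration and delegate to the matching factory."""
        if operation not in self.pipeline_factories:
            # Assumed behavior: reject operations with no registered factory.
            raise ValueError(f"Unsupported operation: {operation}")
        config = self.load_config(dataset_name, operation)
        return self.pipeline_factories[operation].create_pipeline(config)
```

With a datacard in place, `PipelineBuilder().create_pipeline("dataset_name1", "text_to_embedding")` would read `dataset_name1/text_to_embedding.yaml` and pass the parsed dictionary to the text-to-embedding factory. Keeping the parser injectable is a design choice of this sketch only; it lets the lookup and dispatch logic be exercised without PyYAML.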

Work In Progress:

Test Plan

Unit Testing:

Command Line Testing:

Integration Testing:

avidale commented 1 month ago

Are we going to merge this?