Open-Model-Initiative / OMI-Data-Pipeline

Add ContentSource model and update content Model #21

Closed. fearnworks closed this issue 2 months ago

fearnworks commented 2 months ago

Description

We need to add a new data model (table) for ContentSource and update the Content model so that each source type can be handled with its own extraction logic.

New ContentSource model

Add the following new model:

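import enum

from sqlalchemy import JSON, Column, Enum, ForeignKey, Integer, String
from sqlalchemy.orm import relationship

# Base is assumed to be the project's existing declarative base.

class ContentSourceType(enum.Enum):
    URL = "url"
    PATH = "path"
    HUGGING_FACE = "hugging_face"

class ContentSource(Base):
    __tablename__ = "content_sources"

    id = Column(Integer, primary_key=True, index=True)
    content_id = Column(Integer, ForeignKey("contents.id"))
    type = Column(Enum(ContentSourceType))
    value = Column(String)  # URL, local path, or Hugging Face dataset reference
    # "metadata" is a reserved attribute name on declarative classes, so the
    # attribute is renamed here while the database column keeps the name "metadata".
    source_metadata = Column("metadata", JSON, nullable=True)  # additional source-specific data
    content = relationship("Content", back_populates="sources")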
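The `back_populates="sources"` argument implies a matching `sources` relationship on the Content side. Below is a minimal sketch of that pairing, assuming SQLAlchemy 1.4 or later and an in-memory SQLite database. The `Content` columns shown (`id`, `name`), the `cascade` setting, and the local `Base` are placeholders for illustration, not the project's actual schema. The JSON attribute is named `source_metadata` because `metadata` is a reserved attribute name on declarative classes.

```python
import enum

from sqlalchemy import JSON, Column, Enum, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()  # stand-in for the project's own Base


class ContentSourceType(enum.Enum):
    URL = "url"
    PATH = "path"
    HUGGING_FACE = "hugging_face"


class Content(Base):
    __tablename__ = "contents"

    id = Column(Integer, primary_key=True, index=True)
    name = Column(String)  # placeholder column, not taken from the issue
    # One Content row can have many sources; this mirrors ContentSource.content.
    sources = relationship(
        "ContentSource", back_populates="content", cascade="all, delete-orphan"
    )


class ContentSource(Base):
    __tablename__ = "content_sources"

    id = Column(Integer, primary_key=True, index=True)
    content_id = Column(Integer, ForeignKey("contents.id"))
    type = Column(Enum(ContentSourceType))
    value = Column(String)
    # The attribute is renamed; the database column is still called "metadata".
    source_metadata = Column("metadata", JSON, nullable=True)
    content = relationship("Content", back_populates="sources")


engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    item = Content(name="example")
    item.sources.append(
        ContentSource(type=ContentSourceType.URL, value="https://example.com/image.png")
    )
    session.add(item)
    session.commit()

    # Read the row back and follow the relationship in the other direction.
    stored = session.query(ContentSource).one()
    stored_type = stored.type
    parent_name = stored.content.name
```

Appending to `item.sources` is enough to set `content_id` on flush, so callers never need to assign the foreign key by hand.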

Tasks

Rationale

This change allows for handling different source types (URL, local path, Hugging Face dataset) that have their own specific extraction logic. It provides more flexibility and better organization for managing content sources.
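One way the per-type extraction logic could be wired up is a small registry keyed by `ContentSourceType`. The sketch below is illustrative only: the extractor functions, the `EXTRACTORS` mapping, and `extract` are hypothetical names that do not appear in the issue, and each extractor returns a description string instead of doing real I/O.

```python
import enum
from typing import Callable, Dict


class ContentSourceType(enum.Enum):
    URL = "url"
    PATH = "path"
    HUGGING_FACE = "hugging_face"


# Hypothetical extractors: real ones would download, read, or load data.
def extract_from_url(value: str) -> str:
    return f"download {value}"


def extract_from_path(value: str) -> str:
    return f"read local file {value}"


def extract_from_hugging_face(value: str) -> str:
    return f"load dataset {value}"


# Registry mapping each source type to its extraction function.
EXTRACTORS: Dict[ContentSourceType, Callable[[str], str]] = {
    ContentSourceType.URL: extract_from_url,
    ContentSourceType.PATH: extract_from_path,
    ContentSourceType.HUGGING_FACE: extract_from_hugging_face,
}


def extract(source_type: ContentSourceType, value: str) -> str:
    """Dispatch to the extractor registered for the given source type."""
    try:
        extractor = EXTRACTORS[source_type]
    except KeyError:
        raise ValueError(f"no extractor registered for {source_type!r}") from None
    return extractor(value)
```

With this layout, supporting a new source type means adding one enum member and one registry entry, without touching the existing extractors.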