drivendataorg / erdantic

Entity relationship diagrams for Python data model classes like Pydantic
https://erdantic.drivendata.org/
MIT License

Feature Request: Visualize Data Transformations Between Pydantic Models #129

Open lucas-nelson-uiuc opened 6 days ago

lucas-nelson-uiuc commented 6 days ago

Hey! I just found out about erdantic through Python Bytes. It looks great, and I've loved messing with it so far.

I work with a lot of Pydantic models to facilitate PySpark transformations: given a model, you can read a raw file or loaded DataFrame, then transform and validate it against that model. Since it's built on Pydantic, it allows some nice features (nesting models, ease of documentation, etc.) and encourages declarative, composable pipelines.

However, instead of composing fields as collections of other models, I convert data from one model to the next. Most examples look like this:

```python
import datetime
import decimal

from pydantic import BaseModel, Field
from pyspark.sql import functions as F

# describe raw data as model to facilitate read and preprocessing steps
class RawFinancialStatement(BaseModel):
    acct: str = Field(pattern=r"\d{5}")
    descr: str
    posted: datetime.date = Field(
        ge=datetime.date(2024, 1, 1), le=datetime.date(2024, 12, 31)
    )
    amount: decimal.Decimal

# for all files, read-in using model's schema, union together, then transform and validate against model
raw_data = RawFinancialStatement.read(
    source=["path/to/file.csv", "path/to/another_file.csv"]
)

# convert intermediate model to expected model for analytical workflows
class CommonFinancialStatement(BaseModel):  # or inherits from a defined business model
    account_number: str = Field(alias="acct")
    account_description: str = Field(alias="descr")
    date_effective: datetime.date = Field(alias="posted")
    date_posted: datetime.date = Field(alias="posted")
    net_amount: decimal.Decimal = Field(alias="amount")
    user_posted: str = Field(
        default=F.when(F.col("acct").startswith("A"), "USER1").otherwise("USER2")
    )

# transform and validate data against model
processed_data = CommonFinancialStatement.transform(data=raw_data).validate()
```

Using erdantic, would it be possible to construct an ER diagram across multiple models that simply describes how data is transformed from one model to the next? Please let me know if I need to explain my use case some more. Thank you!
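To make the request concrete, here is a rough sketch of the kind of field-level mapping I have in mind. It does not use erdantic at all. It assumes Pydantic v2 (`model_fields`) and treats each alias on the target model as the name of a field on the source model. The helpers `field_lineage` and `lineage_to_dot` are hypothetical names made up for this illustration, the validation constraints are dropped for brevity, and the PySpark default on `user_posted` is replaced with a plain placeholder string.

```python
import datetime
import decimal

from pydantic import BaseModel, Field


class RawFinancialStatement(BaseModel):
    acct: str
    descr: str
    posted: datetime.date
    amount: decimal.Decimal


class CommonFinancialStatement(BaseModel):
    account_number: str = Field(alias="acct")
    account_description: str = Field(alias="descr")
    date_effective: datetime.date = Field(alias="posted")
    date_posted: datetime.date = Field(alias="posted")
    net_amount: decimal.Decimal = Field(alias="amount")
    user_posted: str = "USER1"  # placeholder default; has no source field


def field_lineage(source, target):
    """Hypothetical helper: (source_field, target_field) pairs inferred from aliases."""
    edges = []
    for name, info in target.model_fields.items():
        # Assumption: an alias on the target names a field on the source.
        if info.alias in source.model_fields:
            edges.append((info.alias, name))
    return edges


def lineage_to_dot(source, target):
    """Hypothetical helper: render the inferred pairs as Graphviz DOT text."""
    src_name, dst_name = source.__name__, target.__name__
    lines = ["digraph lineage {", "  rankdir=LR;"]
    for src, dst in field_lineage(source, target):
        lines.append(f'  "{src_name}.{src}" -> "{dst_name}.{dst}";')
    lines.append("}")
    return "\n".join(lines)


dot = lineage_to_dot(RawFinancialStatement, CommonFinancialStatement)
print(dot)
```

The printed DOT text could be rendered with Graphviz. Fields with no alias, such as `user_posted`, get no edge in this sketch, so derived columns would need some other way to declare their inputs.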

jayqi commented 5 days ago

Hi @lucas-nelson-uiuc,

Thanks for trying out erdantic!

I'd like to better understand your use case. Some questions here: