dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

[docs] - Improve Pandera guide #8255

Open dagsir[bot] opened 2 years ago

dagsir[bot] commented 2 years ago

Summary

Update the Pandera guide to:


Issue from the Dagster Slack

This issue was generated from the Slack conversation at: https://dagster.slack.com/archives/C01U954MEER/p1654667850385989?thread_ts=1654667850.385989&cid=C01U954MEER


Conversation excerpt

U03G3ND6C03: Hi all, I'm looking to use pandera to validate my SDAs. I want to validate my raw data assets, which are straight dumps of the source data. However, there are spaces in the raw data field names, and I'd like to use the dagster-pandera API, which looks like the snippet below. Is there a way to overcome the spaces, preferably without changing the raw column names?

class Member_Schema(pa.SchemaModel):
  # col_name: Series[expected data type] - pa.Field()
  client number: Series[float64] = pa.Field()
  account number: Series[object] = pa.Field()

U015C9U9RLK: <@U018K0G2Y85> issue dagster-pandera doesn’t handle spaces in col names

U01GTMVMGQH: Hi Barry, dagster-pandera supports either of pandera’s formats for defining a dataframe schema: the SchemaModel approach (which is illustrated in your snippet) and the pa.DataFrameSchema approach. For columns with spaces, you should use the pa.DataFrameSchema approach:

from dagster_pandera import pandera_schema_to_dagster_type
import pandera as pa
member_schema = pa.DataFrameSchema(
    {
        "client number": pa.Column(float),
        "account number": pa.Column(object)
    }
)

df_type = pandera_schema_to_dagster_type(member_schema)

See Pandera docs for more on the DataFrameSchema object.

U03G3ND6C03: Ok sweet! So am I able to pass in the member_schema to my asset like so? It should work for either format of the schema?

@asset(dagster_type=pandera_schema_to_dagster_type(Member_Schema))

Message from the maintainers:

Do you care about this too? Give it a :thumbsup:. We factor engagement into prioritization.

smackesey commented 2 years ago

This was a question about whether dagster-pandera supports columns with spaces in their names (it does). The solution is to improve the docs by linking the API doc from the guide and emphasizing that either schema-defining approach is supported.