dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

[DB IO managers] support when a new column is added to an asset #13098

Open jamiedemaria opened 1 year ago

jamiedemaria commented 1 year ago

Right now, if you have a partitioned asset and you change the DataFrame returned by the asset to include a new column, the IO manager will fail (specifically, the Snowflake pandas IO manager fails; the other IO managers still need to be tested). This is because, for partitioned assets, we don't fully delete and replace the table on materialization. Instead, we either delete all the table's rows (retaining the column schema) or delete only the rows corresponding to the partition, and then append the new DataFrame. Since the new DataFrame has a column that doesn't exist in the Snowflake table, we get an error.
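A minimal stand-in for the failure mode, using sqlite3 instead of Snowflake (an assumption: the append path behaves analogously). Appending a DataFrame that carries a column the table lacks fails, because the generated INSERT references the unknown column:

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(":memory:")

# Create the table with the original two-column schema.
pd.DataFrame({"a": [1], "b": [4]}).to_sql("t", con, index=False)

# Append a DataFrame that has gained a column "c".
try:
    pd.DataFrame({"a": [2], "b": [5], "c": [8]}).to_sql(
        "t", con, index=False, if_exists="append"
    )
    failed = False
except Exception:  # e.g. "table t has no column named c"
    failed = True
```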

Ideas of implementation

Two options (potentially more as well):

a) Introspect the DataFrame, find the new column, map it to the corresponding DB column type, and manually create the column. We'd need to find a way to keep our mapping of Python types to database types in sync.
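A hedged sketch of option (a): diff the DataFrame's dtypes against the columns already in the table and emit `ALTER TABLE ... ADD COLUMN` statements. The dtype mapping and function name here are illustrative assumptions, not Dagster's actual implementation:

```python
# Assumed pandas-dtype-string -> Snowflake-type mapping; keeping this in
# sync with what the IO manager's write path produces is the hard part
# noted above.
PANDAS_TO_SNOWFLAKE = {
    "int64": "NUMBER",
    "float64": "FLOAT",
    "object": "VARCHAR",
    "bool": "BOOLEAN",
    "datetime64[ns]": "TIMESTAMP_NTZ",
}


def new_column_ddl(df_dtypes: dict, existing_columns: set, table: str) -> list:
    """Build ALTER TABLE statements for columns in the DataFrame but not the table.

    df_dtypes: column name -> dtype string, e.g. dict(df.dtypes.astype(str)).
    """
    return [
        f"ALTER TABLE {table} ADD COLUMN {col} "
        f"{PANDAS_TO_SNOWFLAKE.get(dtype, 'VARCHAR')}"
        for col, dtype in df_dtypes.items()
        if col not in existing_columns
    ]
```

For the repro below, `new_column_ddl({"a": "int64", "b": "int64", "c": "int64"}, {"a", "b"}, "a_plain_table")` would produce a single statement adding column `c` as `NUMBER`.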

b) Copy the full table into memory, append the new partition data with the new column, drop the table, then load the new table into the DB. This would be extremely memory-intensive for large tables, which is likely a deal breaker.
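A hedged sketch of what option (b) would amount to in pandas, assuming hypothetical fetch/store steps around it. Note the whole table must sit in memory, which is the deal breaker above:

```python
import pandas as pd


def rewrite_with_new_partition(
    existing: pd.DataFrame,
    partition: pd.DataFrame,
    partition_col: str,
    partition_value,
) -> pd.DataFrame:
    """Splice a re-materialized partition (possibly with new columns) into the
    full table, returning the frame that would be written back to the DB."""
    # Drop the rows for the partition being re-materialized...
    kept = existing[existing[partition_col] != partition_value]
    # ...then append the new data; old rows get NaN in any new column.
    return pd.concat([kept, partition], ignore_index=True)
```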

Additional information

OG Slack thread: https://dagster.slack.com/archives/C01U954MEER/p1679514144698129

Message from the maintainers Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

sryza commented 1 year ago

I agree that (b) is a no-go, for the reason you listed.

sryza commented 1 year ago

(a) seems promising

jamiedemaria commented 1 year ago

This issue also exists with non-partitioned assets. Replication case:

```python
import pandas as pd
from dagster import asset

@asset
def a_plain_table() -> pd.DataFrame:
    return pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
```

and materialize it. Then update the asset definition to

```python
@asset
def a_plain_table() -> pd.DataFrame:
    return pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
```

and the materialization will fail.

DanMeyer commented 1 year ago

I've been doing (b) manually. I have smaller tables than most (the largest one has 2-3 million records), so this hasn't been a problem performance-wise. However, I have a few Snowflake roles (some for user-facing apps) that only have access to a few specific tables (so I can't just grant select on all future tables in a schema), and I lose those permission grants when I drop the tables.

My best idea so far is to include statements granting those permissions as part of any such assets.
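That workaround could look something like the sketch below: after the table is dropped and recreated, re-issue the grants as part of the asset. The role names, privilege list, and function name are illustrative assumptions:

```python
def regrant_statements(table: str, grants: dict) -> list:
    """Build GRANT statements to restore after a table is recreated.

    grants: role name -> list of privileges to restore, e.g.
    {"APP_READER": ["SELECT"]} (hypothetical role).
    """
    return [
        f"GRANT {', '.join(privs)} ON TABLE {table} TO ROLE {role}"
        for role, privs in grants.items()
    ]
```

Each returned statement would be executed against Snowflake at the end of the asset's materialization, so the user-facing roles keep their access across the drop-and-recreate.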

malow106 commented 4 months ago

Hi, any plan for this issue? Thanks!