dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0
11.13k stars 1.4k forks

Add @asset_hook - which would be very similar to @asset_check #20471

Open mkleinbort-ic opened 5 months ago

mkleinbort-ic commented 5 months ago

What's the use case?

I was trying to solve a problem and realized I could abuse @asset_check to solve it for me.

The problem I was trying to solve was:

Whenever asset foo materializes (as a parquet file in blob storage), I also want a copy of it written as a lance file in blob storage.

I thought of solving this via an IO manager, but I didn't want to create a custom IO manager for just one asset.

I thought of doing this as a software-defined asset that returns None, but I've tried that before and it feels wrong.

I thought I could wire this up as an op that takes the software-defined asset as a dependency, but ops are quite a different mental model than assets.

Ok, so... I could just write an asset check? That's not their purpose, but it'll work:


import polars as pl
from dagster import AssetCheckResult, AssetKey, asset_check


@asset_check(asset=AssetKey('foo'))
def copy_as_indexed_lance_file(foo: pl.DataFrame) -> AssetCheckResult:
    '''Copy over to blob storage as an indexed lance file'''

    # write_lance is my own method I monkeypatched into polars - don't worry about it
    result_or_error = foo.sort('entityId', 'datetime').write_lance('az://MY-BUCKET/foo_as_lance', index_on=('entityId', 'datetime'))
    # The check "passes" when the write returned no error
    return AssetCheckResult(passed=result_or_error is None, metadata={})

Ideas of implementation

In a way, I don't think anything NEEDS to be implemented; I'm just sharing that @asset_check can be used as a generic callback/hook that extends what happens when an asset is materialized.
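
To make the callback idea concrete, here is a minimal plain-Python sketch of what a generic post-materialization hook could look like. It does not use Dagster at all: `asset_hook`, `materialize`, and `_HOOKS` are hypothetical names invented for illustration, not Dagster APIs, and the sorted-list copy is only a stand-in for writing a second file.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

# Hypothetical registry: asset name -> callbacks to run after it materializes.
_HOOKS: Dict[str, List[Callable[[Any], None]]] = defaultdict(list)


def asset_hook(asset: str):
    """Register the decorated function as a hook for `asset` (hypothetical API)."""
    def decorator(fn: Callable[[Any], None]) -> Callable[[Any], None]:
        _HOOKS[asset].append(fn)
        return fn
    return decorator


def materialize(asset: str, compute: Callable[[], Any]) -> Dict[str, bool]:
    """Compute the asset, then run each hook and report which ones succeeded."""
    value = compute()
    results: Dict[str, bool] = {}
    for hook in _HOOKS[asset]:
        try:
            hook(value)
            results[hook.__name__] = True
        except Exception:
            # A failing hook is recorded but does not fail the asset itself.
            results[hook.__name__] = False
    return results


copies: List[List[int]] = []


@asset_hook("foo")
def copy_sorted(rows: List[int]) -> None:
    # Stand-in for "write a sorted, indexed copy somewhere else".
    copies.append(sorted(rows))


@asset_hook("foo")
def always_fails(rows: List[int]) -> None:
    raise RuntimeError("simulated hook failure")


results = materialize("foo", lambda: [3, 1, 2])
print(results)
```

The design choice worth noting in this sketch is that each hook reports pass/fail independently of the asset, which is roughly the behaviour the @asset_check workaround gives you for free.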

emirkmo commented 5 months ago

Each asset should really correspond to a data asset, so what you describe would make perfect sense as a second asset that returns None. Almost all of our assets return Output(None, …) unless a specialized IOManager is configured.

However, this is really useful for other reasons too, such as success_hooks that don't need an entire new asset sensor but are simply triggered by the asset: for example, a Teams/Slack message on success, like in a job/op, but for an asset. (Good tip, thanks.)