dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

Support "materialize all" for multi-assets in Dagit. #14759

Open jrouly opened 1 year ago

jrouly commented 1 year ago

What's the use case?

We have a @multi_asset (def featurize) that produces multiple outputs (imputed_features and features). Both of these outputs are always (unconditionally) materialized from the featurize multi-asset.

We have modeled the asset something like this:

from dagster import AssetOut, Output, multi_asset

@multi_asset(outs={"imputed_features": AssetOut(), "features": AssetOut()})
def featurize(context):
    # Both outputs are always produced together.
    yield Output(value=123, output_name="imputed_features")
    yield Output(value=456, output_name="features")

When you load Dagit, there is no way to materialize featurize itself. Only its outputs, imputed_features and features, appear in the asset catalog. So to materialize featurize, you have to select both imputed_features and features; otherwise Dagit presents an error, since both outputs are required.

This is pretty inconvenient for large asset graphs -- you have to know all of the outputs for a multi-asset, and how many may/may not be required, before you can even make a selection and hit materialize.

What we would very much like to be able to do is simply "materialize featurize". In other words, see the multi-asset itself in Dagit in the asset graph, with the outputs linked to it. It would probably look very similar to the way asset groups look. And then be able to materialize all of the outputs together.

subsets

Of note, we can achieve the desired behavior by (not super intuitively) making the asset outputs optional.

from dagster import AssetOut, Output, multi_asset

@multi_asset(
    outs={
        "imputed_features": AssetOut(is_required=False),
        "features": AssetOut(is_required=False),
    },
    can_subset=True,
)
def featurize(context):
    # Both outputs are still yielded unconditionally, whatever was selected.
    yield Output(value=123, output_name="imputed_features")
    yield Output(value=456, output_name="features")

In this case, you can select either output for materialization, and both outputs will be materialized. However, this approach has some drawbacks.

  1. The code no longer accurately models the behavior of the assets. In truth, all of the asset outputs really are required. There will never be a subset of outputs.
  2. All Dagit screens visually represent only the selected asset output as being materialized. For example, if you select imputed_features for materialization, the run display and run history both only include imputed_features, even though features also gets materialized.
  3. The fact that features also gets materialized shows up in the logs as a warning that an "unexpected asset" is being materialized.

Ideas of implementation

Treat multi-assets more like asset groups in Dagit. Allow the @multi_asset itself to be materialized as a whole.

It's important to have @multi_asset distinct from an asset group, because the code that is generating these outputs cannot be teased out into separate assets. Otherwise we'd use an asset group. But being able to "materialize all" for a @multi_asset, like we can for asset groups, would be ideal.

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

OwenKephart commented 1 year ago

One other potential angle on this -- when you're looking at a subset of the asset graph, and click on one asset that's part of the multi-asset, all assets within that multi-asset could be highlighted (assuming the multi-asset is not subsettable). This would make it much easier to reliably select the desired assets from those views.
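The selection behavior suggested above could be sketched roughly as follows. This is a plain-Python illustration only, not Dagster or Dagit code: MultiAssetInfo and expand_selection are hypothetical names, and the sketch assumes the UI already knows which asset keys belong to each multi-asset and whether it is subsettable.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set


@dataclass(frozen=True)
class MultiAssetInfo:
    """Hypothetical stand-in for what the UI knows about one multi-asset."""

    name: str
    output_keys: FrozenSet[str]
    can_subset: bool = False


def expand_selection(selected: Iterable[str], multi_assets: Iterable[MultiAssetInfo]) -> Set[str]:
    """Return the selection plus any sibling outputs that must run with it."""
    expanded = set(selected)
    for info in multi_assets:
        # A non-subsettable multi-asset always produces every output, so
        # selecting one of its outputs implies selecting all of them.
        if not info.can_subset and expanded & info.output_keys:
            expanded |= info.output_keys
    return expanded


featurize = MultiAssetInfo("featurize", frozenset({"imputed_features", "features"}))
print(sorted(expand_selection({"features"}, [featurize])))
# prints ['features', 'imputed_features']
```

With something like this in place, clicking features alone would select (or highlight) imputed_features as well, while subsettable multi-assets would keep the selection exactly as the user made it.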