dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0
10.81k stars 1.35k forks source link

Pythonic Input/Output Metadata #15125

Open danielgafni opened 1 year ago

danielgafni commented 1 year ago

What's the use case?

I'm currently working on dagster-polars, and I'm trying to provide ways for the user to control the IOManager behavior. This is usually done via input/output metadata. For example, for the PolarsDeltaIOManager, I'm trying to support all the read/write options deltalake has. The problems with the current design are:

Having a Pydantic-based metadata object instead would solve both. You could instantly jump to the class definition and see all the attributes it has. Both code readability and "documentation" (it would be self-documenting) would be improved. Obviously, you would also get IDE help, type safety, etc.

Ideas of implementation

Allow the user to define a specific classes for this IOManager input and output metadata. They would inherit from InputMetadata and OutputMetadata. Allow the load_input and handle_output methods to take an additional argument which Dagster would fill with the correct class instance. This can be done with backwards-compatibility, right? Allow passing instances of these classes to @asset, AssetIn, AssetOut, and events logging.

from dagster import InputMetadata, OutputMetadata, IOManager, asset

class MyInputMetadata(InputMetadata):
    foo: str = "bar"

class MyOutputMetadata(OutputMetadata):
    asd: str = "sda"

class MyIOManager(IOManager):
    def load_input(context: InputContext, metadata: MyInputMetadata):
        ...

    def handle_output(context: OutputContext, metadata: MyOutputMetadata, obj):
        ...

@asset(
    metadata=MyOutputMetadata()
)
def my_asset():
    ...

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

abkfenris commented 1 year ago

Some similar pondering on IOManager metadata https://github.com/dagster-io/dagster/discussions/6913

jamiedemaria commented 1 year ago

this is a cool idea! having more structure IO manager metadata is something we've talked about a couple times, but haven't prioritized. tagging @sryza and @schrockn since you might find this relevant!

ei-grad commented 1 year ago

Thank you for your detailed proposal. It certainly opens up an interesting discussion. I do have a few thoughts that I'd like to share.

Firstly, wouldn't having IOManager-specific asset metadata potentially limit the flexibility of using the asset with a different IOManager class, especially if the metadata structure is incompatible?

From my experience, I've found it beneficial to provide the asset-specific configuration for IOManager within its own configuration. For instance, we could add a resource config field with parameters specified by a dictionary, the keys of which are equal to the asset_key values. This seems to offer a good balance of flexibility and structure.

Regarding the suggested metadata argument in the load_input and handle_output methods:

    def load_input(context: InputContext, metadata: MyInputMetadata):
    def handle_output(context: OutputContext, metadata: MyOutputMetadata, obj):

I believe the current design is not as straightforward to adapt to this proposed change. For the input context, you have the metadata of the asset for which you load the calculation job, and an upstream_output context with its own metadata (which seems to be the metadata you're expecting). This seems slightly more complex than it might need to be.

Perhaps it would be more fruitful to rethink or refactor this part of the design, rather than focusing on structuring the metadata? Or something from the concerns mentioned in https://github.com/dagster-io/dagster/discussions/6913 (to which @abkfenris already gave reference above). Additionally, introducing a metadata argument to these methods would alter the public interface and could potentially disrupt existing custom IOManager code.

Lastly, there are other important parameters in the contexts, should metadata be singled out in this way?

I hope these thoughts provide some useful feedback. I'm eager to hear your thoughts on these points.

sryza commented 1 year ago

I believe this is related to or a slightly-more-specific version of https://github.com/dagster-io/dagster/issues/9617

I think there are a lot of API questions we'd need to think through, but I'm excited about eventually adding something like this.