dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

Pydantic v2 breaks passing dictionary values in ConfigurableIOManagers [dagster-polars] #21289

Open ion-elgreco opened 4 months ago

ion-elgreco commented 4 months ago

Dagster version

1.7.1

What's the issue?

We noticed that when we pass dictionaries with non-env-var values through a resource (PolarsParquetIOManager), the values get lost once the resource goes through the Definitions object. Before that they are still there. See below:

from dagster_polars import PolarsParquetIOManager
from dagster import Definitions
io_manager = PolarsParquetIOManager(storage_options={"key":"value"})

print(io_manager.cloud_storage_options)
{'key': 'value'}

Definitions(resources={"parquet_io_manager":io_manager})._created_pending_or_normal_repo.get_top_level_resources()['parquet_io_manager'].get_config_field()

Field(<dagster._config.field_utils.Shape object at 0x7fea437933a0>, default={'extension': '.parquet', 'base_dir': None, 'storage_options': {}}, is_required=False)

Fyi @danielgafni

What did you expect to happen?

No response

How to reproduce?

No response

Deployment type

None

Deployment details

No response

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

danielgafni commented 4 months ago

I think you are printing the default values, not the configured values

ion-elgreco commented 4 months ago

> I think you are printing the default values, not the configured values

@danielgafni with pydantic V1 it will show all the configured values.

With pydantic V2 it will only show the env vars, but I didn't add that in the example above.

Also, it's just a way to show that non-env-var values disappear. You can see it in the UI as well.

ShootingStarD commented 1 month ago

We have the same problem in our pipeline. Even though we pass "aws_endpoint_url" in storage_options to the io_manager, it gets overwritten with an empty dict. scan_parquet is then called with an empty dict as storage_options, so our desired "aws_endpoint_url" is not used.

Tazoeur commented 1 month ago

The issue seems to be located in the ConfigurableResourceFactory.

I can see that the values passed to the config have been modified between data_without_resources and casted_data_without_resources.

https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/_config/pythonic_config/resource.py#L331-L346

>>> data_without_resources
{'base_dir': 's3://my-bucket', 'storage_options': {'aws_access_key_id': 'minioadmin', 'aws_secret_access_key': 'minioadmin', 'aws_endpoint_url': 'http://localhost:32808'}}
>>> casted_data_without_resources
{'base_dir': 's3://my-bucket', 'storage_options': {}}

I believe that the pydantic aliases are not correctly taken into account.
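
To illustrate the kind of alias mismatch I mean, here is a minimal sketch in plain pydantic v2. It is not the code from resource.py, and HypotheticalIOManagerConfig is a made-up model that only mimics a field named cloud_storage_options with the alias storage_options. It shows one way a value can be dropped silently: data keyed by the field name is re-validated by a model that, by default, only accepts the alias. Whether this is what actually happens inside ConfigurableResourceFactory is an assumption on my part.

```python
from typing import Any, Dict, Optional

from pydantic import BaseModel, Field


class HypotheticalIOManagerConfig(BaseModel):
    """Made-up stand-in for an IO manager config with an aliased field."""

    base_dir: Optional[str] = None
    # Attribute name and accepted input key differ because of the alias.
    cloud_storage_options: Dict[str, Any] = Field(
        default_factory=dict, alias="storage_options"
    )


# Constructing with the alias works as expected.
cfg = HypotheticalIOManagerConfig(
    base_dir="s3://my-bucket",
    storage_options={"aws_endpoint_url": "http://localhost:32808"},
)

# Dumping without by_alias=True produces keys by field name ...
by_name = cfg.model_dump()
# ... while by_alias=True produces keys the model accepts as input.
by_alias = cfg.model_dump(by_alias=True)

# Re-validating the field-name dict: the unknown key is ignored
# (pydantic's default for extra keys), so the default {} is used.
lost = HypotheticalIOManagerConfig.model_validate(by_name)

# Re-validating the alias-keyed dict keeps the value.
kept = HypotheticalIOManagerConfig.model_validate(by_alias)

print(lost.cloud_storage_options)
print(kept.cloud_storage_options)
```

If something like this round trip happens during the cast, making it consistently alias-aware (or allowing population by field name on the model) would avoid the loss, but that needs to be checked against the actual Dagster code.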