dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0
11.27k stars 1.42k forks source link

Race condition DuckDB in Dagster Essentials #22674

Open kornerc opened 3 months ago

kornerc commented 3 months ago

What's the issue or suggestion?

I am following the Dagster University course Dagster Essentials and encountered a race condition at the end of Lesson 4: Asset dependencies.

The assets have the following lineage: Global_Asset_Lineage

taxi_trips and taxi_zones both write data to the DuckDB database and manhattan_stats as well as trips_by_week read from it. With DuckDB it is only possible to have one open read/write connection or multiple read connections at the same time. In my case a race condition was triggered due to multiple open open read/write connections at the same time.

It is possible to set read_only=True when connecting to the database in manhattan_stats as well as trips_by_week, to reduce the problem. However, it is still possible to get a race condition with taxi_trips and taxi_zones, since both of them need a read/write connection. Nonetheless, it is possible to circumvent the problem with IOManagers or with using op_tags and setting the maximum parallel execution of assets with this tag to 1. But they are more advanced features.

So maybe it would be a solution to have only read connections for manhattan_stats as well as trips_by_week and add a warning that in case of an error due to a DuckDB connection issue it should be sufficient to re-execute the pipeline?

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

cmpadden commented 3 months ago

Thank you @kornerc - your proposition of using read-only connectors and concurrency limits for partitioned assets is quite elegant. I agree that it would be worthwhile to add a note of this in the lesson, as I recall a few other community members running into this issue too.

I will make a ticket for us internally to update the lesson accordingly, and will update here once complete. Thank you!

kornerc commented 3 months ago

@cmpadden thank you for your reply and the information.

Another alternative solution came into my mind: using backoff when connecting to the database, as it is done in the DuckDB resource https://github.com/dagster-io/dagster/blob/f88e4905eadcca34a0b5c7d4f385185ecc660172/python_modules/libraries/dagster-duckdb/dagster_duckdb/resource.py#L51 This might be the simplest solution because it is not necessary to manually limit the concurrency and introducing new Dagster concepts.

cmpadden commented 3 months ago

Good call! I really like that approach. Thanks for following up.