Open kornerc opened 3 months ago
Thank you @kornerc - your proposition of using read-only connectors and concurrency limits for partitioned assets is quite elegant. I agree that it would be worthwhile to add a note of this in the lesson, as I recall a few other community members running into this issue too.
I will make a ticket for us internally to update the lesson accordingly, and will update here once complete. Thank you!
@cmpadden thank you for your reply and the information.
Another alternative solution came into my mind: using backoff
when connecting to the database, as it is done in the DuckDB resource https://github.com/dagster-io/dagster/blob/f88e4905eadcca34a0b5c7d4f385185ecc660172/python_modules/libraries/dagster-duckdb/dagster_duckdb/resource.py#L51
This might be the simplest solution because it is not necessary to manually limit the concurrency and introducing new Dagster concepts.
Good call! I really like that approach. Thanks for following up.
What's the issue or suggestion?
I am following the Dagster University course
Dagster Essentials
and encountered a race condition at the end ofLesson 4: Asset dependencies
.The assets have the following lineage:
taxi_trips
andtaxi_zones
both write data to the DuckDB database andmanhattan_stats
as well astrips_by_week
read from it. With DuckDB it is only possible to have one open read/write connection or multiple read connections at the same time. In my case a race condition was triggered due to multiple open open read/write connections at the same time.It is possible to set
read_only=True
when connecting to the database inmanhattan_stats
as well astrips_by_week
, to reduce the problem. However, it is still possible to get a race condition withtaxi_trips
andtaxi_zones
, since both of them need a read/write connection. Nonetheless, it is possible to circumvent the problem with IOManagers or with usingop_tags
and setting the maximum parallel execution of assets with this tag to 1. But they are more advanced features.So maybe it would be a solution to have only read connections for
manhattan_stats
as well astrips_by_week
and add a warning that in case of an error due to a DuckDB connection issue it should be sufficient to re-execute the pipeline?Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.