dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0
11.91k stars 1.49k forks source link

[dagster-embedded-elt][dlt] handle conflicting pipeline names for parallelization / partitioning #21022

Open cmpadden opened 7 months ago

cmpadden commented 7 months ago

What's the use case?

Additional context here - https://github.com/dagster-io/dagster/discussions/17300#discussioncomment-9001114

There is currently a limitation in dlt in that if a pipeline with the same name runs in parallel it can cause unwanted behaviors and race conditions. This is acting as a blocker to implement parallelization and partitioning on dlt pipelines.

Ideas of implementation

  1. The concurrency issue is fixed upstream (see: https://github.com/dlt-hub/dlt/issues/1102)
  2. We make pipeline names unique via some random UID

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

cmpadden commented 6 months ago

Closing as partitioning support was added using non-conflicting pipeline names.

patrikdevlin commented 5 months ago

@cmpadden I think we should re-open this issue since modifying the pipeline_name contains unwanted side effects (as described here https://github.com/dagster-io/dagster/pull/22000#issuecomment-2174983213).

Supporting partitions within the dagster-dlt integration and fixing the concurrency issue are two separate tasks imo -- The work that @edsoncezar16 did actually enable partition support because it doesn't implicitly rely on the source and pipeline objects to be present in the metadata, however the pipeline_name in dlt has different uses besides being just a name so modifying the users input can cause issues.

edsoncezar16 commented 5 months ago

Hi, @patrikdevlin , thanks for finding this out. After careful consideration, I also agree that the partitions/concurrency are distinct problems. But let's wait for Colton's thoughts on how to proceed.

cmpadden commented 5 months ago

Thanks @patrikdevlin - I will re-open this issue, and believe we should wait for (or try to help with) an upstream fix.

benoitsdm commented 2 months ago

any workaround? how to run a backfill of a dlt assets?

cmpadden commented 1 month ago

any workaround? how to run a backfill of a dlt assets?

Hi @benoitsdm - a workaround is to set a concurrency limit on your dlt assets. See this guide for more information on how to do that.

https://docs.dagster.io/guides/limiting-concurrency-in-data-pipelines#limiting-concurrency-in-data-pipelines