dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

Docs for partitioned backfills on sling runs #24234

Open nrsmac opened 3 weeks ago

nrsmac commented 3 weeks ago

What's the issue or suggestion?

There isn't a clearly documented way to use partitions with Sling. I see I can provide a partitions_def, but how are the partition values passed to Sling for a backfill? https://docs.dagster.io/_apidocs/libraries/dagster-embedded-elt#sling-dagster-embedded-elt-sling

My defined asset:

from dagster_embedded_elt.sling import SlingResource, sling_assets
from dagster import file_relative_path
from partitions import daily_partitions_def

replication_config = file_relative_path(__file__, "../resources/replication.yml")

@sling_assets(replication_config=replication_config, partitions_def=daily_partitions_def)
def my_sling_assets(context, sling: SlingResource):  # renamed so it does not shadow the imported decorator
    yield from sling.replicate(context=context)  # Tried passing as kwargs here...
    for row in sling.stream_raw_logs():
        context.log.info(row)

The Sling documentation gives an example of passing environment variables to Sling: https://docs.slingdata.io/sling-cli/run/configuration/variables

replication.yml:

source: SQLDB
target: DUCKDB

defaults:
  mode: backfill
  object: "{stream_schema}_{stream_table}"
  source_options:
    empty_as_null: false
  target_options:
    column_casing: snake
streams:
  example.stream:
    object: example.object
    primary_key: pk
    update_key: start_time
    source_options:
      limit: 1000
      #range: 2024-07-01,2024-07-02
      range: ${START_DATE},${END_DATE}   # How do I get partition keys to populate here?
env:
  SLING_LOADED_AT_COLUMN: true
  SLING_STREAM_URL_COLUMN: true
  start_date: '${START_DATE}'  # In the case of using envvars, but I want the partition keys from the execution context here.
  end_date: '${END_DATE}' 

cmpadden commented 2 weeks ago

Hi @nrsmac - @nicklausroach and I are going to explore this, and plan to update the documentation accordingly.

The replicate method passes the environment variables to the Sling subprocess. So one possible solution is to set the environment variables from the partition key. For example:

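import os
from datetime import datetime, timedelta

from dagster import DailyPartitionsDefinition
from dagster_embedded_elt.sling import SlingResource, sling_assets

@sling_assets(
    replication_config=config_dir / "example.yaml",
    dagster_sling_translator=CustomSlingTranslatorMain(),
    partitions_def=DailyPartitionsDefinition(start_date=datetime.now()),
)
def example_sling_assets(context, embedded_elt: SlingResource):
    # The partition key is a string such as "2024-07-01"; parse it before adding a day.
    start = datetime.strptime(context.partition_key, "%Y-%m-%d")
    end = start + timedelta(days=1)
    # os.environ only accepts strings, so format both bounds explicitly.
    os.environ["START_DATE"] = start.strftime("%Y-%m-%d")
    os.environ["END_DATE"] = end.strftime("%Y-%m-%d")
    yield from embedded_elt.replicate(context=context)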
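
The date handling in this approach can be checked on its own, without Dagster or Sling. Below is a minimal sketch, assuming daily partition keys use the default YYYY-MM-DD format. The key arrives as a string, so it has to be parsed before a day can be added, and both bounds have to be strings before they go into os.environ. The helper name partition_env and the DATE_FORMAT constant are hypothetical and not part of either library.

```python
from datetime import datetime, timedelta

# Assumption: daily partition keys look like "2024-07-01".
DATE_FORMAT = "%Y-%m-%d"


def partition_env(partition_key: str) -> dict[str, str]:
    """Map a daily partition key to the START_DATE/END_DATE values used in replication.yml."""
    start = datetime.strptime(partition_key, DATE_FORMAT)
    end = start + timedelta(days=1)  # exclusive upper bound: the following day
    return {
        "START_DATE": start.strftime(DATE_FORMAT),
        "END_DATE": end.strftime(DATE_FORMAT),
    }


print(partition_env("2024-07-01"))
# → {'START_DATE': '2024-07-01', 'END_DATE': '2024-07-02'}
```

Inside the asset, this would presumably be used as os.environ.update(partition_env(context.partition_key)) right before calling replicate. Note that os.environ is process-wide, so this sketch assumes each partition runs in its own process.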

Will keep you posted as we update docs. Please let me know if you make any progress yourself. Thanks!