dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

Definitions explanation is not clear or understandable (and other suggestions) #25717

Open cobrienbeam opened 3 weeks ago

cobrienbeam commented 3 weeks ago

What's the issue or suggestion?

A Definitions object is a set of Dagster definitions available and loadable by Dagster tools.

This is a circular sentence. If a definitions object is a set of Dagster definitions available then what are the Dagster definitions and what makes them available vs not available? It's totally unclear.

Additionally, the added explanation does not really help explain:

The Definitions object is used to assign definitions to a code location, and each code location can only have a single Definitions object. This object maps to one code location. With code locations, users isolate multiple Dagster projects from each other without requiring multiple deployments. You’ll learn more about code locations a bit later in this lesson.

What are code locations, and why can they have only a single Definitions object? Okay, so the cardinality between Definitions objects and code locations is 1:1, but that doesn't really explain the rest of it.

Additional information

A Definitions object is like a project manifest for Dagster - it bundles together all the assets, jobs, schedules, and other components that make up a single Dagster project. It's like a menu that tells Dagster exactly what's available to run in this specific project. Each separate project (called a code location) needs its own Definitions object, and you can't have multiple Definitions objects in the same location. This setup lets you keep different Dagster projects completely separate from each other, without needing to set up multiple Dagster deployments.

Why do we need this?

Two main reasons:

  1. Project Isolation: Let's say you have two different data projects:
# analytics/definitions.py
defs = Definitions(
    assets=[revenue_dashboard, customer_metrics]
)

# marketing/definitions.py
defs = Definitions(
    assets=[email_campaigns, social_media_stats]
)

Each project has its own Definitions, so they don't interfere with each other.

  2. Discovery: When Dagster starts up, it looks for these Definitions objects to know what assets, jobs, and resources are available to run.
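The discovery step above can be sketched with the standard library alone. This is a simplified model, not Dagster's actual loader: the `load_defs` helper is hypothetical, and a plain dict stands in for the real Definitions object. It only shows the general idea of importing a file by path and picking up a module-level `defs` attribute.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_defs(python_file: str, attribute: str = "defs"):
    """Import a Python file by path and return one module-level attribute.

    Hypothetical helper for illustration; Dagster's own loading logic differs.
    """
    spec = importlib.util.spec_from_file_location(Path(python_file).stem, python_file)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, attribute)


# Stand-in for analytics/definitions.py; a dict replaces the real Definitions object.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "definitions.py"
    path.write_text('defs = {"assets": ["revenue_dashboard", "customer_metrics"]}\n')
    loaded = load_defs(str(path))

print(loaded["assets"])
```

Because the tool looks for one well-known attribute per file, each file (and so each code location) exposes exactly one bundle of definitions.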

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

cobrienbeam commented 3 weeks ago

Additionally, maybe there could be a link out to a page that discusses the use of projects vs deployments. I like how the Next.js and React documentation links out to different sections to discuss the potential tradeoffs of one choice vs another.

In this discussion of when to use additional projects vs additional deployments:

  1. Security/Compliance Requirements:

Company Infrastructure
├── Production Deployment (PCI Compliant)
│   └── Financial Projects
│       ├── payment_processing
│       └── customer_billing
│
└── Standard Deployment
    ├── Marketing Projects
    └── Analytics Projects

  2. Resource Isolation:

Infrastructure
├── Heavy Computing Deployment (32 CPU, 128GB RAM)
│   └── ML Training Projects
│       ├── model_training
│       └── batch_inference
│
└── Light Computing Deployment (4 CPU, 16GB RAM)
    └── ETL Projects
        ├── daily_reports
        └── data_ingestion

  3. Team/Organization Structure:

Company
├── Team A Deployment
│   └── Projects with specific permissions/access
│
└── Team B Deployment
    └── Different security groups/access patterns

When teams need complete isolation or different access patterns.

  4. Environment Criticality:

Business Critical Deployment
├── Revenue impacting jobs
└── Customer-facing data pipelines

Non-Critical Deployment
├── Internal analytics
└── Experimental projects

  5. Scale/Performance:

Using a single deployment has the following benefits:

And then provide more information on workspaces using the definitions.py files instead of __init__.py:

You need to explicitly tell Dagster where to find your definitions through the workspace.yaml file:

load_from:
  - python_file:
      relative_path: marketing/definitions.py
      location_name: marketing_tools

  - python_file:
      relative_path: finance/definitions.py
      location_name: finance_tools

cobrienbeam commented 3 weeks ago

I didn't quite understand the use of the unpacking operator notation in the definition example:

The asterisk * in Python is the "unpacking operator".


# Let's say trip_assets contains these assets:
trip_assets = [taxi_trips, taxi_zones, taxi_trips_file]

# And metric_assets contains:
metric_assets = [revenue_by_day, trips_by_day]

# When you use * it "unpacks" the lists:
defs = Definitions(
    assets=[*trip_assets, *metric_assets]
)

# This is equivalent to writing:
defs = Definitions(
    assets=[
        taxi_trips,
        taxi_zones, 
        taxi_trips_file,
        revenue_by_day,
        trips_by_day
    ]
)

Without the *, you'd get nested lists:


# Without unpacking (WRONG):
defs = Definitions(
    assets=[trip_assets, metric_assets]
)
# This would be like:
# assets=[[taxi_trips, taxi_zones, taxi_trips_file], [revenue_by_day, trips_by_day]]  # Nested lists!

# With unpacking (CORRECT):
defs = Definitions(
    assets=[*trip_assets, *metric_assets]
)
# This correctly flattens to:
# assets=[taxi_trips, taxi_zones, taxi_trips_file, revenue_by_day, trips_by_day]  # Flat list!

You'll often see this pattern when you want to combine multiple lists into a single flat list.

It's like saying "take everything out of these lists and put them all together in one new list."
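The difference can be checked in plain Python without Dagster installed. In this small sketch, strings stand in for the asset objects (an assumption for illustration only); the list names match the example above.

```python
# Strings stand in for the real asset objects.
trip_assets = ["taxi_trips", "taxi_zones", "taxi_trips_file"]
metric_assets = ["revenue_by_day", "trips_by_day"]

flat = [*trip_assets, *metric_assets]       # unpacked: one list of five items
nested = [trip_assets, metric_assets]       # not unpacked: a list of two lists
concatenated = trip_assets + metric_assets  # list concatenation gives the same flat result

print(len(flat))             # 5
print(len(nested))           # 2
print(flat == concatenated)  # True
```

Unpacking is handy when the pieces being combined are not all lists, since `*` works on any iterable while `+` requires both operands to be lists.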

cobrienbeam commented 3 weeks ago

I wish the explanation on os.getenv and EnvVar was a little bit clearer:

With os.getenv:

  1. Start Dagster server
  2. Value of DUCKDB_DATABASE is locked in
  3. Change environment variable
  4. Run asset → still uses old database path
  5. Must restart server to pick up new value

With EnvVar:

  1. Start Dagster server
  2. Run asset → checks DUCKDB_DATABASE value
  3. Change environment variable
  4. Run asset again → uses new database path
  5. No server restart needed!
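The two sequences above come down to eager versus deferred lookup, which can be sketched with the standard library. The `LazyEnv` class here is a hypothetical stand-in and not Dagster's `EnvVar` implementation, and the database paths are made up. The sketch changes the variable inside a single process, which is simpler than a real deployment where the variable must change in the environment of the process that executes the run.

```python
import os

os.environ["DUCKDB_DATABASE"] = "data/old.duckdb"

# Eager: the value is read once, right here, and stored as a plain string.
eager_path = os.getenv("DUCKDB_DATABASE")


class LazyEnv:
    """Hypothetical stand-in: remembers only the variable name, reads it on demand."""

    def __init__(self, name: str) -> None:
        self.name = name

    def get_value(self) -> str:
        return os.environ[self.name]


# Deferred: nothing is read from the environment yet.
lazy_path = LazyEnv("DUCKDB_DATABASE")

# The variable changes after both objects were created.
os.environ["DUCKDB_DATABASE"] = "data/new.duckdb"

print(eager_path)             # data/old.duckdb
print(lazy_path.get_value())  # data/new.duckdb
```

Dagster's `EnvVar` may differ in detail; the sketch only illustrates why a deferred lookup sees the updated value while an eagerly read string does not.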

It's especially useful for:

lydialimlh commented 3 weeks ago

You seem to have understood the unpacking operator of python quite well, you've correctly explained how it works. (I'm just a rando, not from the Dagster team)

cobrienbeam commented 3 weeks ago

You seem to have understood the unpacking operator of python quite well, you've correctly explained how it works. (I'm just a rando, not from the Dagster team)

That was my proposal for the documentation in a callout or side link, etc. regarding the asterisk notation in the example.