cube-js / cube

📊 Cube — Universal semantic layer platform for AI, BI, spreadsheets, and embedded analytics
https://cube.dev

Cube Cloud failing pre-aggregation warm up - requirements.txt not running? #8923

Open johache opened 2 weeks ago


**Describe the bug**
Using Cube Cloud, I think there might be something wrong with the pre-aggregation warm up instances.

**To Reproduce**
Steps to reproduce the behavior:

  1. Define your requirements.txt to install databricks-sdk
    databricks-sdk
  2. Define a scheduled_refresh_contexts which depends on databricks in cube.py
    
    from cube import config
    from databricks.sdk import WorkspaceClient

    ...

    @config('scheduled_refresh_contexts')
    def scheduled_refresh_contexts() -> list[object]:
        databricks_workspace_client = WorkspaceClient(
            host=os.environ.get('DATABRICKS_HOST'),
            token=os.environ.get('CUBEJS_DB_DATABRICKS_TOKEN'),
        )

        # Fetch the list of schemas within the environment's catalog
        catalog_name = os.environ.get('CUBEJS_DB_DATABRICKS_CATALOG')
        schemas = databricks_workspace_client.schemas.list(catalog_name=catalog_name)

        # ...
        return security_contexts_array

  3. Enable pre-aggregation warm up in Cube Cloud
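
The part of `scheduled_refresh_contexts` that turns schemas into contexts is elided in the report. For readers trying to reproduce this, here is one plausible shape, not the actual code. It assumes each schema name is used directly as `company_id`, which is what the `sql_table` template in the schema further down suggests. `build_refresh_contexts` is a hypothetical helper name, and it takes plain strings so it runs without the Databricks SDK.

```python
from typing import Iterable


def build_refresh_contexts(schema_names: Iterable[str]) -> list[dict]:
    """Map each schema name to one scheduled-refresh context (hypothetical helper)."""
    contexts = []
    # Deduplicate and sort so the refresh order is stable between runs.
    for name in sorted(set(schema_names)):
        # Assumption: the schema name doubles as the tenant's company_id.
        contexts.append({"securityContext": {"company_id": name}})
    return contexts


if __name__ == "__main__":
    print(build_refresh_contexts(["acme", "globex", "acme"]))
```

In `cube.py`, the names would come from the result of `schemas.list(...)` rather than from a literal list.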

**Expected behavior**
- Dependencies from requirements.txt are installed before any instance runs
- After the environment variables are updated on Cube Cloud, all contexts defined by scheduled_refresh_contexts should compile and build their pre-aggregations, and any query hitting a pre-aggregation should succeed

**Actual behavior**
- This runs fine on my worker and API instances, but not on my pre-aggregation warm up instances
- It's a little bit hard to debug, because the pre-aggregation warm up instance only seems to exist for a fraction of a second, but when I do catch it, it says that databricks-sdk is not installed
- I can definitely see that, at least in my build job, databricks.sdk is installed
- The result is that NO pre-aggregations get built unless the refresh_key triggers a build, which can take time and leave the instance broken for extended periods
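
Since the warm up instance is gone almost immediately, it may help to have `cube.py` log which interpreter and search path it is running with before the import fails. The following is a minimal sketch using only the standard library. `report_dependency` is a hypothetical helper, not part of Cube, and whether stderr from the warm up instance shows up in the Cube Cloud logs is an assumption.

```python
import importlib.util
import sys


def report_dependency(module_name: str) -> bool:
    """Return True if module_name can be imported; otherwise log interpreter details."""
    try:
        # For a dotted name, find_spec imports the parent package first and
        # raises ModuleNotFoundError when that parent is missing.
        found = importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        found = False
    if not found:
        print(f"[deps] {module_name!r} is not importable", file=sys.stderr)
        print(f"[deps] interpreter: {sys.executable}", file=sys.stderr)
        print(f"[deps] search path: {sys.path}", file=sys.stderr)
    return found


if __name__ == "__main__":
    report_dependency("databricks.sdk")
```

Calling this at the top of `cube.py`, before `from databricks.sdk import WorkspaceClient`, would show whether the warm up instance uses a different Python environment from the build job.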

**Screenshots**
![Screenshot 2024-10-18 at 1 12 42 PM](https://github.com/user-attachments/assets/647d3515-e6a8-4775-9eac-34034051f64a)
![Screenshot 2024-10-18 at 1 14 27 PM](https://github.com/user-attachments/assets/9e984334-47c0-409b-b739-45af6399f6c1)
![Screenshot 2024-10-18 at 1 15 44 PM](https://github.com/user-attachments/assets/fc114203-b2b6-42c7-be1c-063d782b84d2)

**Minimally reproducible Cube Schema**
Adding a cut out from my schema, but I don't think this is schema dependent. The important part is the requirements.txt and cube.py posted above

```yaml
cubes:
  - name: gold_journal_lines
    sql_table: "{{ COMPILE_CONTEXT.securityContext.company_id | safe }}.gold__journal_lines"

    dimensions:
      - name: id
        sql: id
        type: string
        primary_key: true
      - name: net_amount
        sql: net_amount
        type: number
      - name: posted_on
        sql: posted_on
        type: time
    measures:
      - name: sum_net_amount
        type: sum
        sql: net_amount

    pre_aggregations:
      # Rollup Pre-aggregation with accounts and counterparties
      - name: journal_line_acc_cpt_rollup
        measures:
          - gold_journal_lines.sum_net_amount
        time_dimension: CUBE.posted_on
        granularity: month
        partition_granularity: year
```

**Version:**
Tried with 0.35.55, 1.0.1, 1.1.0

Happy to provide any additional details