cube-js / cube

📊 Cube — The Semantic Layer for Building Data Applications
https://cube.dev
Other
17.69k stars 1.75k forks source link

[Feature request] Pre-aggregation within window #8185

Open itestyoy opened 4 months ago

itestyoy commented 4 months ago

Hi! Feature Request: Add an option for pre-aggregation, similar to window functions. For example, over the last 90 days, and calculate pre-aggregation only for this period. For calculations outside of this period, use a non-pre-aggregated query. This is useful when one part of the query involves slowly changing dimensions and when users typically analyze data for a specific period

igorlukanin commented 3 months ago

Hi @itestyoy 👋

Thanks for the suggestion!

Currently, you can use build_range_start and build_range_end options to specify the date range for a pre-aggregation to target. However, Cube will not return results outside of the defined build range, even when lambda pre-aggregations are used.

I was actually hoping that lambda pre-aggregations would allow to query data "earlier than in the rollup" but apparently they only allow to query "later than in the rollup":

cubes:
  - name: build_range
    sql: >
      SELECT 1 AS value, '2024-01-01'::TIMESTAMP AS time UNION ALL
      SELECT 2 AS value, '2024-02-10'::TIMESTAMP AS time UNION ALL
      SELECT 3 AS value, '2024-03-20'::TIMESTAMP AS time

    dimensions:
      - name: time
        sql: time
        type: time

    measures:
      - name: value_sum
        sql: value
        type: sum

    pre_aggregations:
      - name: lambda
        type: rollup_lambda
        build_range_start:
          sql: SELECT '2024-01-01'::TIMESTAMP
        union_with_source_data: true
        rollups:
          - main

      - name: main
        measures:
          - value_sum
        time_dimension: time
        granularity: day
        partition_granularity: day
        build_range_start:
          sql: SELECT '2024-02-01'::TIMESTAMP
        refresh_key:
          every: 1 hour

Query:

{
  "limit": 5000,
  "measures": [
    "build_range.value_sum"
  ],
  "timeDimensions": [
    {
      "dateRange": [
        "2024-01-01",
        "2024-01-31"
      ],
      "dimension": "build_range.time",
      "granularity": "day"
    }
  ]
}

The SQL looks promising but still would return am empty result set:

SELECT
  `build_range__time_day` `build_range__time_day`,
  sum(`build_range__value_sum`) `build_range__value_sum`
FROM
  (
    SELECT
      *
    FROM
      prod_pre_aggregations.build_range_main20240320_nv14fex1_rb5yafsp_1j43vsf
    UNION ALL
    SELECT
      *
    FROM
      lambda_prod_pre_aggregations_build_range_main
  ) AS `build_range__lambda`
WHERE
  (
    `build_range__time_day` >= to_timestamp('2024-01-01T00:00:00.000')
    AND `build_range__time_day` <= to_timestamp('2024-01-31T23:59:59.999')
  )
GROUP BY
  1
ORDER BY
  1 ASC
LIMIT
  5000

Let me tag @paveltiunov to see what he thinks might be a solution for this use case.

mauriciocirelli commented 1 month ago

Hi!

I am interested in this solution also. I think the pre-aggregation matching algorithm should consider the build ranges... If the query falls outside the build range, then send it to the upstream data source.