cube-js / cube

📊 Cube — The Semantic Layer for Building Data Applications
https://cube.dev

Support window functions in Cube Store #8932

Open ChloeBellm opened 1 week ago

ChloeBellm commented 1 week ago

Describe the bug

I think this is a bug based on the error message, but if this isn't supported, I would love to hear workaround suggestions, as this is currently preventing us from using pre-aggregations.

I want to create a pre-aggregation with column A and then query column B, a measure based on a window function over A, but I get the following error:

Internal: Internal error: unsupported operation. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

The use case for this is to get a "totals" row, so my window function is just sum(sum(A)) over ()

To Reproduce

Steps to reproduce the behavior:

  1. Create a cube with columns A and B, where B is a measure defined as:
    total: {
      sql: (CUBE) => `sum(${CUBE[`A`]}) over ()`,
      type: `number`,
      title: `A Total`,
    }
  2. Create a rollup pre-aggregation with A as the only measure and no dimensions
  3. Try querying B; it generates the following SQL:
SELECT
  sum(`my_cube__A`) `my_cube__A`,
  sum(sum(`my_cube__A`)) over () `my_cube__B`
FROM
  prod_pre_aggregations.my_cube_rollup AS `my_cube__rollup`

This SQL is correct, but running it throws the error above.

Expected behavior

I expected it to run the SQL query above and give me a total column as column B.

Example expected output:

A | B
1 | 12
3 | 12
8 | 12

Here B is the total of everything in column A. Assume filters have already been applied, which is why the total needs to be calculated at query time, after the pre-aggregation has been built.
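As a rough illustration of what `sum(sum(A)) over ()` is expected to produce, here is a minimal JavaScript sketch that emulates the empty window over an already aggregated result set. The helper name `withGrandTotal` and the row shape are assumptions made for this example; they are not part of Cube or Cube Store.

```javascript
// Per-row values of measure A, taken from the example output above.
const rows = [{ A: 1 }, { A: 3 }, { A: 8 }];

// Emulates `sum(sum(A)) over ()`: an empty OVER clause treats the whole
// result set as one window, so every row receives the same grand total.
// `withGrandTotal` is a hypothetical helper, not a Cube API.
function withGrandTotal(rows, measure, totalName) {
  const total = rows.reduce((acc, row) => acc + row[measure], 0);
  return rows.map((row) => ({ ...row, [totalName]: total }));
}

const result = withGrandTotal(rows, "A", "B");
console.log(result);
```

Each row keeps its own A and gains the same B, which matches the table in the expected output.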

Screenshots

Error on playground:

[Screenshot 2024-11-08 at 11 29 51]

Minimally reproducible Cube Schema

cube(`Orders`, {
  sql: `
  select 1 as id, 100 as amount, 'new' status
  UNION ALL
  select 2 as id, 200 as amount, 'new' status
  UNION ALL
  select 3 as id, 300 as amount, 'processed' status
  UNION ALL
  select 4 as id, 500 as amount, 'processed' status
  UNION ALL
  select 5 as id, 600 as amount, 'shipped' status
  `,
  measures: {
    totalAmount: {
      sql: `amount`,
      type: `sum`,
    },
    grandTotalAmount: {
      sql: `sum(${CUBE[`totalAmount`]}) over ()`,
      type: `number`,
    },
  },
  dimensions: {
    status: {
      sql: `status`,
      type: `string`,
    },
  },
});

Version: 0.36.7

Additional context

If there are any other suggestions as to how to do totals (and sub totals) we would love to hear these too!

igorlukanin commented 1 week ago

Thanks for a very elaborate question @ChloeBellm 🙌

A couple of points here:

I hope this helps.

ChloeBellm commented 1 week ago

Thanks @igorlukanin! Good to hear this is on the roadmap.

How would running a query with fewer dimensions work if we want to apply measure filters at the dimension breakdown the user has selected?

For example, if we have some data:

A | B
apple | 20
orange | 30

Where A is a dimension, and B is a "sum" measure.

The first query asks for dimension A and measure B with a filter on B > 25, and this returns just orange. The total we'd like to show is therefore 30. Would the second "totals" query just ask for B with the same B > 25 filter? If so, the filter would check 50 > 25, which is true, so it would return 50 as the total, which is not correct.
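The mismatch can be reproduced with a small in-memory simulation. This is only a sketch: `query` is a hypothetical helper that groups rows by the requested dimensions, sums B, and then applies the measure filter to the aggregated groups (as a HAVING clause would). It does not call Cube.

```javascript
// Source rows from the example: dimension A, additive measure B.
const source = [
  { A: "apple", B: 20 },
  { A: "orange", B: 30 },
];

// Hypothetical helper: group by `dimensions`, sum B per group, then keep
// only the groups whose aggregated B exceeds `minB` (HAVING-like filter).
function query(rows, dimensions, minB) {
  const groups = new Map();
  for (const row of rows) {
    const key = dimensions.map((d) => row[d]).join("|");
    if (!groups.has(key)) {
      const group = { B: 0 };
      for (const d of dimensions) group[d] = row[d];
      groups.set(key, group);
    }
    groups.get(key).B += row.B;
  }
  return [...groups.values()].filter((g) => g.B > minB);
}

// Query 1: breakdown by A. Only orange (30) passes the filter.
const breakdown = query(source, ["A"], 25);

// Query 2: same filter, no dimensions. The filter sees 50, so 50 is returned.
const naiveTotal = query(source, [], 25);

// The total the table should show: the sum of the rows that survived query 1.
const desiredTotal = breakdown.reduce((acc, g) => acc + g.B, 0);

console.log(breakdown, naiveTotal, desiredTotal);
```

The second query evaluates the filter at a different granularity than the first, which is why its result (50) differs from the sum of the visible rows (30).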

Please let me know if I should open a new issue for this question.

igorlukanin commented 1 week ago

@ChloeBellm Oh, I see now. You have a measure filter, and this basically renders my "less dimensions" workaround useless.

How do you plan to consume the data? It looks like you might need to calculate totals on the client side then, until we get either multi-stage calculations or window function support in Cube Store.

ChloeBellm commented 1 week ago

@igorlukanin We want to show users a table of data and a totals row. The main challenge is with calculated measures: we have measures that are one dimension divided by the other, for example, so to calculate totals client side we'd have to specify how those calculations should work. We had actually done this previously, but we have many metrics, and defining the logic for those in two places is not ideal! We also use Cube's pagination, which means the total of the data returned wouldn't necessarily be the whole total. Another idea I had was to join a separate cube with FILTER_PARAMS, but I'm not sure how this would work with pre-aggregations?
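To make the duplication concern concrete, here is a minimal sketch of a client-side totals row with one ratio measure. The measure names (`revenue`, `orders`, `avgOrderValue`) and the `totalsRow` helper are hypothetical and chosen only for illustration. The point is that the ratio's formula has to be restated on the client, because summing the per-row ratios would give the wrong total.

```javascript
// Hypothetical result rows. `revenue` and `orders` are additive measures;
// `avgOrderValue` is a calculated measure defined as revenue / orders.
const page = [
  { status: "new", revenue: 300, orders: 2, avgOrderValue: 150 },
  { status: "processed", revenue: 800, orders: 2, avgOrderValue: 400 },
  { status: "shipped", revenue: 600, orders: 1, avgOrderValue: 600 },
];

// Each calculated measure needs its formula repeated here, which is the
// "logic in two places" problem described above.
const calculated = {
  avgOrderValue: (totals) => totals.revenue / totals.orders,
};

// Sum the additive measures first, then derive calculated measures from
// those sums instead of adding up the per-row ratios.
function totalsRow(rows, additive, calculated) {
  const totals = {};
  for (const name of additive) {
    totals[name] = rows.reduce((acc, row) => acc + row[name], 0);
  }
  for (const [name, formula] of Object.entries(calculated)) {
    totals[name] = formula(totals);
  }
  return totals;
}

const totals = totalsRow(page, ["revenue", "orders"], calculated);
console.log(totals);
```

This only yields the true total when `page` holds every row of the result. With pagination, the sums cover the current page only, which is the second limitation mentioned above.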