cube-js / cube

📊 Cube — The Semantic Layer for Building Data Applications
https://cube.dev

Preaggregations calculating wildly wrong numbers, often doubled, at random #7854

Open benswinburne opened 6 months ago

benswinburne commented 6 months ago

Apologies for the title; I couldn't work out how to phrase it more descriptively, as the issue I'm encountering does seem random at best.

Describe the bug I have an old Cube installation (0.27.30) on a Node server and am attempting to upgrade to a Dockerised version (0.34.55) with Cube Store 0.34.55. It's running, but the numbers returned by pre-aggregations are intermittently incorrect: sometimes doubled, sometimes off by an unknown multiplier < 2 (I haven't found any examples greater than double). I can't find a pattern.

From here on I'll refer to old cube as 0.27.30 (Node) and new cube as 0.34.55 (Docker w/ Cube Store) for ease.

For example

JSON & SQL Queries

```json
{
  "measures": ["Events.impressions"],
  "timeDimensions": [
    {
      "dimension": "Events.dt",
      "granularity": "day",
      "dateRange": ["2023-08-31", "2024-03-31"]
    }
  ],
  "filters": [
    {
      "member": "Events.creativeId",
      "operator": "equals",
      "values": ["42290d1b-86d4-4ebb-ab1b-xxxxxxxxxxxx"]
    }
  ]
}
```

Notice it's querying the `events__performance20240227` pre-aggregation:

```sql
SELECT
  `events__dt_day` `events__dt_day`,
  sum(`events__impressions`) `events__impressions`
FROM dev_pre_aggregations.events_performance20240227 AS `events__performance20240227`
WHERE (`events__creative_id` = ?)
  AND (
    `events__dt_day` >= to_timestamp(?)
    AND `events__dt_day` <= to_timestamp(?)
  )
GROUP BY 1
ORDER BY 1 ASC
LIMIT 10000
```
| date | new cube | old cube | athena direct | comment |
| --- | --- | --- | --- | --- |
| 2023-08-31T00:00:00.000 | 2382 | 2382 | 2382 | |
| 2023-09-01T00:00:00.000 | 8882 | 4441 | 4441 | new is double |
| 2023-09-02T00:00:00.000 | 2998 | 1499 | 1499 | new is double |
| 2023-09-03T00:00:00.000 | 6736 | 3368 | 3368 | new is double |
| 2023-09-04T00:00:00.000 | 5084 | 2542 | 2542 | new is double |

Another sample from the same data set:

| date | new cube | old cube | athena direct |
| --- | --- | --- | --- |
| 2024-01-12T00:00:00.000 | 23005 | 23005 | 23005 |
| 2024-01-13T00:00:00.000 | 22514 | 22514 | 22514 |
| 2024-01-14T00:00:00.000 | 17995 | 17995 | 17995 |
| 2024-01-15T00:00:00.000 | 16142 | 16142 | 16142 |
| 2024-01-16T00:00:00.000 | 19533 | 19533 | 19533 |

In the example above, clearly some are doubled.

If, however, I copy-paste the cube (call it Events2), delete the pre-aggregations from its configuration, and run the same queries, they all match up.

SQL Query

```sql
SELECT
  date_trunc(
    'day',
    CAST(
      date_add(
        'minute',
        timezone_minute(date_parse(dt, '%Y-%m-%d') AT TIME ZONE 'UTC'),
        date_add(
          'hour',
          timezone_hour(date_parse(dt, '%Y-%m-%d') AT TIME ZONE 'UTC'),
          date_parse(dt, '%Y-%m-%d')
        )
      ) AS TIMESTAMP
    )
  ) "events2__dt_day",
  count(
    CASE WHEN ("events2".event = 'impression') THEN "events2".event_id END
  ) "events2__impressions"
FROM (SELECT * FROM events.events WHERE 1 = 1) AS "events2"
WHERE (
    date_parse(dt, '%Y-%m-%d') >= from_iso8601_timestamp(?)
    AND date_parse(dt, '%Y-%m-%d') <= from_iso8601_timestamp(?)
  )
  AND ("events2".creative_id = ?)
GROUP BY 1
ORDER BY 1 ASC
LIMIT 10000
```
| date | new cube no preagg | new cube | old cube | athena direct |
| --- | --- | --- | --- | --- |
| 2023-08-31T00:00:00.000 | 2382 | 2382 | 2382 | 2382 |
| 2023-09-01T00:00:00.000 | 4441 | 8882 | 4441 | 4441 |
| 2023-09-02T00:00:00.000 | 1499 | 2998 | 1499 | 1499 |
| 2023-09-03T00:00:00.000 | 3368 | 6736 | 3368 | 3368 |
| 2023-09-04T00:00:00.000 | 2542 | 5084 | 2542 | 2542 |

But in other examples, typically across wider ranges, the numbers are more varied. I suspect what is happening is that only some days are doubled, which produces a less obviously identifiable error once the data is summed.

For example

| new cube | old cube | athena direct |
| --- | --- | --- |
| 661352920 | 412704189 | 412704189 |
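The less-than-double total is consistent with only a subset of days being doubled. A quick numeric sketch (the daily values below are hypothetical, only the two reported totals come from the table above):

```javascript
// Reported totals from the wider date range
const newCube = 661352920;
const athena = 412704189;
console.log((newCube / athena).toFixed(2)); // "1.60" - between 1x and 2x

// Hypothetical daily values: doubling some (but not all) days inflates
// the overall sum by a factor strictly between 1 and 2.
const daily = [4441, 1499, 3368, 2542];
const doubled = new Set([1, 3]); // hypothetical subset of doubled days
const observed = daily.reduce((s, v, i) => s + (doubled.has(i) ? 2 * v : v), 0);
const actual = daily.reduce((s, v) => s + v, 0);
console.log(observed / actual > 1 && observed / actual < 2); // true
```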

What makes this really strange is that, continuing to copy-paste the cube to try new things, I can copy my cube to another one (call it Events3) and change nothing: same pre-aggregations, just a new cube name.

The same dates and the same query then return the correct results. However, there are many other isolated, seemingly random occurrences of bad data where the numbers are just wrong.

| date | new cube (events3) | new cube no preagg (events2) | new cube | old cube | athena direct |
| --- | --- | --- | --- | --- | --- |
| 2023-08-31T00:00:00.000 | 2382 | 2382 | 2382 | 2382 | 2382 |
| 2023-09-01T00:00:00.000 | 4441 | 4441 | 8882 | 4441 | 4441 |
| 2023-09-02T00:00:00.000 | 1499 | 1499 | 2998 | 1499 | 1499 |
| 2023-09-03T00:00:00.000 | 3368 | 3368 | 6736 | 3368 | 3368 |
| 2023-09-04T00:00:00.000 | 2542 | 2542 | 5084 | 2542 | 2542 |

All that seems to have happened by my copying the cube is that it built the pre-aggregations again and got this particular set of dates correct this time.

To Reproduce Steps to reproduce the behavior:

I can't consistently reproduce the behaviour. Advice on how I might be able to do this would be appreciated, I can update my comment if anyone has any ideas.

Expected behavior

Pre-aggregations should return correct numbers.

Minimally reproducible Cube Schema

You can use selects without tables in order to achieve that as follows.

I tried reproducing this with static data per the example in this template, but it doesn't seem to happen with a small dataset directly in the cube, only(?) when I'm using Athena with millions of rows.
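For reference, a "select without tables" source in Athena/Presto syntax can be sketched as follows (the rows and column values here are made up to match the shape of the schema below):

```sql
-- Hypothetical static rows standing in for events.events
SELECT * FROM (
  VALUES
    ('e1', 'impression',  'c1', false, '2023-09-01'),
    ('e2', 'impression',  'c1', true,  '2023-09-01'),
    ('e3', 'interaction', 'c1', false, '2023-09-02')
) AS t (event_id, event, creative_id, touch_device, dt)
```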

I'm using Athena as the data source.

```js
cube(`Events3`, {
  data_source: `events`,

  sql: `SELECT * FROM events.events WHERE ${FILTER_PARAMS.Events.dt.filter(
    (from, to) =>
      `date_parse(dt, '%Y-%m-%d') >= from_iso8601_timestamp(${from})
        AND date_parse(dt, '%Y-%m-%d') <= from_iso8601_timestamp(${to})`
  )}`,

  measures: {
    count: {
      type: `count`,
    },

    impressions: {
      type: `count`,
      filters: [{ sql: `${CUBE}.event = 'impression'` }],
    },

    interactions: {
      type: `count`,
      filters: [{ sql: `${CUBE}.event = 'interaction'` }],
    },

    interactionRate: {
      type: `number`,
      sql: `100.0 * ${interactions} / nullif(${impressions}, 0)`,
      format: `percent`,
    },
  },

  dimensions: {
    eventId: {
      sql: `event_id`,
      type: `string`,
      primaryKey: true,
    },

    time: {
      sql: `time`,
      type: `string`,
    },

    event: {
      sql: `event`,
      type: `string`,
    },

    creativeId: {
      sql: `creative_id`,
      type: `string`,
    },

    touchDevice: {
      sql: `touch_device`,
      type: `boolean`,
    },

    dt: {
      sql: `date_parse(dt, '%Y-%m-%d')`,
      type: `time`,
    },
  },

  pre_aggregations: {
    performance: {
      type: `rollup`,
      measures: [impressions, interactions],
      dimensions: [event, creativeId, touchDevice],
      time_dimension: dt,
      granularity: `day`,
      partition_granularity: `month`,
      external: true,
      refresh_key: {
        every: `15 * * * *`,
        incremental: true,
        update_window: `1 day`,
      },
      scheduled_refresh: true,
      build_range_start: {
        sql: `SELECT to_iso8601(cast('2023-01-01 00:00:00.000 UTC' as timestamp))`,
      },
      build_range_end: {
        sql: `SELECT to_iso8601(cast('2024-01-31 23:59:59.000 UTC' as timestamp))`,
      },
    },
  },
});
```

This cube has been simplified slightly by removing some measures: a couple of counts/sums, but unrelated to these queries. I can add them back in with some obfuscation if anyone thinks they might be the cause, but there's quite a lot of proprietary material, so it was easiest to remove them. For example, here's one; presumably nothing too complicated.

    totalMyDimension: {
      sql: `${myDimension}`,
      type: `sum`,
      filters: [
        {
          sql: `${CUBE}.something = true
              and ${CUBE}.value < x
              and ${CUBE}.value > y
              and ${CUBE}.another = 'z'`,
        },
      ],
    },

Version: 0.34.55

Additional context

Nothing else changed between old cube and new cube except the Cube version and the pre-aggregation storage moving over to Cube Store.

All the raw data is stored in S3. The raw data date (dt) is stored in UTC.

paveltiunov commented 5 months ago

@benswinburne I feel we might need a dataset here to reproduce.

igorlukanin commented 3 months ago

@benswinburne By any chance, would you be able to provide a dataset and help create a repro?

benswinburne commented 3 months ago

Unfortunately not, no. This does not happen with small sample datasets, it only occurs when connected to my full size dataset in Athena, and even in that case it does not produce consistently reproducible effects.

For example, if I run it from dates X to Y across verticals C and D:

The data on C might be correct, but some (never all) of the dates on D will be incorrect.

If I run the same again (renaming the pre-aggregation to force a rebuild), some of the dates on C might be incorrect, but D might be completely fine this time around.

The source data is the same, and a parallel-running older version of Cube (pre Cube Store) always produces the correct results.

I'm not really in a position to give you access to the data, for obvious reasons, and in some fake datasets I generated locally the problem didn't occur. That leads me to believe it's simply the size of the data, or the connection/driver to Athena itself (my attempted data reproductions were local `WITH data as x` queries or in Postgres).

waralexrom commented 3 months ago

@benswinburne How many partitions does the query affect (at the Cube Store level, these are the tables in the UNION query)? Can you execute the SQL query where the duplication occurs separately for each partition? Is the error reproduced within a single partition?

What I'm trying to understand is whether the error is at the level of executing the query in Cube Store, or at the level of constructing the partitions (tables in Cube Store) from the source data.
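A per-partition check along these lines could help narrow it down (the table name below is illustrative; the actual partition table names appear in the UNION branches of the generated Cube Store query):

```sql
-- Run the failing aggregation against one partition table at a time,
-- substituting each actual partition table name in turn.
SELECT
  `events__dt_day`,
  sum(`events__impressions`) `events__impressions`
FROM dev_pre_aggregations.events_performance20240227
WHERE `events__creative_id` = '42290d1b-86d4-4ebb-ab1b-xxxxxxxxxxxx'
GROUP BY 1
ORDER BY 1;
-- If a single partition already returns doubled numbers, the problem is in
-- how that partition was built from the source data; if every partition is
-- correct in isolation but the combined query doubles, the problem is at
-- query execution time in Cube Store.
```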