cube-js / cube

📊 Cube — The Semantic Layer for Building Data Applications
https://cube.dev

Preaggregations calculating wildly wrong numbers, often doubled, at random #7854

Open benswinburne opened 6 months ago

benswinburne commented 6 months ago

Apologies for the title; I couldn't work out how to phrase it more descriptively, as the issue I'm encountering does seem random at best.

Describe the bug I have an old Cube installation (0.27.30) on a Node server and am attempting to upgrade to a Dockerised version (0.34.55) with Cube Store 0.34.55. It's running, but the numbers returned by pre-aggregations are intermittently incorrect: sometimes doubled, sometimes off by an unknown multiplier < 2 (I haven't found any examples greater than double). I can't find a pattern.

From here on I'll refer to old cube as 0.27.30 (Node) and new cube as 0.34.55 (Docker w/ Cube Store) for ease.

For example

JSON & SQL Queries

```json
{
  "measures": ["Events.impressions"],
  "timeDimensions": [
    {
      "dimension": "Events.dt",
      "granularity": "day",
      "dateRange": ["2023-08-31", "2024-03-31"]
    }
  ],
  "filters": [
    {
      "member": "Events.creativeId",
      "operator": "equals",
      "values": ["42290d1b-86d4-4ebb-ab1b-xxxxxxxxxxxx"]
    }
  ]
}
```

Notice it's querying the `events__performance20240227` pre-aggregation:

```sql
SELECT
  `events__dt_day` `events__dt_day`,
  sum(`events__impressions`) `events__impressions`
FROM dev_pre_aggregations.events_performance20240227 AS `events__performance20240227`
WHERE (`events__creative_id` = ?)
  AND (
    `events__dt_day` >= to_timestamp(?)
    AND `events__dt_day` <= to_timestamp(?)
  )
GROUP BY 1
ORDER BY 1 ASC
LIMIT 10000
```
| date | new cube | old cube | athena direct | comment |
| --- | --- | --- | --- | --- |
| 2023-08-31T00:00:00.000 | 2382 | 2382 | 2382 | |
| 2023-09-01T00:00:00.000 | 8882 | 4441 | 4441 | new is double |
| 2023-09-02T00:00:00.000 | 2998 | 1499 | 1499 | new is double |
| 2023-09-03T00:00:00.000 | 6736 | 3368 | 3368 | new is double |
| 2023-09-04T00:00:00.000 | 5084 | 2542 | 2542 | new is double |

Another sample from the same data set:

| date | new cube | old cube | athena direct |
| --- | --- | --- | --- |
| 2024-01-12T00:00:00.000 | 23005 | 23005 | 23005 |
| 2024-01-13T00:00:00.000 | 22514 | 22514 | 22514 |
| 2024-01-14T00:00:00.000 | 17995 | 17995 | 17995 |
| 2024-01-15T00:00:00.000 | 16142 | 16142 | 16142 |
| 2024-01-16T00:00:00.000 | 19533 | 19533 | 19533 |

In the example above, clearly some are doubled.

If, however, I copy-paste the cube (call it Events2), delete the pre-aggregations from its configuration, and run the same queries, they all match up.

SQL Query

```sql
SELECT
  date_trunc(
    'day',
    CAST(
      date_add(
        'minute',
        timezone_minute(date_parse(dt, '%Y-%m-%d') AT TIME ZONE 'UTC'),
        date_add(
          'hour',
          timezone_hour(date_parse(dt, '%Y-%m-%d') AT TIME ZONE 'UTC'),
          date_parse(dt, '%Y-%m-%d')
        )
      ) AS TIMESTAMP
    )
  ) "events2__dt_day",
  count(
    CASE WHEN ("events2".event = 'impression') THEN "events2".event_id END
  ) "events2__impressions"
FROM (SELECT * FROM events.events WHERE 1 = 1) AS "events2"
WHERE (
    date_parse(dt, '%Y-%m-%d') >= from_iso8601_timestamp(?)
    AND date_parse(dt, '%Y-%m-%d') <= from_iso8601_timestamp(?)
  )
  AND ("events2".creative_id = ?)
GROUP BY 1
ORDER BY 1 ASC
LIMIT 10000
```
| date | new cube no preagg | new cube | old cube | athena direct |
| --- | --- | --- | --- | --- |
| 2023-08-31T00:00:00.000 | 2382 | 2382 | 2382 | 2382 |
| 2023-09-01T00:00:00.000 | 4441 | 8882 | 4441 | 4441 |
| 2023-09-02T00:00:00.000 | 1499 | 2998 | 1499 | 1499 |
| 2023-09-03T00:00:00.000 | 3368 | 6736 | 3368 | 3368 |
| 2023-09-04T00:00:00.000 | 2542 | 5084 | 2542 | 2542 |

But in other examples, typically across wider ranges, the numbers are more varied. I suspect what is happening is that only some days are doubled, which produces a less obviously identifiable error once the data is summed.

For example

| new cube | old cube | athena direct |
| --- | --- | --- |
| 661352920 | 412704189 | 412704189 |
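The less-than-double total is consistent with only a subset of days being doubled. A quick numeric sketch (the daily values below are hypothetical, only the two reported totals come from the table above):

```javascript
// Reported totals from the wider date range
const newCube = 661352920;
const athena = 412704189;
console.log((newCube / athena).toFixed(2)); // "1.60" - between 1x and 2x

// Hypothetical daily values: doubling some (but not all) days inflates
// the overall sum by a factor strictly between 1 and 2.
const daily = [4441, 1499, 3368, 2542];
const doubled = new Set([1, 3]); // hypothetical subset of doubled days
const observed = daily.reduce((s, v, i) => s + (doubled.has(i) ? 2 * v : v), 0);
const actual = daily.reduce((s, v) => s + v, 0);
console.log(observed / actual > 1 && observed / actual < 2); // true
```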

What makes this really strange is that, continuing to copy-paste the cube to try new things, I can copy my cube to another one (call it Events3) and change nothing: same pre-aggregations, just a new cube name.

The same dates and the same query then return the correct results. However, there are many other isolated, seemingly random occurrences of bad data where the numbers are just wrong.

| date | new cube (events3) | new cube no preagg (events2) | new cube | old cube | athena direct |
| --- | --- | --- | --- | --- | --- |
| 2023-08-31T00:00:00.000 | 2382 | 2382 | 2382 | 2382 | 2382 |
| 2023-09-01T00:00:00.000 | 4441 | 4441 | 8882 | 4441 | 4441 |
| 2023-09-02T00:00:00.000 | 1499 | 1499 | 2998 | 1499 | 1499 |
| 2023-09-03T00:00:00.000 | 3368 | 3368 | 6736 | 3368 | 3368 |
| 2023-09-04T00:00:00.000 | 2542 | 2542 | 5084 | 2542 | 2542 |

All that seems to have happened by my copying the cube is that it built the pre-aggregations again and got this particular set of dates correct this time.

To Reproduce Steps to reproduce the behavior:

I can't consistently reproduce the behaviour. Advice on how I might be able to do this would be appreciated, I can update my comment if anyone has any ideas.

Expected behavior

Pre-aggregations should return correct numbers.

Minimally reproducible Cube Schema

You can use selects without tables in order to achieve that as follows.

I tried reproducing this with static data per the example in this template, but it doesn't seem to happen with a small dataset directly in the cube, only(?) when I'm using Athena with millions of rows.
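For reference, a "select without tables" source in Athena/Presto syntax can be sketched as follows (the rows and column values here are made up to match the shape of the schema below):

```sql
-- Hypothetical static rows standing in for events.events
SELECT * FROM (
  VALUES
    ('e1', 'impression',  'c1', false, '2023-09-01'),
    ('e2', 'impression',  'c1', true,  '2023-09-01'),
    ('e3', 'interaction', 'c1', false, '2023-09-02')
) AS t (event_id, event, creative_id, touch_device, dt)
```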

I'm using Athena as the data source.

```js
cube(`Events3`, {
  data_source: `events`,

  sql: `SELECT * FROM events.events WHERE ${FILTER_PARAMS.Events.dt.filter(
    (from, to) =>
      `date_parse(dt, '%Y-%m-%d') >= from_iso8601_timestamp(${from})
        AND date_parse(dt, '%Y-%m-%d') <= from_iso8601_timestamp(${to})`
  )}`,

  measures: {
    count: {
      type: `count`,
    },

    impressions: {
      type: `count`,
      filters: [{ sql: `${CUBE}.event = 'impression'` }],
    },

    interactions: {
      type: `count`,
      filters: [{ sql: `${CUBE}.event = 'interaction'` }],
    },

    interactionRate: {
      type: `number`,
      sql: `100.0 * ${interactions} / nullif(${impressions}, 0)`,
      format: `percent`,
    },
  },

  dimensions: {
    eventId: {
      sql: `event_id`,
      type: `string`,
      primaryKey: true,
    },

    time: {
      sql: `time`,
      type: `string`,
    },

    event: {
      sql: `event`,
      type: `string`,
    },

    creativeId: {
      sql: `creative_id`,
      type: `string`,
    },

    touchDevice: {
      sql: `touch_device`,
      type: `boolean`,
    },

    dt: {
      sql: `date_parse(dt, '%Y-%m-%d')`,
      type: `time`,
    },
  },

  pre_aggregations: {
    performance: {
      type: `rollup`,
      measures: [impressions, interactions],
      dimensions: [event, creativeId, touchDevice],
      time_dimension: dt,
      granularity: `day`,
      partition_granularity: `month`,
      external: true,
      refresh_key: {
        every: `15 * * * *`,
        incremental: true,
        update_window: `1 day`,
      },
      scheduled_refresh: true,
      build_range_start: {
        sql: `SELECT to_iso8601(cast('2023-01-01 00:00:00.000 UTC' as timestamp))`,
      },
      build_range_end: {
        sql: `SELECT to_iso8601(cast('2024-01-31 23:59:59.000 UTC' as timestamp))`,
      },
    },
  },
});
```

This cube has been simplified slightly by removing some measures: a couple of counts/sums, but unrelated to these queries. I can add them back in with some obfuscation if anyone thinks they might be the cause, but there's quite a lot of proprietary material, so it was easiest to remove them. For example, here's one; presumably nothing too complicated.

    totalMyDimension: {
      sql: `${myDimension}`,
      type: `sum`,
      filters: [
        {
          sql: `${CUBE}.something = true
              and ${CUBE}.value < x
              and ${CUBE}.value > y
              and ${CUBE}.another = 'z'`,
        },
      ],
    },

Version: 0.34.55

Additional context

Nothing else changed between old cube and new cube except the Cube version and the pre-aggregation storage moving over to Cube Store.

All the raw data is stored in S3. The raw data date (dt) is stored in UTC.

paveltiunov commented 5 months ago

@benswinburne I feel we might need a dataset here to reproduce.

igorlukanin commented 3 months ago

@benswinburne By any chance, would you be able to provide a dataset and help create a repro?

benswinburne commented 3 months ago

Unfortunately not, no. This does not happen with small sample datasets, it only occurs when connected to my full size dataset in Athena, and even in that case it does not produce consistently reproducible effects.

For example, if I run it from dates X to Y across verticals C and D:

The data on C might be correct, but some (never all) of the dates on D will be incorrect.

If I run the same again (renaming the pre-aggregation to force a rebuild), some of the dates on C might be incorrect, but D might be completely fine this time around.

The source data is the same, and a parallel-running older version of Cube (pre Cube Store) always produces the correct results.

I'm not really in a position to give you access to the data, for obvious reasons, and in some fake datasets I generated locally the problem didn't occur. That leads me to believe it's simply the size of the data, or the connection/driver to Athena itself (my attempted data reproductions were local `WITH data as x` queries or in Postgres).

waralexrom commented 3 months ago

@benswinburne How many partitions does the query affect (at the Cube Store level, these are the tables in the UNION query)? Can you execute the SQL query where the duplication occurs separately for each partition? Is the error reproduced within a single partition?

What I'm trying to understand is whether the error is at the level of executing the query in Cube Store, or at the level of constructing the partitions (tables in Cube Store) from the source data.
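A per-partition check along these lines could help narrow it down (the table name below is illustrative; the actual partition table names appear in the UNION branches of the generated Cube Store query):

```sql
-- Run the failing aggregation against one partition table at a time,
-- substituting each actual partition table name in turn.
SELECT
  `events__dt_day`,
  sum(`events__impressions`) `events__impressions`
FROM dev_pre_aggregations.events_performance20240227
WHERE `events__creative_id` = '42290d1b-86d4-4ebb-ab1b-xxxxxxxxxxxx'
GROUP BY 1
ORDER BY 1;
-- If a single partition already returns doubled numbers, the problem is in
-- how that partition was built from the source data; if every partition is
-- correct in isolation but the combined query doubles, the problem is at
-- query execution time in Cube Store.
```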