Open benswinburne opened 6 months ago
@benswinburne I feel we might need data set here to reproduce.
@benswinburne By any chance, would you be able to provide a dataset and help create a repro?
Unfortunately not, no. This does not happen with small sample datasets, it only occurs when connected to my full size dataset in Athena, and even in that case it does not produce consistently reproducible effects.
For example if I run it from dates X-Y, across Verticals C and D.
The data on C might be correct, but some (never all) of the dates on D will be incorrect.
If i run the same again (by renaming the preaggregation to force a rebuild).
Some of the dates on C might be incorrect, but D might be completely fine this time around.
The source data is the same, and a parallel running older version of cube pre Cubestore always produces the correct results.
I'm not really in a position to give you access to the data for obvious reasons unfortunately and in some fake datasets i generated locally the problem didn't occur which leads me to believe it's simply the size of the data, or the connection/driver to Athena itself (my attempted data reproductions were local WITH data as x
or in Postgres).
@benswinburne How many partitions does the query affect (at the cubestore level, these are the tables in the UNION query)? Can you execute the sql query where duplication occurs separately for each partition? - Is the error reproduced within a single partition? The question I'm trying to understand is - is the error at the level of executing the query in cubestore, or at the level of constructing partitions (tables in cubestore) from the source data
Apologies for the title, I couldn't work out how to phrase it in a more descriptive manner as the issue i'm encountering does seem random at best.
Describe the bug I have an old cube installation 0.27.30 on a Node server, attempting to upgrade to a Dockerised version 0.34.55, with Cubestore 0.34.55. It's running, but the numbers returned by pre-aggregations are intermittently incorrect - sometimes by double, sometimes an unknown multiplier < 2 (i haven't found any > double examples). I can't find a pattern.
From here in I'll refer to old cube as 0.27.30 (node) and new cube as 0.34.55 (docker w/ cubestore) for ease.
For example
JSON & SQL Queries
``` { "measures": [ "Events.impressions" ], "timeDimensions": [ { "dimension": "Events.dt", "granularity": "day", "dateRange": [ "2023-08-31", "2024-03-31" ] } ], "filters": [ { "member": "Events.creativeId", "operator": "equals", "values": [ "42290d1b-86d4-4ebb-ab1b-xxxxxxxxxxxx" ] } ] } ``` Notice it's querying the `events__performance20240227` preaggregation ``` SELECT `events__dt_day` `events__dt_day`, sum(`events__impressions`) `events__impressions` FROM dev_pre_aggregations.events_performance20240227 AS `events__performance20240227` WHERE (`events__creative_id` = ?) AND ( `events__dt_day` >= to_timestamp(?) AND `events__dt_day` <= to_timestamp(?) ) GROUP BY 1 ORDER BY 1 ASC LIMIT 10000 ```In the example above, clearly some are doubled.
If however I copy paste the Cube (Call it Events2), delete the preaggregations from the configuration and run the same queries, they all match up.
SQL Query
``` SELECT date_trunc( 'day', CAST( date_add( 'minute', timezone_minute(date_parse(dt, '%Y-%m-%d') AT TIME ZONE 'UTC'), date_add( 'hour', timezone_hour(date_parse(dt, '%Y-%m-%d') AT TIME ZONE 'UTC'), date_parse(dt, '%Y-%m-%d') ) ) AS TIMESTAMP ) ) "events2__dt_day", count( CASE WHEN ("events2".event = 'impression') THEN "events2".event_id END ) "events2__impressions" FROM ( SELECT * FROM events.events WHERE 1 = 1 ) AS "events2" WHERE ( date_parse(dt, '%Y-%m-%d') >= from_iso8601_timestamp(?) AND date_parse(dt, '%Y-%m-%d') <= from_iso8601_timestamp(?) ) AND ("events2".creative_id = ?) GROUP BY 1 ORDER BY 1 ASC LIMIT 10000 ```But in other examples, typically across wider ranges the numbers are more varied. I suspect what is happening is some days are doubled which just causes a less obviously identifiable problem across sums of data.
For example
What makes this really strange, is that if I continue the copy paste the cube to try new things is i can copy my cube to another one (call it Events3) and change nothing, same preaggregations, just a new cube name...
The same dates, same query, returns the correct results. However, there are many other isolated seemingly random occurrences of bad data where the numbers are just wrong.
All that seems to have happened by me copying the cube is that it built preaggregations again and got this particular set of dates correct this time.
To Reproduce Steps to reproduce the behavior:
I can't consistently reproduce the behaviour. Advice on how I might be able to do this would be appreciated, I can update my comment if anyone has any ideas.
Expected behavior
Preaggregations to return correct numbers.
Minimally reproducible Cube Schema
I tried reproducing this with static data per the example in this template but it doesn't seem to happen with a small dataset directly in the cube, only(?) when i'm using Athena with millions of rows.
I'm using
Data source is Athena
This cube has been simplified slightly, removing some measures. A couple of counts/sums but unrelated to these queries. I can add them back in with some obfuscation if necessary if anyone thinks they might the cause but there's quite a lot of proprietary bits so it was easiest to remove. For example, here's one - presumably nothing too complicated .
Version: 0.34.55
Additional context
Nothing else changed between old cube and new cube except for the cube version and the preagg storage moving over to cubestore
All the raw data is stored in s3 Raw data date (dt) is stored in UTC