Open garfieldthesam opened 11 months ago
Hi @garfieldthesam! Thanks for opening this issue and sorry about the late response. Since the issue was opened we've actually implemented performance optimizations to the volume anomalies test (and others), so I wonder whether the issue is still relevant.
Closing but feel free to re-open.
Please reopen, this is not fixed (the first CTE should be an aggregated time series, as the OP describes). Tested with 0.15.2.
fyi @haritamar
Hey @data-blade, sorry for the late response 🙏🏻 I am re-opening this issue, and will talk with the team about it.
Is your feature request related to a problem? Please describe.
I need to run volume anomaly tests on very large tables. However, I cannot do so performantly because the compiled query does not pre-aggregate the row count data. For example, this is what the first CTE looks like for a test run on my Databricks cluster:
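The exact compiled SQL depends on the elementary version, but roughly the query opens with a plain `select *` over the monitored table, something like this sketch (the CTE and table names here are placeholders, not the actual compiled identifiers):

```sql
with monitored_table as (
    -- the compiled test starts from the raw table: every row and every column
    select *
    from analytics_catalog.analytics_schema.my_very_large_table
)
-- ...rest of the compiled volume anomalies query
```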
For large tables (especially ones that are very wide), this is prohibitively expensive for two reasons:

1. It does not pre-aggregate the row counts by time bucket, so the full table is scanned and passed to the downstream CTEs
2. `select *` fetches all table columns, which isn't necessary in principle for creating a time series; this is especially costly for columnar databases

Describe the solution you'd like
I'd like to request a rearchitecture of how the volume anomalies code works to improve performance for large tables. For a volume anomaly test, the first CTE may look something like this:
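A minimal sketch of what such a pre-aggregated first CTE could select, assuming a daily time bucket on a hypothetical `updated_at` timestamp column (all names are illustrative):

```sql
-- aggregate up front so downstream CTEs only see one small row per time bucket
select
    date_trunc('day', updated_at) as bucket_start,
    count(*)                      as row_count
from analytics_catalog.analytics_schema.my_very_large_table
group by 1
```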
Describe alternatives you've considered
We already have a solution, but it is cumbersome and not ideal:
We've had to create derived data quality metrics tables, each summarizing a large table's metrics over time. We then run elementary column-level sum tests on the aggregated metrics.
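As a rough illustration of that workaround, the derived metrics table is essentially a dbt model like the following (the source and column names are hypothetical):

```sql
-- derived data-quality metrics model: one row per day for the large source table,
-- so elementary tests only scan this small aggregate instead of the raw table
select
    date_trunc('day', updated_at) as metric_date,
    count(*)                      as row_count
from {{ source('my_source', 'my_very_large_table') }}
group by 1
```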
Additional context
Somewhat related to, but still distinct from, feature request #1158
Would you be willing to contribute this feature?
Absolutely willing if I have the bandwidth (though that seems unlikely in the near future).