Open garfieldthesam opened 11 months ago
Hi @garfieldthesam! Thanks for opening this issue and sorry about the late response. Since the issue was opened we've actually implemented performance optimizations to the volume anomalies test (and others), so I wonder whether the issue is still relevant.
Closing but feel free to re-open.
Please reopen, this is not fixed (the first CTE should be an aggregated time series, as the OP describes). Tested with 0.15.2.
fyi @haritamar
Hey @data-blade, sorry for the late response 🙏🏻 I am re-opening this issue, and will talk with the team about it.
Is your feature request related to a problem? Please describe.
I need to run volume anomaly tests on very large tables. However, I cannot do so performantly because the compiled query does not pre-aggregate the row count data. For example, this is what the first CTE looks like for a test run on my Databricks cluster:
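The exact compiled SQL depends on the elementary version, but roughly the query opens with a plain `select *` over the monitored table, something like this sketch (the CTE and table names here are placeholders, not the actual compiled identifiers):

```sql
with monitored_table as (
    -- the compiled test starts from the raw table: every row and every column
    select *
    from analytics_catalog.analytics_schema.my_very_large_table
)
-- ...rest of the compiled volume anomalies query
```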
For large tables (especially ones that are very wide), this is prohibitively expensive for two reasons:

1. It does not pre-aggregate the row counts by time bucket, so the full table is scanned and passed to the downstream CTEs
2. `select *` fetches all table columns, which isn't necessary in principle for creating a time series; this is especially costly for columnar databases

Describe the solution you'd like
I'd like to request a rearchitecture of how the volume anomalies code works to improve performance for large tables. For a volume anomaly test, the first CTE may look something like this:
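A minimal sketch of what such a pre-aggregated first CTE could select, assuming a daily time bucket on a hypothetical `updated_at` timestamp column (all names are illustrative):

```sql
-- aggregate up front so downstream CTEs only see one small row per time bucket
select
    date_trunc('day', updated_at) as bucket_start,
    count(*)                      as row_count
from analytics_catalog.analytics_schema.my_very_large_table
group by 1
```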
Describe alternatives you've considered
We already have a solution, but it is cumbersome and not ideal:
We've had to create derived data quality metrics tables, each summarizing a large table's metrics over time. We then run elementary column-level sum tests on the aggregated metrics.
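As a rough illustration of that workaround, the derived metrics table is essentially a dbt model like the following (the source and column names are hypothetical):

```sql
-- derived data-quality metrics model: one row per day for the large source table,
-- so elementary tests only scan this small aggregate instead of the raw table
select
    date_trunc('day', updated_at) as metric_date,
    count(*)                      as row_count
from {{ source('my_source', 'my_very_large_table') }}
group by 1
```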
Additional context
Somewhat related to, but still distinct from, feature request #1158
Would you be willing to contribute this feature?
Absolutely willing if I have the bandwidth (though that seems unlikely in the near future).