DataJunction / dj

A metrics platform.
http://datajunction.io

Cubes --> Decomposing Metrics #470


shangyian commented 1 year ago

In some sense, cubes in DJ serve as a caching definition -- that is, someone defines a cube and we'll ingest the underlying cube data into Druid, StarRocks, or some other OLAP database in order to support low-latency metric queries across all defined dimensions.

However, this ingestion is not straightforward, as we can't always dump a metric with an arbitrary aggregation directly into such a database. Instead, the metric expression needs to be broken down into its constituent measures, each of which is ingested into a separate column; the measures are then combined in the post-aggregation stage when the user requests metric data.

A few examples:

| Metric Expression | Measures | Post-Agg Operation |
| --- | --- | --- |
| `SUM(x) AS x` | `[{"name": "x", "agg": "sum", "expr": "x"}]` | N/A |
| `AVG(x) AS x` | `[{"name": "x_sum", "agg": "sum", "expr": "x"}, {"name": "x_count", "agg": "count", "expr": "x"}]` | `x_sum / x_count` |
| `CAST(SUM(COALESCE(is_ready, 0.0)) AS DOUBLE) / COUNT(*)` | `[{"name": "is_ready_ratio_numerator", "agg": "sum", "expr": "CAST(COALESCE(is_ready, 0.0) AS DOUBLE)"}, {"name": "is_ready_ratio_denominator", "agg": "count", "expr": "*"}]` | `is_ready_ratio_numerator / is_ready_ratio_denominator` |
| `SUM(a) + SUM(b)` | `[{"name": "metric_left", "agg": "sum", "expr": "a"}, {"name": "metric_right", "agg": "sum", "expr": "b"}]` | `metric_left + metric_right` |
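
Since `AVG` is the canonical case that can't be re-aggregated directly, here is a minimal sketch of what the measure structure and the `AVG` rewrite rule could look like. The `Measure` class and `decompose_avg` helper are hypothetical illustrations, not DJ's actual code:

```python
# Hypothetical sketch, not DJ's actual implementation.
from dataclasses import dataclass

@dataclass
class Measure:
    name: str  # column name in the ingested cube
    agg: str   # simple aggregation the OLAP database applies
    expr: str  # input expression, computed upstream of ingestion

def decompose_avg(column: str) -> tuple[list[Measure], str]:
    """AVG(x) isn't safely re-aggregable across dimensions, so split it
    into SUM and COUNT measures and recombine them post-aggregation."""
    measures = [
        Measure(name=f"{column}_sum", agg="sum", expr=column),
        Measure(name=f"{column}_count", agg="count", expr=column),
    ]
    post_agg = f"{column}_sum / {column}_count"
    return measures, post_agg

measures, post_agg = decompose_avg("x")
# post_agg == "x_sum / x_count", matching the AVG(x) row above
```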
samredai commented 1 year ago

Being able to decompose the expression into measures makes sense and is probably a useful feature in general (even for display purposes). How does this data structure of a list of maps work out for deeply nested expressions? For example:

```sql
SELECT (
  CAST(
    SUM(
      COALESCE(is_ready, 0.0)
    ) AS DOUBLE
  ) / COUNT(*) AS is_ready_ratio
) + SUM(b)
```

Would that also have a nested measures definition?

```json
[
  {
    "agg": "sum",
    "expr": [
      {
        "agg": "sum",
        "expr": "CAST(COALESCE(is_ready, 0.0) AS DOUBLE)",
        "name": "is_ready_ratio_numerator"
      },
      {
        "agg": "count",
        "expr": "*",
        "name": "is_ready_ratio_denominator"
      }
    ],
    "name": "is_ready_ratio"
  },
  {
    "agg": "sum",
    "expr": "b",
    "name": "metric_right"
  }
]
```

It seems there are two ways we can go here. We can try to represent deeply nested expressions (I'm curious to what extent various real-time analytics databases support complex aggregation expressions). Alternatively, we can add more restrictions to the SQL definition for metrics, i.e., only allow simple aggregation expressions of the form `f(col) [*|/|+|-] f(col)`. More complicated operations can then be pushed to the upstream transform(s).
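
If we did go the restriction route, a first-pass shape check could be as simple as the following sketch (purely illustrative; the regex and helper are hypothetical, and a real implementation would validate the parsed AST instead):

```python
import re

# Hypothetical check for the restricted shape f(col) [*|/|+|-] f(col).
SIMPLE_AGG = r"(?:SUM|COUNT|MIN|MAX)\(\s*\w+\s*\)"
SIMPLE_METRIC = re.compile(
    rf"^\s*{SIMPLE_AGG}(?:\s*[*/+-]\s*{SIMPLE_AGG})?\s*$",
    re.IGNORECASE,
)

def is_simple_metric(sql: str) -> bool:
    return SIMPLE_METRIC.match(sql) is not None

assert is_simple_metric("SUM(a) + SUM(b)")
assert not is_simple_metric("CAST(SUM(x) AS DOUBLE) / COUNT(*)")
```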

shangyian commented 1 year ago

@samredai Actually, because we're decomposing into measures with only simple aggregation expressions, we should be able to plug into any real-time analytics database. Something deeply nested will be decomposed until we reach its simplest aggregations, or we'll raise an error if there's an unsupported aggregation. Then, in the post-agg operation, we'll stitch those nested measures back together.

For example, let's say you have something like:

```sql
(SUM(a) + SUM(b)) / (COUNT(c))
```

This would decompose into:

```json
{
  "measures": [
    {"name": "sum_a", "expr": "a", "agg": "sum"},
    {"name": "sum_b", "expr": "b", "agg": "sum"},
    {"name": "count_c", "expr": "c", "agg": "count"}
  ]
}
```

And in the post-agg operation, we evaluate the measures with (this operation essentially encodes the original nested structure):

```sql
(sum_a + sum_b) / count_c
```

The analytics database only cares about the measures, all of which have been reduced to simple aggregations.
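
As a rough sketch of that recursion (assuming a SQL parser such as sqlglot; this is illustrative, not DJ's actual implementation), each simple aggregate node gets swapped for a measure column, and the rewritten tree becomes the post-agg operation:

```python
# Illustrative only; assumes the sqlglot parsing library.
import sqlglot
from sqlglot import exp

SIMPLE_AGGS = {exp.Sum: "sum", exp.Count: "count", exp.Min: "min", exp.Max: "max"}

def decompose(metric_sql: str) -> tuple[list[dict], str]:
    measures: list[dict] = []

    def swap(node: exp.Expression) -> exp.Expression:
        agg = SIMPLE_AGGS.get(type(node))
        if agg is not None:
            name = f"measure_{len(measures)}"
            measures.append({"name": name, "agg": agg, "expr": node.this.sql()})
            return exp.column(name)  # stand-in column used in the post-agg expr
        if isinstance(node, exp.AggFunc):
            # Composite aggregations like AVG would need their own
            # rewrite rule before reaching this point.
            raise ValueError(f"Unsupported aggregation: {node.sql()}")
        return node

    post_agg = sqlglot.parse_one(metric_sql).transform(swap)
    return measures, post_agg.sql()

measures, post_agg = decompose("(SUM(a) + SUM(b)) / (COUNT(c))")
# measures: sum(a), sum(b), count(c)
# post_agg: "(measure_0 + measure_1) / (measure_2)"
```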

Also, I should clarify that in the example above, `CAST(COALESCE(is_ready, 0.0) AS DOUBLE)` is not actually something we want the OLAP database to execute; the OLAP db only cares about the sum aggregation. `CAST(COALESCE(is_ready, 0.0) AS DOUBLE)` is only relevant for calculating the column in the upstream table that prepares the cube for ingestion into the OLAP database.
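
To illustrate that split, a hypothetical helper could render the upstream SELECT that prepares each measure's input expression as a plain column; the `events` table and `event_date` dimension below are invented for the example:

```python
# Hypothetical helper, not DJ's API: prepare measure inputs upstream so the
# OLAP database only ever applies simple aggregations to plain columns.
def upstream_select(measures: list[dict], source: str, dimensions: list[str]) -> str:
    cols = [f"{m['expr']} AS {m['name']}" for m in measures if m["expr"] != "*"]
    return f"SELECT {', '.join(dimensions + cols)} FROM {source}"

print(upstream_select(
    [{"name": "is_ready_ratio_numerator", "agg": "sum",
      "expr": "CAST(COALESCE(is_ready, 0.0) AS DOUBLE)"}],
    source="events",            # hypothetical upstream table
    dimensions=["event_date"],  # hypothetical dimension column
))
# SELECT event_date, CAST(COALESCE(is_ready, 0.0) AS DOUBLE)
#   AS is_ready_ratio_numerator FROM events
```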

samredai commented 1 year ago

I see, thanks for that explanation! It makes sense to me, then, to lean into the interface set by the analytics databases. 👍

agorajek commented 1 year ago

@shangyian I love this... and I think that your approach of "Something very nested will be decomposed until we reach its simplest aggregations, or if there's some unsupported aggregation, we'll raise an error" will work well most of the time.

I am curious to see how you keep the recipes for aggs like AVG, but I guess your PR will tell me in a moment.