DataJunction / dj

A metrics platform.
http://datajunction.io
MIT License
35 stars 15 forks source link

Allow the metric expression to include columns that come from linked Dimension node #1188

Open anhqle opened 2 weeks ago

anhqle commented 2 weeks ago

Currently, metric expression can only include columns from the Transform node.

Example of why this would be useful:

  1. user_latest_status = MAX_BY(user_status, time_column_from_dim_node)
  2. engagement_on_android = SUM(engagement) FILTER (WHERE device_name_from_dim_node = 'android')
samredai commented 2 weeks ago

Hmm, that would mean the column from the dim_node is required and so it's reasonable to require setting dim_node as a required dimension beforehand right? That would help limit the validation so we don't have to search the entire dimension dag to find this column. We could just look at the attributes (columns) on the dimensions found in the required dimensions for that metric.

Which brings me to another question. What if multiple dimension nodes have that particular column name? Do we just fail validation or SQL generation and raise some error indicating an ambiguous value in the metric definition?

anhqle commented 2 weeks ago

that would mean the column from the dim_node is required and so it's reasonable to require setting dim_node as a required dimension beforehand right?

If by required dimension, you mean colloquially that the dim node must be joined for the metric to be computable, then yes, it is required.

If by required dimension you mean the DJ-definition i.e. "a grain at which the metric has to be consumed" (this name is frankly still unintuitive to me...), then no. In the engagement_on_android example, that is a metric that can be aggregated at various grain, not limited to device_name_from_dim_node.

What if multiple dimension nodes have that particular column name? Do we just fail validation or SQL generation and raise some error indicating an ambiguous value in the metric definition?

Failing seems fine to me. Requiring the user to write the full column name could also be okay, except that it'd be weird for the Transform's column to not be fully qualified, but the Dimension's column is.