DataJunction / dj

A metrics platform.
http://datajunction.io
MIT License

Solutions for dimensions as of event time #385

Open CircArgs opened 1 year ago

CircArgs commented 1 year ago

In this issue I hope to lay out a potential solution for discussion: how to leverage dimensions that have something like an event time, which DJ would need to account for on behalf of a metric user.

Here is an initial proposal.

Take this example, where event_ts is the time an event occurred and proc_ts is when the data was ingested/processed:

[diagram: Likes joins to Country_Snapshot on country_id and snapshot_ts; Country_Latest is derived from Country_Snapshot]

This would be the query defining Country_Latest:

SELECT country_id,
       country_name,
       region_id,
       proc_ts
FROM   country_snapshot
WHERE  snapshot_ts = (SELECT Max(snapshot_ts)
                      FROM   country_snapshot)

Country_Latest would be a "derived dimension", by which we mean it adheres to rules such as: it selects only from a single parent dimension node, and its query resolves (filters to a single value) one or more of the parent's join keys, here snapshot_ts, so that those keys can be dropped from the derived dimension.

DJ would then allow such derived dimensions to be used in place of their parents. For example, as seen in the diagram above, Likes points to Country_Snapshot.country_id and Country_Snapshot.snapshot_ts; derived dimensions could then be used on the remaining keys, which for Country_Latest is just Country_Latest.country_id. Based on the criteria above, DJ could easily determine that a dimension is derived, and when a user requests a derived dimension, DJ would trace through it to the underlying dimension and use that instead.

To summarize, the main point is that derived dimensions could drop parent dimension join keys, and DJ is capable of managing this (without too much effort to develop such a feature, I believe).
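
As a rough sketch of what that tracing could produce (the likes fact table, a simple count metric, and the exact generated shape are assumptions here, not actual DJ output), a request for a metric on Likes grouped by Country_Latest.country_name might compile to:

SELECT cs.country_name,
       Count(*) AS num_likes
FROM   likes l
       LEFT JOIN country_snapshot cs
              ON l.country_id = cs.country_id
             -- the snapshot_ts key that Country_Latest drops is resolved by
             -- re-applying the "latest snapshot" filter from its definition
             AND cs.snapshot_ts = (SELECT Max(snapshot_ts)
                                   FROM   country_snapshot)
GROUP  BY cs.country_name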

CircArgs commented 1 year ago

@samredai @agorajek @shangyian @betodealmeida

agorajek commented 1 year ago

Hey @CircArgs, thanks for starting this topic. I think we are almost there, but let me add a few more things we should consider in this context. First, though, let me raise my main concern: in the data warehousing world we would want both country_latest and country_snapshot to live under the same roof, since these are just two different ways to get the attributes of the same dimension. Apart from that, you don't want a suffix like _latest to accidentally end up being used for a different reason.

Now onto the missing pieces of the puzzle...

Let's consider the country_snapshot dimension join. When a user asks DJ to use this dimension in a cube, they will expect the cube's daily materialization to wait until the matching snapshot partition of that dimension is available to join with the fact table.

Similarly, if the user chooses country_latest, they will likely want that "latest" to be no older than some number of days, in case the dimension pipeline gets stuck and isn't refreshed for many days. So typically they would say: give me the latest partition of the country dimension, but not older than X days.
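
For concreteness, that constraint might look something like the following in SQL. The seven-day window, the SQLite date syntax, and the assumption that snapshot_ts is an ISO date string are placeholders rather than a proposed DJ feature.

SELECT country_id,
       country_name,
       region_id,
       proc_ts
FROM   country_snapshot
WHERE  snapshot_ts = (SELECT Max(snapshot_ts)
                      FROM   country_snapshot
                      -- "latest, but not older than X days" (X = 7 here)
                      WHERE  snapshot_ts >= Date('now', '-7 days'))

If no partition is recent enough, the subquery returns NULL and the query yields no rows, which matches the intent of refusing to serve stale data.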

Considering all that, my small adjustment to the proposal would be to use metadata attributes on the dimension nodes to describe these cases, rather than new dimension nodes.

For example, the country dimension node would have (pseudocode):

Then we could expose these join methods as modifications of the Dimension node that could be used in place of dimension nodes wherever they are accessed. So, for example, in the cube-building API we could say:

samredai commented 1 year ago

@CircArgs, @agorajek, and I discussed this extensively Friday evening and I believe we got very close to a solution that is both generic and intuitively applicable to solving the “as of event time” vs “as of right now” problem.

What this problem boils down to is that there are optional joins (usually of temporal columns) that are meant to pull in dimension attributes in a variety of ways. As an example, let's say there’s a system where users can change their username and we want to check out the number of logins per user. If the user dimension node is a type 1 slowly changing dimension (SCD1), this is a walk in the park for DJ—the dimension node’s data only contains the latest username for each user and the design doesn't allow looking back at what the username was at a previous point in time.
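
For contrast, here is a sketch of the SCD1 case, reusing the logins and users table names from the SCD2 example below (an assumption here: in the SCD1 version, users holds exactly one row per user_id with the latest username):

select count(l.login_date) as num_logins, u.username
from logins l
left join users u
on l.user_id = u.user_id  -- no temporal condition needed: one row per user
group by u.username;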

But suppose it's stored in a type 2 slowly changing dimension (SCD2) table. Each time the username is changed, an entry is appended to the user table for that user id, containing the new username along with the dates when the change became effective and when it expired (assume only a single name change per day is allowed).

user (dimension node)

user_id  username   effective_date  expired_date
1        ks         20221225        20230301
1        kennyS     20230302        99991231
2        GeT_RiGhT  20220716        99991231
3        kovacs     20220101        20220331
3        lkovacs    20220401        20221230
3        GuardiaN   20221231        99991231

And here's a simple fact table that tracks logins...

logins ([source transform] node)

user_id  login_date
1        20221231
1        20230303
1        20230303
2        20220819
2        20221231
2        20230101
3        20220105
3        20220402
3        20230214

Outside of DJ, you have two main options:

  1. If you just want the grouping to use the user's current username, you have to take the latest set of values from the dimension table for each distinct primary key value.

    select count(l.login_date) as num_logins, u.username
    from logins l
    left join users u
    on l.user_id = u.user_id
    and u.effective_date = (select max(effective_date) from users u2 where u.user_id = u2.user_id)
    group by u.username;
    results:

    num_logins  username
    3           GeT_RiGhT
    3           GuardiaN
    3           kennyS
  2. If you want the grouping to use the user's username as of the day they logged in, you have to use a different join condition where the event time column on the transform node (in this case login_date) is between the start and end time columns on the dimension node (in this case effective_date and expired_date).

    select count(l.login_date) as num_logins, u.username
    from logins l
    left join users u
    on l.user_id = u.user_id
    and l.login_date between u.effective_date and u.expired_date
    group by u.username;
    results:

    num_logins  username
    3           GeT_RiGhT
    1           GuardiaN
    2           kennyS
    1           kovacs
    1           ks
    1           lkovacs

Here's the diff between the two queries:

select count(l.login_date) as num_logins, u.username
from logins l
left join users u
on l.user_id = u.user_id
- and u.effective_date = (select max(effective_date) from users u2 where u.user_id = u2.user_id)
+ and l.login_date between u.effective_date and u.expired_date
group by u.username;

The solution (which I'll describe in a sec 😄) is that since these are really just two very common join patterns, they would be easily producible if we had a way to know which columns from the nodes to plug in. Here are the two patterns above, illustrated using representative names for the keys:

- and u.effective_date = (select max(effective_date) from users u2 where u.user_id = u2.user_id)
+ and <effective_time> = (select max(<effective_time>) from users u2 where u.user_id = u2.user_id)
- and l.login_date between u.effective_date and u.expired_date
+ and <event_time> between <effective_time> and <expired_time>

When building a cube, we really just need to know which column on each metric represents the event_time and which columns on each dimension represent the effective_time and expired_time. If DJ included a system to define globally unique column tags, they could be used to apply these common patterns whenever the columns are properly tagged. This would be a generic column-tagging apparatus that we could even open up for users to utilize (although there are more things to consider there).

To solve the examples in this comment, those three global column keys can be created and columns on the nodes can be tagged appropriately. Then, during a request, users can add an optional arg that provides an array of keys to use. The flow would look like this:

  1. On the user dimension node, the effective_date and expired_date columns will be tagged with the effective_time and expired_time global keys, respectively
  2. On the logins transform node, the login_date column will be tagged with the event_time global key
  3. When a user makes a request similar to: GET /metrics/num_logins/sql/?dimensions=user.username...DJ can default to:
    and u.effective_date = (select max(effective_date) from users u2 where u.user_id = u2.user_id)
  4. A user can get the same result by explicitly telling DJ to utilize the effective_time global key: GET /metrics/num_logins/sql/?dimensions=user.username&global_keys=effective_time (effective time is what's used to get the latest value for each distinct primary key)
  5. A user can also make a request to DJ for the dimension values as of the day they logged in by telling it to utilize the event_time, effective_time, and expired_time global keys: GET /metrics/num_logins/sql/?dimensions=user.username&global_keys=event_time,effective_time,expired_time...DJ would then use:
    and l.login_date between u.effective_date and u.expired_date

The directions for users would be straightforward. First, use the global keys to categorize the temporal columns on the nodes. Then either rely on the default behavior of using only the latest dimension attributes, or add global_keys=event_time,effective_time,expired_time to the request to get the dimension attributes as of the event time.

I captured the tables and queries in the above example as a gist that should work in sqlite. https://gist.github.com/samredai/3d5f2294e6b364ada6ebbf559a1e4e68

shangyian commented 1 year ago

This is great! On the subject of column tagging, I just added a slightly related issue on adding column annotations to let users explicitly annotate their dimensions (https://github.com/DataJunction/dj/issues/386). I think we can roll these concepts together -- i.e., in this case someone would add column annotations marking the columns that represent the different temporal options, but they could use the same system to mark columns as dimensions. (With likely plenty more metadata to come...)

samredai commented 1 year ago

@shangyian yes! Admittedly I was dragged towards taking a more generic approach to this by @CircArgs but it completely makes sense to me now. Thanks for tying in another big use case. 😄

agorajek commented 1 year ago

Nice to see all these ideas coming together. @shangyian I agree that tagging of columns with these "special" attributes should be a generic feature. I also think it would be good to try and define some generic functionality around them. What do you all think about this:

Column tagging with global/system tags:

  1. List of predefined global tags. These could be hardcoded initially and made configurable later, e.g. (per @samredai) event_time, effective_time, expired_time. These tags could have basic constraint support, mostly to enforce uniqueness at different DJ levels, e.g.:
    global_tags:
      event_time:
          level: column
          description: Points to a column which represents the time of the event in a given fact-related node. Used to facilitate proper joins with dimension nodes to match the desired effect.
          uniqueness_scope:
             - node
             - column type  # we could allow `event_time` to be marked on more than one column in a node as long as the data type is different
          # potentially other rules could be offered here, e.g. `allowed_node_types: dimension`
      ...
  2. Users can see the list and definitions of these global tags. Then they can go and add them to their nodes.
  3. DJ could then have predefined ways to join nodes based on these global/system tags. To connect these two worlds (tags and standard joins), we could define join types in terms of these global tags, e.g.:
    • a join_type of latest would not require any event_time tag info on the dimension side, but if one exists DJ would know how to collapse the dim node before the join (see the sketch below the list).
    • a join_type of as-of would require event_time to be present on the metric side, and some combination of event_time, effective_time, expired_time to be present on the dimension side.
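
To make the latest bullet concrete, here is one possible shape of that collapse, borrowing @samredai's users/logins example. This is only a sketch, not DJ's actual output; the window-function form is one option, and the correlated-subquery form shown earlier in the thread is equivalent.

-- join_type = latest: collapse the dimension node to one row per primary key
-- before joining, using the column tagged effective_time to pick that row
with users_latest as (
    select user_id, username
    from (
        select u.*,
               row_number() over (
                   partition by user_id
                   order by effective_date desc
               ) as rn
        from users u
    ) ranked
    where rn = 1
)
select count(l.login_date) as num_logins, u.username
from logins l
left join users_latest u on l.user_id = u.user_id
group by u.username;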

@samredai I think the difference between my suggestion above and your global_keys=event_time,effective_time,expired_time is small, but the cube-making/metric-reading user will have an easier time telling us what type of join to use rather than calling out which keys/tags to utilize, because the list of tags/keys on its own does not define a method for using them in a join. It's a subtle thing, so I would love to chat about it when we talk next week.

CircArgs commented 1 year ago

@shangyian I think I was saying the same thing elsewhere to @samredai: the keys are so generic that they could also have been a route to the same functionality as how we reference dimensions today, with the caveat that using keys that way means you are also tagging the dimensions. Is that what you mean as well?

samredai commented 1 year ago

@agorajek all of that makes sense to me. I was heading towards the idea of easily selectable join types using a pre-defined and required set of global keys but my comment was already getting way too long. ;)

CircArgs commented 1 year ago

[image]