CircArgs opened this issue 1 year ago
@samredai @agorajek @shangyian @betodealmeida
Hey @CircArgs thanks for starting this topic. I think we are almost there, but let me add a few more things we should consider in this context. First, let me raise my main concern: in the data warehousing world we would want both the `country_latest` and `country_snapshot` dimensions to live under the same roof. These are just two different ways to get the attributes of the same dimension. Apart from that, you don't want someone to accidentally use suffixes like `_latest` for a different reason.
Now onto the missing pieces of the puzzle...
Let's consider the `country_snapshot` dimension join. When a user asks DJ to use this dimension in a cube, they will expect that the cube's daily materialization waits until the matching snapshot partition of that dimension is available to join with the fact table.
Similarly, if the user chooses `country_latest`, they will likely want that latest partition to be no older than some number of days, in case the dimension pipeline gets stuck and isn't refreshed for a long time. So typically they would say: give me the latest partition of the `country` dimension, but not older than X days.
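The "latest partition, but not older than X days" guard can be sketched directly in SQL. Below is a minimal runnable sketch using sqlite; the `country` table, its `snapshot_date` partition column, and the threshold values are all illustrative, not DJ behavior:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE country (country_id INT, name TEXT, snapshot_date TEXT);
-- two full snapshots of the dimension
INSERT INTO country VALUES
  (1, 'US', '2023-03-01'), (2, 'CA', '2023-03-01'),
  (1, 'US', '2023-03-05'), (2, 'CA', '2023-03-05');
""")

# "Give me the latest partition of country, but not older than X days":
# pick the max snapshot_date, then return no rows at all if that snapshot
# is staler than the allowed threshold (signalling a stuck pipeline).
max_age_days = 3
today = "2023-03-06"
rows = conn.execute(
    """
    SELECT country_id, name, snapshot_date
    FROM country
    WHERE snapshot_date = (SELECT max(snapshot_date) FROM country)
      AND julianday(?) - julianday(snapshot_date) <= ?
    ORDER BY country_id
    """,
    (today, max_age_days),
).fetchall()
print(rows)  # the 2023-03-05 snapshot is 1 day old, within the threshold
```

If the latest snapshot were older than `max_age_days`, the query would return nothing, which a materialization job could treat as "wait/fail" rather than silently joining stale attributes.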
Considering all that, my small adjustment to the proposal would be to use metadata attributes on the dimension nodes to describe these cases, rather than new dimension nodes.
For example, the `country` dimension node would have (pseudocode):

- `sql`: dimension query to infer the schema
- `pk`: to indicate the dimension keys, say: `country_id, snapshot_date`
- `join_methods`:

```sql
SELECT * FROM <self>
SELECT * FROM <self> WHERE snapshot_date = (SELECT max(snapshot_date) FROM country_snapshot)
SELECT *, snapshot_date AS as_of_date FROM <self>
```
Then we could expose these join methods as modifications to the dimension node that could be used in place of dimension nodes wherever they are accessed. For example, in the cube building API, we could say:

```
m1, m2, m3
d1@latest, d2@as_of_date, d3[@default]
```
@CircArgs, @agorajek, and I discussed this extensively Friday evening and I believe we got very close to a solution that is both generic and intuitively applicable to solving the “as of event time” vs “as of right now” problem.
What this problem boils down to is that there are optional joins (usually of temporal columns) that are meant to pull in dimension attributes in a variety of ways. As an example, let's say there’s a system where users can change their username and we want to check out the number of logins per user. If the user dimension node is a type 1 slowly changing dimension (SCD1), this is a walk in the park for DJ—the dimension node’s data only contains the latest username for each user and the design doesn't allow looking back at what the username was at a previous point in time.
But suppose it's stored in a type 2 slowly changing dimension (SCD2) table. Each time the username is changed, an entry is appended to the user table for that user id, containing the new username along with the dates when the change became effective and when it expired (assume here only a single name change per day is allowed).
**user** (dimension node)

| user_id | username | effective_date | expiration_date |
|---|---|---|---|
| 1 | ks | 20221225 | 20230301 |
| 1 | kennyS | 20230302 | 99991231 |
| 2 | GeT_RiGhT | 20220716 | 99991231 |
| 3 | kovacs | 20220101 | 20220331 |
| 3 | lkovacs | 20220401 | 20221230 |
| 3 | GuardiaN | 20221231 | 99991231 |
And here's a simple fact table that tracks logins...
**logins** ([source | transform] node)

| user_id | login_date |
|---|---|
| 1 | 20221231 |
| 1 | 20230303 |
| 1 | 20230303 |
| 2 | 20220819 |
| 2 | 20221231 |
| 2 | 20230101 |
| 3 | 20220105 |
| 3 | 20220402 |
| 3 | 20230214 |
Outside of DJ, you have two main options:
If you just want the grouping to use the user's current username, you have to take the latest set of values from the dimension table for each distinct primary key value.
```sql
select count(l.login_date) as num_logins, u.username
from logins l
left join users u
  on l.user_id = u.user_id
  and u.effective_date = (select max(effective_date) from users u2 where u.user_id = u2.user_id)
group by u.username;
```
results

| num_logins | username |
|---|---|
| 3 | GeT_RiGhT |
| 3 | GuardiaN |
| 3 | kennyS |
If you want the grouping to use the user's username as of the day they logged in, you have to use a different join condition, where the event time column on the transform node (in this case `login_date`) is between the start and end time columns on the dimension node (in this case `effective_date` and `expiration_date`).
```sql
select count(l.login_date) as num_logins, u.username
from logins l
left join users u
  on l.user_id = u.user_id
  and l.login_date between u.effective_date and u.expiration_date
group by u.username;
```
results

| num_logins | username |
|---|---|
| 3 | GeT_RiGhT |
| 1 | GuardiaN |
| 2 | kennyS |
| 1 | kovacs |
| 1 | ks |
| 1 | lkovacs |
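Both joins can be reproduced end-to-end in sqlite (the gist linked at the end of this comment does the same). A self-contained sketch, with `ORDER BY` added for stable output and `expiration_date` used for the end-of-validity column as in the table above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INT, username TEXT, effective_date INT, expiration_date INT);
INSERT INTO users VALUES
  (1, 'ks',        20221225, 20230301),
  (1, 'kennyS',    20230302, 99991231),
  (2, 'GeT_RiGhT', 20220716, 99991231),
  (3, 'kovacs',    20220101, 20220331),
  (3, 'lkovacs',   20220401, 20221230),
  (3, 'GuardiaN',  20221231, 99991231);

CREATE TABLE logins (user_id INT, login_date INT);
INSERT INTO logins VALUES
  (1, 20221231), (1, 20230303), (1, 20230303),
  (2, 20220819), (2, 20221231), (2, 20230101),
  (3, 20220105), (3, 20220402), (3, 20230214);
""")

# "As of right now": join every login to the latest row per user_id.
latest = conn.execute("""
    SELECT count(l.login_date) AS num_logins, u.username
    FROM logins l
    LEFT JOIN users u
      ON l.user_id = u.user_id
     AND u.effective_date = (SELECT max(effective_date) FROM users u2
                             WHERE u.user_id = u2.user_id)
    GROUP BY u.username ORDER BY u.username
""").fetchall()

# "As of event time": join each login to the row valid on login_date.
as_of = conn.execute("""
    SELECT count(l.login_date) AS num_logins, u.username
    FROM logins l
    LEFT JOIN users u
      ON l.user_id = u.user_id
     AND l.login_date BETWEEN u.effective_date AND u.expiration_date
    GROUP BY u.username ORDER BY u.username
""").fetchall()

print(latest)
print(as_of)
```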
Here's the diff between the two queries:

```diff
 select count(l.login_date) as num_logins, u.username
 from logins l
 left join users u
   on l.user_id = u.user_id
-  and u.effective_date = (select max(effective_date) from users u2 where u.user_id = u2.user_id)
+  and l.login_date between u.effective_date and u.expiration_date
 group by u.username;
```
The solution (which I'll describe in a sec 😄) is that since these are really just two very common join patterns, they are easily producible if we have a way to know which columns from the nodes to plug in. Here are the two patterns above, illustrated using representative names for the keys.

```diff
- and u.effective_date = (select max(effective_date) from users u2 where u.user_id = u2.user_id)
+ and <effective_time> = (select max(<effective_time>) from users u2 where u.user_id = u2.user_id)
```

```diff
- and l.login_date between u.effective_date and u.expiration_date
+ and <event_time> between <effective_time> and <expired_time>
```
When building a cube, we really just need to know which column on each metric represents the `event_time` and which columns on each dimension represent the `effective_time` and `expired_time`. If DJ included a system to define globally unique column tags, they could be utilized to apply these common patterns whenever the columns are properly tagged. This would be a generic column tagging apparatus that we could even open up for users to utilize (although there are more things to consider there).
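As a sketch of how the tagging could drive SQL generation: given the tagged columns, pick one of the two join patterns. `build_join_condition` and its input shapes are hypothetical illustrations, not an actual DJ API:

```python
# Hypothetical sketch: given the tagged columns, emit one of the two join
# patterns. Neither this function nor its inputs are real DJ APIs.
def build_join_condition(tags: dict, global_keys: set) -> str:
    """tags maps each global key to an (alias, column) pair."""
    eff = tags["effective_time"]
    if {"event_time", "effective_time", "expired_time"} <= global_keys:
        ev, exp = tags["event_time"], tags["expired_time"]
        # as-of-event-time: range join against the validity window
        return (f"{ev[0]}.{ev[1]} BETWEEN "
                f"{eff[0]}.{eff[1]} AND {exp[0]}.{exp[1]}")
    # default: collapse the dimension to its latest row per primary key
    # (the users/user_id names are hardcoded for this example)
    return (f"{eff[0]}.{eff[1]} = (SELECT max({eff[1]}) FROM users u2 "
            f"WHERE {eff[0]}.user_id = u2.user_id)")

tags = {
    "event_time": ("l", "login_date"),
    "effective_time": ("u", "effective_date"),
    "expired_time": ("u", "expiration_date"),
}
print(build_join_condition(tags, set()))
print(build_join_condition(tags, {"event_time", "effective_time", "expired_time"}))
```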
To solve the examples in this comment, those three global column keys can be created and columns on the nodes can be tagged appropriately. Then, during a request, users can add an optional arg that provides an array of keys to use. The flow would look like this:

1. Tag the dimension node's validity columns with the `effective_time` and `expired_time` global keys, respectively
2. Tag the metric's event time column with the `event_time` global key
3. For `GET /metrics/num_logins/sql/?dimensions=user.username`, DJ can default to:
   ```sql
   and u.effective_date = (select max(effective_date) from users u2 where u.user_id = u2.user_id)
   ```
4. The same applies to a request with the `effective_time` global key: `GET /metrics/num_logins/sql/?dimensions=user.username&global_keys=effective_time` (effective time is what's used to get the latest value for each distinct primary key)
5. For a request with the `event_time`, `effective_time`, and `expired_time` global keys: `GET /metrics/num_logins/sql/?dimensions=user.username&global_keys=event_time,effective_time,expired_time`, DJ would then use:
   ```sql
   and l.login_date between u.effective_date and u.expiration_date
   ```

The directions for users would be straightforward. First use the global keys to categorize the temporal columns on the nodes. Then either use the default behavior of only using the latest dimension attributes, or add `global_keys=event_time,effective_time,expired_time` to the request to get the dimension attributes as of the event time.
I captured the tables and queries in the above example as a gist that should work in sqlite. https://gist.github.com/samredai/3d5f2294e6b364ada6ebbf559a1e4e68
This is great! On the subject of column tagging, I just added an issue that's slightly related on adding column annotations to let users explicitly annotate their dimensions (https://github.com/DataJunction/dj/issues/386). I think we can roll these concepts in together -- i.e., in this case, someone would add column annotations marking the columns that represent the different temporal options, but they could use this same system to mark columns as dimensions. (With likely plenty more metadata to come...)
@shangyian yes! Admittedly I was dragged towards taking a more generic approach to this by @CircArgs but it completely makes sense to me now. Thanks for tying in another big use case. 😄
Nice to see all these ideas coming together. @shangyian I agree that tagging of columns with these "special" attributes should be a generic feature. I also think it would be good to try and define some generic functionality around them. What do you all think about this:
Column tagging with global/system tags: `event_time`, `effective_time`, `expired_time`. These tags could have basic constraint support, mostly to enforce uniqueness at different DJ levels, e.g.:

```yaml
global_tags:
  event_time:
    level: column
    description: Points to a column which represents the time of the event in a given fact-related node. Used to facilitate proper joins with dimension nodes to match the desired effect.
    uniqueness_scope:
      - node
      - column_type  # we could allow `event_time` to be marked on more than one column in a node as long as the data type is different
    # potentially other rules could be offered here, e.g. `allowed_node_types: dimension`
  ...
```
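The `uniqueness_scope` constraint above could be enforced with a simple check at node-registration time. A hypothetical sketch (the data shapes and function name are illustrative, not DJ's actual model):

```python
from collections import Counter

# Hypothetical sketch: enforce `uniqueness_scope: [node, column_type]`,
# i.e. a tag may appear at most once per (node, column data type).
def validate_tag_uniqueness(columns: list) -> list:
    """columns: list of (column_name, data_type, tags) tuples for one node.
    Returns the (tag, data_type) pairs that appear more than once."""
    counts = Counter()
    for _name, dtype, tags in columns:
        for tag in tags:
            counts[(tag, dtype)] += 1
    return [key for key, n in counts.items() if n > 1]

node_columns = [
    ("login_date", "date", ["event_time"]),
    ("login_ts", "timestamp", ["event_time"]),   # OK: different data type
    ("created_date", "date", ["event_time"]),    # violation: second date-typed event_time
]
print(validate_tag_uniqueness(node_columns))  # [('event_time', 'date')]
```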
Join types:

- `latest` would not require any `event_time` tag info on the dimension side, but if one exists, DJ would know how to collapse the dim node before the join.
- `as-of` would require `event_time` to be present on the metric side, and some combination of `event_time`, `effective_time`, and `expired_time` to be present on the dimension side.

@samredai I think the difference between my above suggestion and your `global_keys=event_time,effective_time,expired_time` is small, but I think the cube-making/metric-reading user will have an easier time telling us what type of join to use, rather than calling out the keys/tags to utilize, because the list of tags/keys on its own does not define a method for using them in a join. It's a subtle thing, so I would love to chat about it when we talk next week.
@shangyian I think I was saying the same thing elsewhere to @samredai: the keys are so generic that they could also have been a route to the same functionality as how we reference dimensions today, with the caveat that this means you are also tagging the dimensions if you use keys. Is that what you mean as well?
@agorajek all of that makes sense to me. I was heading towards the idea of easily selectable join types using a pre-defined and required set of global keys but my comment was already getting way too long. ;)
In this issue I hope:

- to convey a potential solution
- to discuss the issue of leveraging dimensions that have something like an event time that DJ would need to account for on behalf of a metric user.

Here is an initial proposal.

Take this example where `event_ts` is the time an event occurred and `proc_ts` is when the data was ingested/processed.

This would be the query that would define `Country_Latest`.

`Country_Latest` would be a "derived dimension", by which we mean it adheres to rules such as:

DJ would then allow such derived dimensions to be used in place of their parents. For example, as seen in the diagram above, `Likes` points to `Country_Snapshot.country_id` and `Country_Snapshot.snapshot_ts`, and in doing so would also allow derived dimensions to be used on the remaining keys, which for `Country_Latest` is `Country_Latest.country_id`. DJ would easily be able to determine, based on the criteria above, that a dimension is derived, and when a user requests a derived dimension, DJ will be able to trace through it to the underlying dimension and use it instead.

To summarize, the main point here is that derived dimensions could drop parent dimension join keys, and DJ is capable of managing this (without too much effort to develop such a feature, I believe).
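The defining query for `Country_Latest` isn't shown in this excerpt, but based on the description (collapse `Country_Snapshot` to its most recent snapshot and drop `snapshot_ts` from the keys), it could plausibly look like the following. This is a guess at its shape, runnable against sqlite with made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE country_snapshot (country_id INT, name TEXT, snapshot_ts INT);
INSERT INTO country_snapshot VALUES
  (1, 'United States', 100),
  (1, 'USA',           200),  -- latest snapshot for country 1
  (2, 'Canada',        200);  -- latest snapshot for country 2
""")

# A guess at the Country_Latest derivation: keep only the rows from the
# most recent snapshot, so snapshot_ts disappears from the dimension's
# keys and country_id alone remains as the join key.
country_latest = conn.execute("""
    SELECT country_id, name
    FROM country_snapshot
    WHERE snapshot_ts = (SELECT max(snapshot_ts) FROM country_snapshot)
    ORDER BY country_id
""").fetchall()
print(country_latest)  # [(1, 'USA'), (2, 'Canada')]
```

Because the derived query fixes `snapshot_ts`, a node like `Likes` that joins on `(country_id, snapshot_ts)` could join `Country_Latest` on `country_id` alone, which is exactly the "dropped parent join key" behavior described above.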