DataJunction / dj

A metrics platform.
http://datajunction.io
MIT License
34 stars 15 forks source link

Support multiple dimensions linked to a single column on a node #610

Open shangyian opened 1 year ago

samredai commented 1 year ago

I was half-way into implementing this (it's definitely possible if it's paired with some compile changes) but it occurred to me that there's a way to accomplish this today by fitting a "mapper" dimension in the middle and using our current dimension discovery logic. @CircArgs we chatted about this yesterday so feel free to chime in here. (Also curious to hear thoughts from @betodealmeida and @cwegener)

So to start, here's the problem that initially prompted the requirement of multiple dimensions needing to be linked to a single column. I'm going to use "campaigns" as an example context. There's a main source table that tracks all campaign ids and the number of times they've been viewed. However, there are different campaign types and the dimensions for each of those types are not conformed (they're different tables with different schemas). The problem then occurs where the campaign_id column has to map to a different dimension node depending on which type of campaign you're looking at. Currently, DJ only allows you to link a single column to a single dimension node. image

However, DJ is able to discover dimensions by walking along the graph of dimension links. This means that if there was a dimension that had a single campaign_id column that could be linked to from the source node, and then an additional column for each non-conformed dimension that's linked to that dimension's campaign_id column, dimension discovery takes care of the rest. image

The query for that dimension mapper node would be a select of the campaign_id on the original source column with full outer joins to each non-conformed dimension. To make sure this is actually true I tried it out in the demo DJ environment. Here is the graph.

Screenshot 2023-07-09 at 12 35 02 PM

And here's an example of a DJ query that takes the num_campaign_views metric and brings in attributes from all 3 dimensions.

Screenshot 2023-07-09 at 12 35 57 PM

The data turned out just as the example in the excalidraw diagram above. Each campaign is either an email campaign, an sms campaign, or a commercial campaign so 2 of the values in each row will always be null. Those nulls can be filtered out downstream or if the user is planning to filter to a specific campaign ID anyway, they'll only use the dimension attributes that are valid for that type of campaign.

https://github.com/DataJunction/dj/assets/43911210/b9c1500f-43d5-45d1-959b-3d7d5eece1e7

samredai commented 1 year ago

I forgot to add the SQL query for what I'm calling the "mapper" dimension, campaigns.campaign_id_mapper.

SELECT
cv.campaign_id AS campaign_id,
email_campaigns.campaign_id AS email_campaign_id,
sms_campaigns.campaign_id AS sms_campaign_id,
commercial_campaigns.campaign_id AS commercial_campaign_id
FROM campaigns.campaign_views cv
FULL OUTER JOIN campaigns.email_campaigns AS email_campaigns
ON cv.campaign_id = email_campaigns.campaign_id
FULL OUTER JOIN campaigns.sms_campaigns AS sms_campaigns
ON cv.campaign_id = sms_campaigns.campaign_id
FULL OUTER JOIN campaigns.commercial_campaigns AS commercial_campaigns
ON cv.campaign_id = commercial_campaigns.campaign_id
shangyian commented 1 year ago

@samredai I like the idea of having a dimension mapper concept, but I have a few questions:

samredai commented 1 year ago

@shangyian i actually mislabeled that node as a transform in that excalidraw, it should instead be a dimension like in the DJ graph image so that a source/transform column can link to it.

In terms of DJ creating the mapper automatically, I was imagining a user would manually create it when needed. I see how DJ could do it in a more automatic way but I can't clearly picture what the user interface would look like for that. In general I'm worried about too much magic making it hard to reason about what's happening and potentially harder to maintain.

I do think it's worth trying this out first by manually using the primitives and then trying to add an easy button afterwards.

cwegener commented 1 year ago

I like the concept. In my personal opinion, convention always wins over configuration or code. And I think this dimension mapper falls into the bucket of "convention".

And for now the dimension mapper solution looks elegant.

I agree that it's best to just give it a try, see how it feels. :slightly_smiling_face: