Closed awoehrl closed 10 months ago
Thanks for the flag @awoehrl - I could use your input here if you're open to it. My rationale is that, more often than not, the columns designated as optional in the activity schema spec will be null for the majority of activities in a given activity schema implementation, which is why I have the `null_columns` set up with the defaults they currently have. I'm almost certain that my assumption is true for `link` and `revenue_impact`, but I haven't used `anonymous_customer_id` in any activity schema setups for myself yet, so I'm not sure how often that column is left null in practice. For the pipeline you're building, what proportion of activities have a non-null `anonymous_customer_id` and what proportion have a null `anonymous_customer_id`?
Makes sense @bcodell! Our case is probably almost the opposite, as we are working with lots of web analytics data. We have a very high number of anonymous customer ids and, compared to that, almost no customer ids. Also, `link` will be filled in many cases with the URL where the event happened.
In numbers, our biggest activity, `customer__viewed_article`, would have 10,000 unique customer ids associated per day and around 1.3 million anonymous customer ids. For most events we will have an anonymous id, as these are generated via web analytics (cookie ids).
A second stream would be around produced articles. There we have a customer id and no anonymous id, but this would only concern around 1,000 events per day.
Ah interesting. Based on your perspective, maybe my intuition is off and a better assumption to make is that if a developer specifies that they want to include any of the optional columns (`link`, `revenue_impact`, `anonymous_customer_id`), then the `build_activity` macro should assume by default that the developer will explicitly specify each of those optional columns. In that case, I'd change the default `null_columns` argument in the `build_activity` macro to be `none`, derive which columns to look for based on the stream config, and for specific activities that don't use optional columns, the user can specify the columns that aren't used in the `null_columns` argument. Does that seem like a reasonable change?
If not, an alternative is to keep the `null_columns` argument in the `build_activity` macro, and allow users to override the defaults - likely as a config option in the stream config in `dbt_project.yml`.
I'm leaning towards the first approach I described - it's probably best to keep deviations from the standard configuration to be more explicit. But let me know what you think!
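To make the first approach concrete, here is a minimal Python sketch of the resolution logic as described above. It is only a model of the proposal, not the package's actual Jinja implementation: the function name `resolve_null_columns`, the `stream_columns` list, and the error on unknown columns are all assumptions made for illustration.

```python
# Hypothetical model of the proposed default; not the package's real code.
# The three optional columns are the ones named in this thread.
OPTIONAL_COLUMNS = ("link", "revenue_impact", "anonymous_customer_id")


def resolve_null_columns(stream_columns, null_columns=None):
    """Return the optional columns to emit as null for one activity.

    stream_columns: columns the stream config says the stream includes (assumed input).
    null_columns:   per-activity override; None means "nothing is nulled".
    """
    # Only optional columns that the stream actually includes are candidates.
    included = [c for c in OPTIONAL_COLUMNS if c in stream_columns]
    requested = list(null_columns or [])

    # Assumption: nulling a column the stream does not include is a mistake.
    unknown = [c for c in requested if c not in included]
    if unknown:
        raise ValueError(f"not an optional column in this stream: {unknown}")

    return [c for c in included if c in requested]


stream = ["activity_id", "customer_id", "anonymous_customer_id", "link", "ts"]

# Default: the developer is expected to supply every included optional column.
print(resolve_null_columns(stream))            # []
# Explicit deviation for an activity that has no link.
print(resolve_null_columns(stream, ["link"]))  # ['link']
```

Under this model, an activity only deviates from the stream config when the developer says so, which is the explicitness argued for above.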
Notes from conversation:
- `null_columns` arg defaults to `none` in the `build_activity` macro, and developers get a choice for any optional column an activity doesn't use:
  - select it as `null` (e.g. `null as revenue_impact`)
  - list it in the `null_columns` argument in the macro (e.g. `null_columns=['revenue_impact']`)
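As a rough illustration of the two options in these notes, the Python sketch below builds a select list both ways. The helper `build_select_list`, the `page_url` source expression, and the `STRING` cast type are hypothetical; the real macro is Jinja and the cast type would depend on the column and the warehouse.

```python
def build_select_list(developer_columns, null_columns=(), sql_type="STRING"):
    """Combine developer-written expressions with macro-generated nulls.

    developer_columns: mapping of column name -> SQL expression the developer wrote.
    null_columns:      columns the macro should null out on the developer's behalf.
    sql_type:          assumed cast type for generated nulls (illustrative only).
    """
    # A column should come from exactly one of the two routes.
    overlap = sorted(set(developer_columns) & set(null_columns))
    if overlap:
        raise ValueError(f"column both selected and listed in null_columns: {overlap}")

    exprs = [f"{expr} as {name}" for name, expr in developer_columns.items()]
    exprs += [f"cast(null as {sql_type}) as {name}" for name in null_columns]
    return exprs


# Option 1: the developer writes the null explicitly in the activity's SQL.
print(build_select_list({"link": "page_url", "revenue_impact": "null"}))
# ['page_url as link', 'null as revenue_impact']

# Option 2: the developer passes the column via null_columns instead.
print(build_select_list({"link": "page_url"}, null_columns=["revenue_impact"]))
# ['page_url as link', 'cast(null as STRING) as revenue_impact']
```

Either route yields a null `revenue_impact`; the difference is only whether the null appears in the developer's SQL or is generated from the argument.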
I forgot about this, because I added the `null_columns` attribute when doing my first tests with dbt-aql. If I don't explicitly set `null_columns`, my `anonymous_customer_column` will be nulled. Not sure if it's a bug or missing in the documentation though.
dbt_project
Example activity
Output
Here `anonymous_entity_id` will be nulled:
cast(null as STRING) as anonymous_entity_id
Explicitly setting null_columns works: