Closed jtcohen6 closed 2 years ago
OTF Add canonical_event and canonical_event_update seeds that are exact replicas of event/event_update merged with web_page/web_page_update
Yep, I think this is appropriate. Good call.
OTF Whether to include a cross-db macro to grab values from Snowplow contexts, or to include page-view plucking by default in the snowplow_web_events_tmp, or to do neither and leave it up to the installer (status quo).
I think we should leave this up to the user, but conceivably it will make sense to provide helper models / macros for very typical use cases (like Snowflake or Spectrum nested fields).
@jtcohen6 do you want me to re-review this one?
@drewbanin Yessir. I've made a few more changes (related, though likely beyond the initial scope of this PR) in order to support my experimentation with external tables. Namely:
- collector_tstamp + second by default.
- get_most_recent_record() macro instead of the where ts > (select max(ts) from {{this}}) paradigm, since BQ + Spectrum + Snowflake external cannot prune scanning based on dynamic partition filters.
I believe these changes are relevant. To my mind, the primary use case for this PR's functionality is when Snowplow data is loaded or queried, in its canonical event structure, directly from external storage.
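To illustrate the difference between the two filter styles mentioned above, here is a minimal sketch in Python that only builds SQL strings. It assumes that a macro like get_most_recent_record() resolves the latest timestamp up front so it can be inlined as a literal; the actual macro in this PR may work differently. The function names `dynamic_filter` and `static_filter` are hypothetical.

```python
from datetime import datetime
from typing import Optional


def dynamic_filter(ts_column: str, target_relation: str) -> str:
    """Filter whose bound is a subquery, so it is only known at query run time."""
    return f"where {ts_column} > (select max({ts_column}) from {target_relation})"


def static_filter(ts_column: str, most_recent: Optional[datetime]) -> str:
    """Filter whose bound was looked up beforehand and inlined as a literal.

    `most_recent` is assumed to come from a separate, earlier query against
    the target table (hypothetical stand-in for the macro's lookup).
    """
    if most_recent is None:
        # First run: the target table is empty, so there is nothing to filter on.
        return ""
    literal = most_recent.strftime("%Y-%m-%d %H:%M:%S")
    return f"where {ts_column} > '{literal}'"


print(dynamic_filter("collector_tstamp", "analytics.snowplow_web_events"))
print(static_filter("collector_tstamp", datetime(2019, 8, 1, 12, 0, 0)))
```

With a literal bound, the engine knows at planning time which partitions can be skipped. With the subquery, the bound is only known once the query runs.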
I would also appreciate your eye on the failing CircleCI tests, whose operative error appears to be:
ERROR: google-api-core 1.14.2 has requirement setuptools>=34.0.0, but you'll have setuptools 28.8.0 which is incompatible.
All tests are passing for me locally.
Background
Many of our recent Snowplow installations have resulted in a single event stream table, with a schema matching Snowplow's canonical event model.
In these cases, we do not need to look up the page_view_id in a separate table containing web page context; it just needs to be un-arrayed and un-nested from the contexts object on the main events table. This change also enables a more fully incremental build, since page_view_id and collector_tstamp are united from the start.
N.B. OTF = "On The Fence" = I considered multiple approaches and picked one without being sure it's the best. Open to input.
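As a rough illustration of the un-arraying and un-nesting described in the Background, the sketch below pulls a page view ID out of a contexts value in Python. It assumes the contexts column holds a JSON envelope with a "data" array of self-describing contexts, and that the web page context carries the ID under data.id. The function name `pluck_page_view_id` and the sample values are hypothetical, and a real installation would do this in warehouse-specific SQL.

```python
import json
from typing import Optional

# Assumed schema URI prefix for the Snowplow web page context.
WEB_PAGE_SCHEMA_PREFIX = "iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/"


def pluck_page_view_id(contexts_json: str) -> Optional[str]:
    """Return the web page context's id from a contexts JSON string, if present."""
    envelope = json.loads(contexts_json)
    # Un-array: walk the list of contexts attached to the event.
    for context in envelope.get("data", []):
        if context.get("schema", "").startswith(WEB_PAGE_SCHEMA_PREFIX):
            # Un-nest: the page view ID sits inside the context's own data object.
            return context.get("data", {}).get("id")
    return None


sample = json.dumps({
    "schema": "iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-0",
    "data": [
        {
            "schema": "iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/1-0-0",
            "data": {"id": "example-page-view-id"},
        }
    ],
})
print(pluck_page_view_id(sample))
```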
Changelog
- 'snowplow:context:web_page': false. The package will expect to see a column called page_view_id directly within the event model.
- snowplow_web_events_tmp model/macro. IFF the web page context is disabled, this model performs the deduplication of snowplow_web_page_context (throwing away all events that have multiple page view IDs) directly on top of base events. All subsequent models (snowplow_web_events, snowplow_web_events_time, snowplow_web_events_scroll_depth) build on top of this one.
- Add canonical_event and canonical_event_update seeds that are exact replicas of event/event_update merged with web_page/web_page_update. OTF: Whether to include brand-new seeds or to just make the join happen in base_event. Adding more seeds is definitely duplicate code, but it's also in accordance with our integration test practice (?) of having seeds represent the expected format of raw data.
- 'snowplow:context:web_page': false.
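The deduplication mentioned in the Changelog (throwing away all events that have multiple page view IDs) can be sketched in Python as follows. This is only an illustration of the rule as described, not the package's SQL. The function name `drop_ambiguous_events` and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def drop_ambiguous_events(rows: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each event_id to its page_view_id, dropping events with more than one.

    `rows` are (event_id, page_view_id) pairs and may contain duplicates.
    """
    ids_by_event: Dict[str, Set[str]] = defaultdict(set)
    for event_id, page_view_id in rows:
        ids_by_event[event_id].add(page_view_id)
    # Keep only events that resolve to exactly one distinct page view ID.
    return {
        event_id: next(iter(page_view_ids))
        for event_id, page_view_ids in ids_by_event.items()
        if len(page_view_ids) == 1
    }


rows = [
    ("e1", "pv1"),
    ("e1", "pv1"),  # exact duplicate: harmless, still one page view ID
    ("e2", "pv2"),
    ("e2", "pv3"),  # conflicting page view IDs: e2 is thrown away
    ("e3", "pv4"),
]
print(drop_ambiguous_events(rows))
```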
Comments
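- page_view_id from contexts and added it to 'snowplow:events' as a column by the same name.
- OTF: Whether to include a cross-db macro to grab values from Snowplow contexts, or to include page-view plucking by default in the snowplow_web_events_tmp, or to do neither and leave it up to the installer (status quo).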