The package currently handles duplicate events in a few places, though I'd like to handle this differently in the future.
I think it should be our position that analysts should not deduplicate their raw events before feeding them into the Snowplow package. This is an expensive operation, especially on a dataset this large, and it's not really in keeping with the paradigms of "bigger data" platforms (BigQuery et al.).
To that end, we should disable the `unique` test on `snowplow_base_events.event_id`.
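Concretely, that would mean dropping the `unique` test from the column's entry in the package's schema file. A minimal sketch, assuming a `schema.yml` shaped roughly like the package's existing one (the `not_null` test shown here is illustrative, not confirmed from the source):

```yaml
version: 2

models:
  - name: snowplow_base_events
    columns:
      - name: event_id
        tests:
          - not_null
          # - unique  # removed: raw events may legitimately contain
          #           # duplicates, and the package deduplicates downstream
```

Alternatively, users of the package could silence the test themselves via `dbt_project.yml`, but removing it at the package level keeps the "don't pre-deduplicate" position the default for everyone.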