Updates table will create "blind spots" between creation and update times

michaelsnook commented 9 months ago

This is an issue occurs both when backfilling records (when deploying the SCD script for the first time) and when adding new records with no entries on the updates table which have different creation and update times.

My use case is we have a users table and users take actions and the users table has some slowly changing fields like their preferred language and their country of residence. So we create the updates table, we create the view with scd_valid_to and scd_valid_from and then we can join any action to the state of the user that existed at the time of the action (in theory). e.g.

SELECT u.language, u.country, a.*
FROM ${ref("action_tool", "action")} action
JOIN ${ref("scd", "user")} user
  ON (user.id = action.user_id)
WHERE action.created_at BETWEEN user.scd_valid_from AND user.scd_valid_to

This is very convenient and fast except it doesn't work in two specific situations.

1) Back-filling old data: Consider a user, who joined in 2020, took dozens of actions over the last 4 years, has a recent updated_at of some time in late 2023. When I run the SCD script for the first time I get a single history record in 2024 with an updated_at value of 2023, and the view with scd_valid_from shows the same date in 2023. This means that ALL the action records from 2020 to late 2023 will fail to join in the above query. (This is what I mean by orphan records.)

2) Changes before the first update record: Similarly, if the SCD script is active and running daily, and a user signs up by taking an action at 12:00, not specifying a language so our system defaults them to English, and then at 12:05 they go into their profile and set their language to Spanish, and then at 2:00 the SCD job runs... it will insert a record with created_at: 12:00, updated_at: 12:05, and the view will show scd_valid_from 12:05. So the very first action they ever took (a very important one, from a business perspective) will be orphaned in the same way scenario 1.

It would be possible to complexify the views I'm using to join actions to histories, but this defeats the purpose of the convenience view. It might also be possible to add a configuration option for a created_at field so that the view detects when it's dealing with the first ever record update and set scd_valid_from to the creation date (figuring maybe there were changes in between there but we don't know so we have to be okay with some loss of specificity in exchange for not breaking all our other queries).

I think a better solution would be to add a config option for a created_at field, and have the insert script:

In the query that identifies IDs to insert, also identify whether the record is completely new to the updates table or whether it's simply being updated.
When the record is being updated only, use the present behavior.
When the record has being inserted for the first time, if the created_at and updated_at values are different, insert TWO updates, one normal one (like the current logic) and one simulated original record, e.g. select * EXCEPT(updated_at), created_at AS updated_at

With this logic the Scenario 2 situation would create two records, and the scdvalid* view would look like this

created_at	updated_at	user_id	scd_valid_from	scd_valid_to
12:00	12:00	613	12:00	12:05
12:00	12:05	613	12:05	NULL

This way, trying to join the original action from 12:00 and the second action from 12:05 will both work, whereas under the current logic, the first one will not join.

michaelsnook commented 9 months ago

Yesterday I implemented this "duplicate first update record using creation date" in our own system but then it occurred to me that we could actually check for "this is the first instance" and return the created_at field, like this:

SELECT *,
  IF(LAG(${timestamp}) OVER (updates_window) IS NULL, ${creationTime}, ${timestamp}) AS scd_valid_from,
  LEAD(${timestamp}) OVER (updates_window) AS scd_valid_to
FROM
  ${ref(schema, `${name}_updates`)}
WINDOW updates_window AS (PARTITION BY id ORDER BY ${timestamp} ASC)

You can see here the only change made to the current logic of the view is that in addition to using the LEAD() function to find when we have the last update, we also use LAG() to find if we have the first update, and when we have the first, return scd_valid_from as the creation time, not the update time, knowing that we may have missed some changes in between creation and now, but at least we will not have orphaned records when we try and use the data.

I did it with a window function hopefully to avoid slowing down the query, ~but I haven't profiled it~. I think this solution is a more minimal change to the library, because it doesn't change the data shape or structure or number of rows stored, but I'm not sure it's better. I think it is probably actually better to do it the way in the original comment, with the sort-of "mock" first update row, which simulates our best-available-guess at what the data would have looked like at the time, accepting the limitations of not having realtime events for each update.

UPDATE: On our own dataset with about 6 million records using this LAG method, tested against our dataset with 10 million records when using the synthetic originals and the original/existing view, the LAG approach was much faster, used less slot time and scanned fewer GBs than the synthetic originals approach (just running select * against both views). Perhaps in the real world these differences would be less because we'd join to these views with a WHERE ${timestamp} BETWEEN scd_valid_from and scd_valid_from and we'd end up scanning far fewer records, so again I can't quite say I've fully profiled it, but it does appear that the second use of the window function doesn't create the cost/performance hit I was worried about, so perhaps it is a better solution overall, leaving the "approximation" step to the view without creating any additional/synthetic records in the permanent table.

Mats-Elfving commented 5 days ago

Thank you @michaelsnook , I had this exact wish on how it should work. Have you any idea on when / if this would be included into the main branch? (@Ekrekr?)

Ekrekr commented 4 days ago

Looks good to me!

@michaelsnook could you please update the pull request https://github.com/350org/dataform-scd/pull/1 to have this repo's master branch as the upstream, rather than the fork's master branch? Then I should be able to approve and merge.

dataform-co / dataform-scd

Updates table will create "blind spots" between creation and update times #12