@tiina303
A case for importance: being able to replay events later is already called out (and we saw a need for that this week), but additionally the network is unpredictable - sometimes a packet sent later arrives earlier, and sometimes an earlier packet fails and is retried later.
For `set` we can just look at timestamps; however, for `set_once` we can't, as it shouldn't overwrite anything that came earlier. So imagine:
| Timestamp | Function | Value |
| --- | --- | --- |
| 1 | set | 1 |
| 2 | set_once | 2 |
Regardless of the order of arrival, we want the value to be 1 in the end. If event 2 arrives first, we need to know that a `set_once` call wrote the value, so that we know to overwrite it with the earlier one. Furthermore, the following needs to end up with value = 1 even if event 1 arrives later:
| Timestamp | Function | Value |
| --- | --- | --- |
| 1 | set_once | 1 |
| 2 | set_once | 2 |
I propose we keep an additional column `set_once: bool` that tells us whether the value was written by `set_once`. If it was, and we see a `set` or a `set_once` with an earlier timestamp, we need to overwrite the value and update the stored timestamp to the earlier one.
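To make the proposal concrete, here's a minimal sketch of that update rule in Python (names like `StoredProperty` and `apply_update` are hypothetical, not the actual ingestion code):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class StoredProperty:
    value: Any
    timestamp: float  # timestamp of the call that wrote the current value
    set_once: bool    # True if the current value was written by set_once

def apply_update(stored: Optional[StoredProperty], fn: str, value: Any, ts: float) -> StoredProperty:
    """Resolve a set/set_once call against the stored property so the
    result is the same regardless of the order in which calls arrive."""
    if stored is None:
        # No value yet: any call writes it.
        return StoredProperty(value, ts, set_once=(fn == "set_once"))
    if fn == "set":
        # A set always beats a value written by set_once, and beats an older set.
        if stored.set_once or ts > stored.timestamp:
            return StoredProperty(value, ts, set_once=False)
    else:  # fn == "set_once"
        # A set_once only beats a *later* set_once; it never overwrites
        # a value written by set.
        if stored.set_once and ts < stored.timestamp:
            return StoredProperty(value, ts, set_once=True)
    return stored
```

Note that in this sketch a `set` with an earlier timestamp still overwrites a `set_once` value, which is exactly what a timestamp-only rule can't express.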
The second tricky function is `alias`. Currently we say we resolve conflicts by overwriting with the first distinct ID's values; however, with out-of-order events that gets tricky to resolve. Instead we could just rely on timestamps (the latest one is used, with the same `set`/`set_once` rules as above). Then we can just look at the current values with their timestamps plus any later-arriving values, treat this as one person, and `set`/`set_once` already work in any order (I assume if we get a later `set`/`set_once` for someone who was aliased, we can follow the redirect to the new person).
Hmm... how come we need to complicate this?
We give each property a timestamp. Then if `$set` comes in, we check if its timestamp is later than the property's stored timestamp. If so, we override. If `$set_once` comes in, we check if it has an earlier timestamp than the stored timestamp. If so, we override.
Or what am I missing?
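For reference, here is that rule as code (a naive sketch with hypothetical names; the walkthrough below shows the case it misses):

```python
# Naive rule: keep only (value, timestamp) per property.
# set overrides if its timestamp is later; set_once overrides if earlier.
def naive_update(stored, fn, value, ts):
    if stored is None:
        return (value, ts)
    _, stored_ts = stored
    if fn == "set" and ts > stored_ts:
        return (value, ts)
    if fn == "set_once" and ts < stored_ts:
        return (value, ts)  # this branch is what row 7 of the table below breaks on
    return stored
```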
> If $set_once comes in, we check if it has an earlier timestamp than the stored timestamp. If so, we override.
Imagine someone uses both `set_once` and `set` for the same key: we expect a value written by a later-timestamped `set` not to get overwritten by an earlier-timestamped `set_once`. A real-world example: there are places where we only have weak evidence about the user and want to write a value only if there wasn't one before. Imagine an app where a user can set their location in their profile, but the app also uses the location the user is visiting the site from as the weak evidence. So Alice visited the site from London (which sends a `set_once`) and then updated her location to Dublin (which sends a `set` request). Due to network problems, the `set_once('Alice', {'location':'London'})` request arrives later, but has an earlier timestamp (compared to the stored timestamp from the `set` request). We don't want that to override the value, i.e. we want to end up with `'location':'Dublin'` regardless of the order in which the `set` & `set_once` requests arrive.
I'm still not following. If requests arrive later because of network issues, they should still contain the original timestamp, no? That's why we have all these `now`, `sent_at` and `timestamp` fields in the system.
> That's why we have all these `now`, `sent_at` and `timestamp` fields in the system.
I'm not actually sure how & where these three are set; I was making the assumption that we only have `timestamp` (which is what we base the decision to overwrite the value or not on).
> I'm still not following.
Sorry, this can be confusing & maybe I'm misunderstanding something. You can check the table below; maybe that's more helpful. To elaborate on my earlier example, let me provide the exact calls I was thinking of & what we'd want to happen. I'm also numbering my statements, so it's easier to call out where they stop making sense.
```
set_once('Alice', {'location':'Dublin'}, timestamp="16:00")
set('Alice', {'location':'London'}, timestamp="16:01")
```
(1) For this, in the end we want the outcome to be `Alice.location == London`.
(2) If the calls arrive in this order, it's simple: we first set the location to `Dublin` and later see the `set` call, which overrides it to `London`.
(3) The above works based on your proposal:
> if $set comes in, we override if its timestamp is later than the property's timestamp.
(4) Now imagine the network was bad and the requests arrive in the reverse order:
```
set('Alice', {'location':'London'}, timestamp="16:01")
set_once('Alice', {'location':'Dublin'}, timestamp="16:00")
```
(5) After having processed the first `set` call, we have `Alice.location == London` with the timestamp "16:01".
(6) Now we process the `set_once` call. If we apply what you proposed:
> If $set_once comes in, we check if it has an earlier timestamp than the stored timestamp. If so, we override.
(7) then we see that a `set_once` came in,
(8) we see that it does have an earlier timestamp compared to the stored one,
(9) so you say we should override,
(10) we override and end up with `Alice.location == Dublin`,
(11) but that's not what we wanted - we should get `Alice.location == London` despite the requests arriving in this order.
Here's a table explaining when the override should happen. The example I provided above corresponds to row 7. Since rows 3 & 4 (and likewise 7 & 8) differ only in which call previously wrote the value, I can't think of a way to override correctly without that information.
| # | Call | Value exists | Call TS is ___ existing TS | Previous fn | Write/override |
| --- | --- | --- | --- | --- | --- |
| 1 | set | no | N/A | N/A | yes |
| 2 | set_once | no | N/A | N/A | yes |
| 3 | set | yes | before | set | no |
| 4 | set | yes | before | set_once | yes |
| 5 | set | yes | after | set | yes |
| 6 | set | yes | after | set_once | yes |
| 7 | set_once | yes | before | set | no |
| 8 | set_once | yes | before | set_once | yes |
| 9 | set_once | yes | after | set | no |
| 10 | set_once | yes | after | set_once | no |
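For what it's worth, running the Alice example through the `apply_update` sketch from earlier in this thread reproduces rows 6 and 7 of the table, giving `London` in both arrival orders (timestamps written as minutes since midnight for brevity):

```python
# "16:00" -> 960, "16:01" -> 961 (minutes since midnight)

# Arrival order 1: set_once first, then set (table row 6 applies).
p = apply_update(None, "set_once", "Dublin", 960)
p = apply_update(p, "set", "London", 961)
assert p.value == "London"

# Arrival order 2, reordered by the network: set first, then set_once
# (table row 7 applies: no override).
p = apply_update(None, "set", "London", 961)
p = apply_update(p, "set_once", "Dublin", 960)
assert p.value == "London"
```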
Thanks for being patient and taking the time to break this down. Now I get the issue, and adding this column does make a lot of sense!
This discussion (+ @tiina303's table particularly) was so useful
Related: [Person Created At needs to also handle ingestion in any order](https://github.com/PostHog/posthog/issues/7454)
Curious what the status of this is? Anything I can do to help at all (e.g. manual testing, etc.)?
Potentially... here's how I'm thinking about doing the testing: I have a DigitalOcean instance already (do-lon1-tiina-testing-2021-11-30-2054, with hostname/user/pass here: https://posthog.slack.com/archives/C01MM7VT7MG/p1638494244381100?thread_ts=1638494082.380800&cid=C01MM7VT7MG). The plan is to set `NEW_PERSON_PROPERTIES_UPDATE_ENABLED_TEAMS` to all the appropriate project IDs (my network was too bad for the image push 🙄). I suspect we'll see some differences that we'll need to ignore, e.g. GeoIP - or maybe it's better to turn that off beforehand.
Relevant recent conversation (tied to person properties updates): https://posthog.slack.com/archives/C0374DA782U/p1660840149534569
We have `properties_last_operation` and `properties_last_updated_at`, but they aren't always updated properly. Given how our JS lib sends browser info etc. with every event, we would increase the number of writes to Postgres if we updated those fields properly (specifically the timestamp in `properties_last_updated_at`).
This issue hasn't seen activity in two years! If you want to keep it open, post a comment or remove the `stale` label – otherwise this will be closed in two weeks.
This issue was closed due to lack of activity. Feel free to reopen if it's still relevant.
Whenever we ingest events, we end up with entries in 3 tables: `events`, `persons` and `person_distinct_ids`. Excluding plugins for now, if we ingest the same events again in the same order, we will (most likely) end up with the same entries in those tables.
However, I think we should also get exactly the same entries in those tables if we ingest events... backwards. If we do, it would make ingestion, retries and exports a lot simpler. For example, we could export multiple months in parallel from one PostHog instance to another and know that all persons were transferred as they are.
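As a sketch of the property we'd want to hold (assuming a hypothetical `ingest_all` helper that processes a batch into a fresh instance and returns the resulting table contents in a canonical, sorted form):

```python
import itertools

def assert_order_independent(events, ingest_all):
    """Check that ingesting the same events in any order yields identical
    entries in events, persons and person_distinct_ids.

    `ingest_all(events)` is a hypothetical helper: it ingests the batch
    into a fresh instance and returns the three tables' rows in a
    canonical (sorted) form, so snapshots are directly comparable.
    Exhaustive permutations are only practical for small batches.
    """
    baseline = ingest_all(list(events))
    for permutation in itertools.permutations(events):
        assert ingest_all(list(permutation)) == baseline
```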
This task touches two things:
The details and feasibility of the above points need to be specced out.