macobo commented 2 years ago

Background

Person properties are a key part of PostHog: they are how our customers analyze their users.

Person properties include things like (examples from our project):

Visitor metadata: Their location, browser, initial location, UTM tags, OS, etc
"schema" information - email, is_signed_up, anonymize_data
Other information: UTM tags, billing plan, what currency they use, etc

Currently, when you update a person property, the new value applies across all of time. This means that if someone visits with a new UTM tag, switches their billing plan, and so on, we overwrite the previous values.

This is:

Good: If someone does things before signing up, we can later analyze this. Example: What UTM tags do paying users have?
Bad: If someone changes a person property, historical insight results change. Example: Graph of paid users over time changes if a user churns even if they paid previously.
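
To make the "Bad" case concrete, here is a minimal sketch with made-up data. The event shape, the `planAtEvent` field, and the helper name are illustrative assumptions, not PostHog's schema or query engine.

```javascript
// Toy data: one user who was paying in January and had churned by February.
// All names and shapes here are hypothetical.
const events = [
  { distinctId: "u1", event: "pageview", day: "2021-01-10", planAtEvent: "paid" },
  { distinctId: "u1", event: "pageview", day: "2021-02-10", planAtEvent: "free" },
];

// Mutable person state: only the latest value survives.
const persons = { u1: { plan: "free" } };

// Count distinct users per day for which isPaid(event) is true.
function countPaidUsersPerDay(events, isPaid) {
  const usersByDay = new Map();
  for (const e of events) {
    if (!usersByDay.has(e.day)) usersByDay.set(e.day, new Set());
    if (isPaid(e)) usersByDay.get(e.day).add(e.distinctId);
  }
  const counts = {};
  for (const [day, users] of usersByDay) counts[day] = users.size;
  return counts;
}

// Joining to the person's CURRENT properties rewrites history after the churn.
const usingCurrent = countPaidUsersPerDay(
  events,
  (e) => persons[e.distinctId].plan === "paid"
);

// Using the value recorded with the event keeps January intact.
const usingPointInTime = countPaidUsersPerDay(
  events,
  (e) => e.planAtEvent === "paid"
);

console.log(usingCurrent);     // January shows 0 paid users
console.log(usingPointInTime); // January shows 1 paid user
```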

Possible alternative(s)

Record user properties changing or store user properties along event properties.

This would:

make it so historical data would not change
improve data integrity (cc @yakkomajuri can comment closer here) - updating person properties is harder due to strict ordering requirements
Potentially speed up queries (depending on how we're storing things) - can cache more + perhaps store user properties alongside events.

What do other analytics products do

Most analytics products seem to use a similar model to us. However Amplitude for example does handle properties changing over time: https://help.amplitude.com/hc/en-us/articles/115002380567-User-properties-and-event-properties

Thoughts?

We've committed a lot of resources recently around topics related to person properties - it might do good to revisit the overall concept from a product perspective.

cc @timgl @tiina303 @yakkomajuri @marcushyett-ph @paolodamico

hazzadous commented 2 years ago

@macobo are all the updates to person properties also stored in events that we store already?

yakkomajuri commented 2 years ago

I've had a think about this and also chatted with @tiina303.

Ultimately, I feel like this is an education problem, as well as an issue with how we ourselves think about person properties.

In my view, person properties are things that should not be mutated often. A person is, after all, just an aggregate state from events over time. Thus, if you want to know the state of something at a given point in the past, you should do it with event properties.

The problem is that we ourselves started misusing person properties. GeoIP properties, for example, should not be set on persons in my view. One day I'm here, the next day I'm there. This data is included with every event, so why does it need to be on a person?

I think person properties make sense for things like email, IDs, "initial" properties ("initial referrer", "initial page viewed"), states that hold true forever (e.g. "was a paying customer once"), and states where the past doesn't matter (e.g. "is a paying customer now").

Now, there is an issue with properties not being set on all events. Enter education. We have posthog.register in posthog-js, and we should encourage people to use it.

On all libs without autocapture, you're explicitly creating an event each time, so can definitely pass in the data you need. With autocapture, however, to ensure all your events have the right data, you just need to use super properties.
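
As a rough illustration of what super properties do, the sketch below uses a toy in-memory client. `ToyClient` is a hypothetical stand-in written for this example; only the `register` name mirrors the `posthog.register` call mentioned above, and nothing here reflects how posthog-js is actually implemented.

```javascript
// Hypothetical stand-in for an analytics client with "super property" semantics.
class ToyClient {
  constructor() {
    this.superProperties = {};
    this.sent = [];
  }

  // Remember properties that should be attached to every later event.
  register(properties) {
    this.superProperties = { ...this.superProperties, ...properties };
  }

  // In this sketch, explicit event properties win over registered ones.
  capture(event, properties = {}) {
    const payload = {
      event,
      properties: { ...this.superProperties, ...properties },
    };
    this.sent.push(payload);
    return payload;
  }
}

const client = new ToyClient();
client.capture("pageview");                             // sent before register: no plan
client.register({ plan: "paid" });
client.capture("report viewed", { report: "funnels" }); // carries plan: "paid"

console.log(client.sent);
```

The point is that the value is stamped onto each event at send time, so later changes do not rewrite earlier events.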

Finally, I gave a bit of thought to the technical implementation of properties over time and think it is both problematic and just not worth it. But I'd rather dispute this on product merits.

Essentially, we already support user properties over time: they're called event properties.

yakkomajuri commented 2 years ago

More practically, everything in PostHog is event-based (yes, even funnels).

You're never looking at persons - you're looking at persons who did an event. Think unique users in Trends, or a funnel step. So the historical data is already there.

Let's take the example from the description:

Graph of paid users over time changes if a user churns even if they paid previously.

I think this is only the case if you're doing person property-centric analytics, which we should educate people about.

How do I know my current paid users? Well, that's a good use for person properties.

How do I know my paid users in the past? You should have a payment event, filter by unique users.

How do I know how my paid users in the past were using the product? Here you should use event properties.
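
A minimal sketch of the "payment event, filter by unique users" idea, assuming a hypothetical `payment` event name and a simple `{ distinctId, event, day }` shape (neither is PostHog's actual schema):

```javascript
// Count unique users who sent a "payment" event in each calendar month.
function uniquePayersByMonth(events) {
  const byMonth = new Map();
  for (const e of events) {
    if (e.event !== "payment") continue;
    const month = e.day.slice(0, 7); // "YYYY-MM" from an ISO date string
    if (!byMonth.has(month)) byMonth.set(month, new Set());
    byMonth.get(month).add(e.distinctId);
  }
  return Object.fromEntries([...byMonth].map(([m, users]) => [m, users.size]));
}

const sampleEvents = [
  { distinctId: "u1", event: "payment", day: "2021-01-05" },
  { distinctId: "u2", event: "payment", day: "2021-01-20" },
  { distinctId: "u1", event: "payment", day: "2021-02-05" },
  { distinctId: "u1", event: "pageview", day: "2021-02-06" },
];

const payers = uniquePayersByMonth(sampleEvents);
console.log(payers);
```

Because the count comes from events, a later change to the person's current state does not alter past months.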

We already have all the tools in place for historical properties analysis, we just need to help people use it correctly in my opinion (starting with ourselves).

Nevertheless, happy to get pushback here. I could certainly be wrong.

macobo commented 2 years ago

@hazzadous and @yakkomajuri let's take a step back and forget how PostHog is implemented right now, whether we solve the use case right now, or even how we'd implement this.

The question is: if we were designing a system from scratch that had to solve problems relating to users and state changes in the best way for our customers, what would that look like from the customer's perspective?

So let's consider these related questions you might ask from an analytics product:

How many paying users do I have over time? (graph of users over N days)
How does conversion on a funnel compare over payment tier?

The first is a question every company needs to be able to answer and the second is a really powerful way to connect sales and product decisions.

Some potential solutions users might adopt at this time:

setting "subscription" user property (current approach): We look to solve 1 and 2 as customers initially integrate. N months down the line user data has changed enough that the results are not consistent anymore.
separate "payment" event: we could answer neither 1 nor 2 since payments might occur for some users monthly, some users yearly and there'd be no way to compare those.
separate "payment" event sent per day in a cronjob: this is similar to what we do with instance status reports. This works super well for 1, but doesn't help with 2 at all. It's additionally a lot of code that the user needs to update and manage.
using "super" properties: this works for 1 and 2 as long as all events are only being sent from posthog-js. However all of our clients seem to be using a mixture of backend and frontend libraries, meaning that for (2) you can't really rely on it.

In the future users could:

Create group per payment tier: this works for both. Downsides: Requires all tracking to be updated, there'll be a max limit for group types meaning it will likely cause pain down the line.

It ultimately boils down to what Yakko said. (Paraphrasing) PostHog does not work well for user- (and by extension group-) property based analytics if these properties change over time. Even worse, we appear to work for these use cases, creating initial expectations that translate into mistrust as issues inevitably arise.

From my perspective (basing this on past experience), as a customer I'd much rather have user properties be point-in-time: if you call posthog.person.set({ foo: 'bar' }), that property applies only going forward, until you explicitly override it.

However, even this has downsides. Quoting Amplitude's docs:

When a user property's value changes, Amplitude charts can show the user in both the new and the old user property categories. This overlap only applies for the specific day on which the property value changed.

These are only my thoughts though - I'm interested in how others think users should approach this.

tiina303 commented 2 years ago

I feel like there are

timeless properties - for which I don't care about the older values, e.g. email, initial referring site. As a user, I would want to look at what users who have an email have done in the past, regardless of time, and wouldn't want them filtered out for the period before they entered an email on their profile. This can be implemented as point-in-time properties using only the value at the current timestamp, but probably less efficiently.
point-in-time properties - for which I feel like storing them on the event is a good strategy (seems like Amplitude does that too). We can do it either by setting them on every event or by computing the value every time (potentially storing/caching for efficiency). To know a value at a particular timestamp, we can essentially compute: if a set exists before this timestamp, use the value of the latest set before the timestamp; if not, use the earliest set_once value. What about the case when a user is merged later? Do we want to count them historically separately or as 1 unique user?
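
The lookup rule described for point-in-time properties can be sketched as follows. The update-log shape (`{ timestamp, op, key, value }`) is a hypothetical one chosen for this example, and treating "before" as "at or before" and limiting set_once candidates to that same window are assumptions on my part.

```javascript
// Resolve the value of `key` as of `timestamp` from a log of property updates.
// Rule: latest "set" at or before the timestamp; otherwise the earliest
// "set_once" at or before the timestamp; otherwise undefined.
function propertyAt(updates, key, timestamp) {
  const relevant = updates
    .filter((u) => u.key === key && u.timestamp <= timestamp)
    .sort((a, b) => a.timestamp - b.timestamp);

  const sets = relevant.filter((u) => u.op === "set");
  if (sets.length > 0) return sets[sets.length - 1].value;

  const setOnces = relevant.filter((u) => u.op === "set_once");
  return setOnces.length > 0 ? setOnces[0].value : undefined;
}

const updates = [
  { timestamp: 1, op: "set_once", key: "plan", value: "free" },
  { timestamp: 2, op: "set_once", key: "plan", value: "trial" }, // ignored: not the earliest
  { timestamp: 5, op: "set", key: "plan", value: "paid" },
  { timestamp: 9, op: "set", key: "plan", value: "free" },
];

console.log(propertyAt(updates, "plan", 3)); // falls back to the earliest set_once
console.log(propertyAt(updates, "plan", 6)); // latest set at or before t=6
```

This does not address the merge question raised above; merged users would need their update logs combined before the lookup.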

separate "payment" event: we could answer neither 1 nor 2 since payments might occur for some users monthly, some users yearly and there'd be no way to compare those.

We could solve this if we also have a separate churned event (or whenever they get bumped down to free tier) by plotting a cumulative graph payments - churns
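
A rough sketch of the cumulative "payments - churns" graph. It assumes each user sends one hypothetical `payment` event when they become a paying user and one `churned` event when they stop; recurring monthly or yearly payment events would need to be deduplicated first, which is not shown here.

```javascript
// Running count of paying users at the end of each requested day.
// `days` must be sorted ascending; ISO date strings compare correctly as text.
function paidUsersOverTime(events, days) {
  const sorted = [...events].sort((a, b) => a.day.localeCompare(b.day));
  const result = {};
  let running = 0;
  let i = 0;
  for (const day of days) {
    // Consume every event up to and including this day.
    while (i < sorted.length && sorted[i].day <= day) {
      if (sorted[i].event === "payment") running += 1;
      if (sorted[i].event === "churned") running -= 1;
      i += 1;
    }
    result[day] = running;
  }
  return result;
}

const billingEvents = [
  { distinctId: "u1", event: "payment", day: "2021-01-05" },
  { distinctId: "u2", event: "payment", day: "2021-01-20" },
  { distinctId: "u1", event: "churned", day: "2021-02-15" },
];

const series = paidUsersOverTime(billingEvents, [
  "2021-01-10",
  "2021-01-31",
  "2021-02-28",
]);
console.log(series);
```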

@hazzadous

are all the updates to person properties also stored in events that we store already?

Currently we store only the user properties that we actually updated on the event. I have a PR to store the properties sent with the event (https://github.com/PostHog/plugin-server/pull/615).

mariusandra commented 2 years ago

This feels like we might suddenly need to store a lot more data... unless we version person, and just add a version field to each event. If we'd do that, we could even let the app user choose if they'd like to see the person properties in a query as they are now, or as they were when the event happened (default choice).
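
One way to picture the versioning idea, as a sketch only: each event stores a small version number, and the query chooses between the snapshot for that version and the latest one. The `personVersion` field, the `personVersions` table, and the mode names are all hypothetical.

```javascript
// personVersions[distinctId] is a list of property snapshots; index = version.
const personVersions = {
  u1: [{ plan: "free" }, { plan: "paid" }],
};

// Events carry only a version number, not a full copy of the properties.
const versionedEvents = [
  { distinctId: "u1", event: "signed up", personVersion: 0 },
  { distinctId: "u1", event: "upgraded", personVersion: 1 },
];

// mode "at_event": properties as they were when the event happened (default).
// mode "current": properties as they are now.
function personPropertiesFor(event, versions, mode = "at_event") {
  const snapshots = versions[event.distinctId];
  const index = mode === "current" ? snapshots.length - 1 : event.personVersion;
  return snapshots[index];
}

const signup = versionedEvents[0];
console.log(personPropertiesFor(signup, personVersions));            // plan at signup
console.log(personPropertiesFor(signup, personVersions, "current")); // plan today
```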

In real world usage, this would make it possible to actually create a "free to paid" funnel (or correlation analysis!?!) in 6 seconds, when all you have is a mutated person property (a common case perhaps?).

This would also simplify feature flags. Right now we store them explicitly with events even though they're essentially properties on a person at a point in time. Having an "app-level" solution to this might open up many other use cases.

So, yeah, LGTM.

kpthatsme commented 2 years ago

From my perspective (basing this on past experience), as a customer I'd much rather have user properties be point-in-time: if you call posthog.person.set({ foo: 'bar' }), that property applies only going forward, until you explicitly override it.

Agree very much w/ this.

Here's an example use case where mutable properties (over time) are important, in combination w/ a static point of reference-

Let's say we wanted to do some analysis on the behaviors users take when they're on an open source or scale plan before they become enterprise clients. Now in theory the set of people and behaviors we're trying to match for today would all already be on the enterprise plan, and that would be their current user property value.

It's important that all of the events they had sent to PostHog before switching their plan are associated with that old plan value so we can do the right analysis on behaviors before the change occurred.

pauldambra commented 2 years ago

Some user properties are mutable (and may be important to change)

For example: If a person changes their name it is normally important to them that their new name is used. If it is because of transitioning it will be particularly important for that user that the property changes.

Some user properties mutate depending on where you stand

If I run a food delivery website, I might not care about the user's local time when predicting load on my servers, because the user's "now" and the server's "now" might be in different timezones.

But if I'm scheduling delivery drivers, the fact that they order food at 1am becomes important.

So what?

If person property changes are events, then we can:

ask "what are the person's properties now": const user = userPropertyEvents.reduce((properties, event) => ({ ...properties, ...event }), {})
ask "what were the person's properties then": const user = subsetOfUserPropertyEvents.etc
snapshot the user so we have a (possibly conceptual) stream of user states, then use those to make querying users faster: const user = userPropertyEventsSinceTuesday.reduce((properties, event) => ({ ...properties, ...event }), userOnMonday)
if we have snapshots we can reference the user snapshot in every event, or merge its properties into each event so we can ask point-in-time questions (I'd expect this to make processing events out of order harder)
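
The three questions above can be put into one runnable sketch. The `{ timestamp, properties }` shape for change events and the numeric timestamps are assumptions made for this example.

```javascript
// Each change event carries a timestamp and the properties it sets.
const changes = [
  { timestamp: 1, properties: { email: "a@example.com" } },
  { timestamp: 3, properties: { plan: "free" } },
  { timestamp: 7, properties: { plan: "paid" } },
];

// Fold change events (assumed sorted by timestamp) onto a starting state.
const fold = (events, start = {}) =>
  events.reduce((props, e) => ({ ...props, ...e.properties }), start);

// "What are the person's properties now?"
const propsNow = fold(changes);

// "What were the person's properties then?" (as of t=5)
const propsThen = fold(changes.filter((e) => e.timestamp <= 5));

// Snapshot at t=5, then replay only the later changes on top of it.
const snapshot = { timestamp: 5, properties: propsThen };
const fromSnapshot = fold(
  changes.filter((e) => e.timestamp > snapshot.timestamp),
  snapshot.properties
);

console.log(propsNow, propsThen, fromSnapshot);
```

Replaying from the snapshot gives the same result as folding the full history, which is what makes snapshots a safe shortcut as long as events arrive in order.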

To make things (maybe) clearer with a diagram:

[Image: PXL_20211105_104332917]

If person properties are in the event stream, I can calculate a person by reading backwards or forwards through the stream. Or I can maintain a person stream and write the change event to it each time I see a person property change event. Or I can calculate a new person state each time I see a person property change event and write that instead, so I can get their current state by reading the head of that stream.

I guess this doesn't match how we're using Kafka and ClickHouse right now... But you could presumably write states of a person into ClickHouse so that they're queryable, or write the events and fold over them each time you read them. So long as you don't end up with millions of person events for a single person, neither approach is too onerous.

(I am hoping this is either helpful - or is the wrong answer and so I learn by being corrected :))

tiina303 commented 2 years ago

I've added this as P3 to the platform backlog. We currently treat user properties as timeless. If one wants point-in-time values, they can add the properties to events. In the frontend there are super properties for this; in the backend we want to keep our libraries as stateless as possible, but there's full control over sending properties from there. Furthermore, plugins could be used to attach extra properties to events. The trickier case is mixed frontend and backend usage (it could be solved with plugins if we allowed custom ones), but solving this on our side is complicated, and so far we haven't seen much user interest in it.

posthog-contributions-bot[bot] commented 2 years ago

This issue has 2067 words at 9 comments. Issues this long are hard to read or contribute to, and tend to take very long to reach a conclusion. Instead, why not:

Write some code and submit a pull request! Code wins arguments
Have a sync meeting to reach a conclusion
Create a Request for Comments and submit a PR with it to the meta repo or product internal repo

Is this issue intended to be sprawling? Consider adding label epic or sprint to indicate this.

macobo commented 2 years ago

@tiina303 not sure if there's anything to backlog yet given a product decision isn't made here - what's there to develop? As-is there doesn't seem to be a direct action point here.

We chose a (reasonable) direction in the beginning, but it has its tradeoffs. We haven't made a decision on whether there's a fundamental simplification we could make. I'd avoid incrementally piling on complexity and let this sit for a while and revisit it in the long haul instead.

tiina303 commented 2 years ago

Maybe I didn't use the right thing (backlog + P3 priority, what should I have used?), but we are on the same page otherwise: do nothing now, there is no concrete design/implementation proposal, potentially revisit in the future.

macobo commented 2 years ago

A new usecase came up for this: Tracking feature flags/experiments.

Currently only posthog-js sends us what feature flags are active at a given time. So if you want to track a funnel breaking down if the flag was active/not active, it isn't currently (accurately) possible if any of the tracked events are sent from other libraries.

With properties that were point-in-time you could store the active flags on the person and change them as conditions change :)

cc @neilkakkar @paolodamico

paolodamico commented 2 years ago

I'd be in favor of supporting this if we could (unsure about all the technical ramifications) but I'll defer entirely to @neilkakkar as AFAIK we're sending feature flag details in each event so we wouldn't necessarily need this for experimentation?

Aeolun commented 2 years ago

I'm not sure if this has been considered, but since posthog is already keeping track of 'sessions', can we attach properties to sessions? I'm fairly certain I've seen this in other products, and it wouldn't affect any of the current behavior around persons/events.

tiina303 commented 2 years ago

We are currently working on adding person properties to events, see this project for progress tracking https://github.com/orgs/PostHog/projects/41/views/1

PostHog / posthog

Should person properties be mutable across time? #6690
