Closed: macobo closed this issue 2 years ago
The experiment is live and gathering data.
Some gotchas for when we're moving to the full solution:
I've been running some measurements on this experiment using clickhouse-flamegraph, wrapping it in my own script to make the measurements easier.
I found a pretty remarkable trends query (1) which was doing a lot of inline distinct_id joins; it ran in ~23 seconds, reading 25,393,920 rows.
I modified the query (2) to only look at rows where data is present, and it ran in 3,146 ms, reading the same number of rows. Lastly (3), I replaced the inline subqueries with a join based on the `distinct_ids` column; this ran in 1,156 ms, reading 12,027,294 rows.
Queries + the CPU flamegraphs can be found under https://github.com/PostHog/scratchpad/tree/main/karl/2021-07-23-person-distinct_ids-column
From the flame graphs one can see that a lot less work is being done joining the data together (`JoinSideTransform`) and on reading, which matches expectations.
Worth noting that most queries will not see an improvement this remarkable - the original query had 10 separate person_distinct_id subqueries.
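For reference, the two query shapes differ roughly like this. This is a sketch, not the actual queries from the scratchpad - the team id, filters, and surrounding trends machinery are made up:

```sql
-- (1) Before: each person filter ran its own inline subquery against
-- person_distinct_id (the original query had ~10 of these)
SELECT count()
FROM events
WHERE team_id = 2  -- hypothetical team
  AND distinct_id IN (
      SELECT distinct_id
      FROM person_distinct_id
      WHERE team_id = 2
        AND person_id IN (1, 2, 3)  -- persons matching one property filter
  )

-- (3) After: map events to persons once via the distinct_ids array column
SELECT count()
FROM events e
INNER JOIN (
    SELECT id, distinct_ids
    FROM (
        SELECT id, argMax(distinct_ids, _timestamp) AS distinct_ids
        FROM person
        WHERE team_id = 2
        GROUP BY id
    )
    ARRAY JOIN distinct_ids
) AS p ON p.distinct_ids = e.distinct_id
```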
Conclusion: This is a worthwhile investment. Will sync with team and see what they think!
@EDsCODE noted two gotchas:
Another example (on a large client) can be found in https://github.com/PostHog/scratchpad/tree/main/karl/2021-07-26-person-join-trends.
This is a very basic trends query. The original query performs as follows:

```
query_duration_ms: 9654
read_rows:         216619277
read_size:         335.89 GiB
result_rows:       1
result_size:       256.00 B
memory_usage:      3.29 GiB
```
Using a join on the `person` table only:

```
query_duration_ms: 5493
read_rows:         186371941
read_size:         334.37 GiB
result_rows:       1
result_size:       256.00 B
memory_usage:      1.37 GiB
```
This is a 43% timing improvement.
From looking at the CPU flamegraph the win is not obvious - in both cases the majority of time is spent reading data from disk. However, the real-time graph shows that in the query with a join, threads spend much less time waiting for results from other threads.
Memory usage also improved significantly - probably because only part of the `person` table needed to be kept in memory, instead of all of `person_distinct_id` + `person`.
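Stats like the ones quoted above can be pulled from ClickHouse's `system.query_log`. A sketch, assuming the benchmark query was tagged with a query id (the id below is a placeholder):

```sql
-- Fetch timing/IO/memory stats for a finished query from the query log
SELECT
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes)   AS read_size,
    result_rows,
    formatReadableSize(result_bytes) AS result_size,
    formatReadableSize(memory_usage) AS memory_usage
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_id = 'my-benchmark-query'  -- placeholder id
ORDER BY event_time DESC
LIMIT 1
```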
Notes on how we might implement this. It's kind of hard/annoying
As mentioned above, the current implementation has some pretty bad race conditions.
I think the best option is to:
Note that this comes with a significant downside. Fetching `person` is more expensive now - every time identify/alias/$set happens, we need to fetch the person's `distinct_ids` to insert into ClickHouse. That said, we're already fetching all properties for every event.
This might easily become an ingestion pipeline bottleneck, as this query is the most expensive one we do in the plugin server.
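As a sketch, the extra Postgres-side lookup on identify/alias/$set would look something like this. The table name `posthog_persondistinctid` follows PostHog's Django naming convention, but treat the exact names here as assumptions:

```sql
-- Gather all of a person's distinct_ids so they can be written onto the
-- ClickHouse person row. This would run on every identify/alias/$set.
SELECT array_agg(distinct_id) AS distinct_ids
FROM posthog_persondistinctid
WHERE team_id = $1
  AND person_id = $2;
```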
The 'easy' part - once 1 and 2 are live, we can update most queries one-by-one. In most cases we'll just join events with something like:
```sql
FROM events e
INNER JOIN (
    SELECT id, properties, distinct_ids
    FROM (
        SELECT
            id,
            argMax(properties, person._timestamp) AS properties,
            argMax(distinct_ids, person._timestamp) AS distinct_ids,
            max(is_deleted) AS is_deleted
        FROM person
        WHERE team_id = X
        GROUP BY id
        HAVING is_deleted = 0
    )
    ARRAY JOIN distinct_ids
) AS p ON p.distinct_ids = e.distinct_id
```
The final step would be dropping the `person_distinct_id` table. We should only do this a couple of months down the line, once we've updated all queries and everything is stable.
Q from Neil: does this slow down queries that only join against `person_distinct_id`?
1. Migrating data
This is a pretty large-scale data migration. Complicating factors are:
- this also needing to work on our new helm charts
- ideally not causing any downtime (as a naive migration script could cause)
These seem like reasonable requirements.
Similar to how we've "repaired" `person_distinct_id` tables in the past, we could:
- Temporarily drop the `kafka_person` and `person_mv` tables
- Copy the `person` table (with a query that includes `distinct_ids` from plugin-server) into a CSV file
- Create a `person_swap` table and insert the values from the CSV file
This is a bad pattern - if you can avoid copying data out of the database, you should. This can be done with a simple `CREATE TABLE ... AS SELECT` (https://clickhouse.tech/docs/en/sql-reference/statements/create/table/#from-select-query), avoiding copying anything into or out of the database. The nice thing is that even with a sharded table this reduces load on the cluster, since all of the copying is done within the same shard on the same node, versus distributing the reads and writes to create one giant CSV.
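A sketch of what that could look like, run locally on each node so the copy stays within the shard. The engine and ordering key are assumptions, not taken from the actual schema:

```sql
-- Copy person into a swap table entirely inside ClickHouse,
-- avoiding the round-trip through a CSV file.
CREATE TABLE posthog.person_swap
ENGINE = ReplacingMergeTree(_timestamp)
ORDER BY (team_id, id)
AS
SELECT *  -- the real SELECT would also populate the new distinct_ids column
FROM posthog.person
```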
- Run the swap:

```sql
RENAME TABLE posthog.person to posthog.tmp_person, posthog.swap_person to posthog.person, posthog.tmp_person to posthog.swap_person on cluster posthog
```
- Recreate the `kafka_person` and `person_mv` tables
- (Once everything is stable) drop the old `person` table
This is both safe and correct, however it's a pain to do this on the open-source chart version given various limitations (e.g. danger of running out of disk, where to run this/when, how to check for failures properly)
@fuziontech @tiina303 thoughts on how to do this safely on the charts? Or should we use a different strategy?
@fuziontech thanks, this simplifies things significantly. For context, I was using your swap scripts as a basis, but what you suggest def makes sense :)
In light of https://github.com/PostHog/plugin-server/pull/510: this implies we dropped all of our "huge" persons (ones with lots of `distinct_ids`), which might skew results in favour of this experiment (since `person.distinct_ids` doesn't exist for these persons anymore, while it does (maybe, depending on the batch) exist in the `person_distinct_id` table).
It's probably not a big deal, but it does reduce my confidence in the results a bit, and has some implications for my question earlier ( https://github.com/PostHog/posthog/issues/5167#issuecomment-887509115 )
This is going on the backburner. One issue we're running into is the 1MB row limits, and the wins are smaller than the deserialization costs.
We're going a different route now, with person-on-events instead
Is your feature request related to a problem?
Splitting out from https://github.com/PostHog/posthog/issues/5112
One potential speedup is having `distinct_ids` on the person table. Do some measurements, and if warranted, implement this to speed queries up!
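As a sketch, the schema change itself would be small. The table name and cluster name here are assumptions, not the actual migration:

```sql
-- Add an array column holding all distinct_ids known for the person
ALTER TABLE person ON CLUSTER posthog
    ADD COLUMN IF NOT EXISTS distinct_ids Array(String)
```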
Note: This could introduce new race conditions into plugin server, be aware
Soft-blocked on https://github.com/PostHog/posthog/issues/4242
Thank you for your feature request – we love each and every one!