Experiment: Measure impact of `distinct_ids` on person table #5167

Closed · macobo closed this issue 2 years ago

macobo commented 3 years ago

Is your feature request related to a problem?

Splitting out from https://github.com/PostHog/posthog/issues/5112

One potential speedup is having distinct_ids on the person table. Do some measurements and, if needed, implement this to speed queries up!
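
Concretely, the experiment boils down to adding a distinct_ids array column to the ClickHouse person table so queries can skip the person_distinct_id join. A minimal sketch of that schema change (column type and cluster name are assumptions, not the actual migration):

    -- Assumed shape of the experimental column; the real migration may differ
    ALTER TABLE person ON CLUSTER posthog
        ADD COLUMN IF NOT EXISTS distinct_ids Array(String)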

Note: this could introduce new race conditions in the plugin server, so be aware.

Soft-blocked on https://github.com/PostHog/posthog/issues/4242

Thank you for your feature request – we love each and every one!

macobo commented 3 years ago

The experiment is live and gathering data.

Some gotchas for when we're moving to the full solution:

macobo commented 3 years ago

Been running some measurements on this experiment using clickhouse-flamegraph and wrapping it in my own script to make measurements easier.

I found a pretty remarkable trends query (1) which was doing a lot of inline distinct_id joins and ran in ~23 seconds, reading 25393920 rows.

I modified the query (2) to only look at rows where data is present and it ran in 3146ms, reading the same number of rows. Lastly (3), I replaced the inline queries with a join based on the distinct_ids column, which ran in 1156ms, reading 12027294 rows.

Queries + the CPU flamegraphs can be found under https://github.com/PostHog/scratchpad/tree/main/karl/2021-07-23-person-distinct_ids-column

From the flame graphs one can see that a lot less work is being done joining the data together (JoinSideTransform) and on reading, which matches expectations.

Worth noting that most queries will not see an improvement this remarkable - the original query had 10 separate person_distinct_id subqueries.
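
For a rough sense of the shape change (heavily simplified, with placeholder values; the full queries and flamegraphs are in the scratchpad link above, and the dedup handling via argMax/is_deleted is omitted):

    -- (1) Before (sketch): each person-level filter runs its own inline
    -- subquery against person_distinct_id; the real query had ~10 of these
    SELECT count()
    FROM events
    WHERE team_id = 2
      AND distinct_id IN (
          SELECT distinct_id
          FROM person_distinct_id
          WHERE team_id = 2
      )

    -- (3) After (sketch): a single join against person, expanding the new
    -- distinct_ids column so it can be matched against events.distinct_id
    SELECT count()
    FROM events e
    INNER JOIN (
        SELECT id, distinct_ids
        FROM person
        ARRAY JOIN distinct_ids
        WHERE team_id = 2
    ) AS p ON p.distinct_ids = e.distinct_id
    WHERE e.team_id = 2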

Conclusion: This is a worthwhile investment. Will sync with team and see what they think!

macobo commented 3 years ago

@EDsCODE noted two gotchas:

macobo commented 3 years ago

Another example (on a large client) can be found in https://github.com/PostHog/scratchpad/tree/main/karl/2021-07-26-person-join-trends.

This is a very basic trends query. The original query performs as follows:

    query_duration_ms: 9654
    read_rows:         216619277
    read_size:         335.89 GiB
    result_rows:       1
    result_size:       256.00 B
    memory_usage:      3.29 GiB

Using a join on person table only:

    query_duration_ms: 5493
    read_rows:         186371941
    read_size:         334.37 GiB
    result_rows:       1
    result_size:       256.00 B
    memory_usage:      1.37 GiB

This is a 43% timing improvement.

From looking at the CPU flamegraph the win is not obvious: in both cases the majority of time is spent reading data from disk. However, looking at the Real time graph shows that in the query with a join, threads spend much less time waiting for results from other threads.

Memory usage also significantly improved, probably because only part of the person table needed to be kept in memory instead of all of person_distinct_id + person.
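
As an aside, the stats above are the standard per-query stats ClickHouse records; something along these lines against system.query_log pulls them out (the query_id value is a placeholder):

    -- Duration/rows/memory stats for a finished query, from the query log
    SELECT
        query_duration_ms,
        read_rows,
        formatReadableSize(read_bytes)   AS read_size,
        result_rows,
        formatReadableSize(result_bytes) AS result_size,
        formatReadableSize(memory_usage) AS memory
    FROM system.query_log
    WHERE type = 'QueryFinish'
      AND query_id = 'some-query-id'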

macobo commented 3 years ago

Notes on how we might implement this. It's kind of hard/annoying.

1. Migrating data

This is a pretty large-scale data migration. Complicating factors are:

  • this also needing to work on our new helm charts
  • ideally not causing any downtime (as a naive migration script could cause)

Similar to how we've "repaired" person_distinct_id tables in the past, we could:

  1. Temporarily drop kafka_person and person_mv tables
  2. Copy person table (with query that includes distinct_ids from plugin-server) into a CSV file
  3. Create a person_swap table and insert the values from the CSV file
  4. RENAME TABLE posthog.person to posthog.tmp_person, posthog.swap_person to posthog.person, posthog.tmp_person to posthog.swap_person on cluster posthog
  5. Recreate kafka_person and person_mv tables
  6. (Once everything is stable) Drop the old person table

This is both safe and correct; however, it's a pain to do on the open-source chart version given various limitations (e.g. the danger of running out of disk, where/when to run this, how to check for failures properly).
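
For reference, a rough DDL sketch of steps 1, 4 and 5 (names as above; the CSV export/load in steps 2-3 and the exact schemas of the recreated tables are omitted):

    -- Step 1: stop ingestion into person while the swap happens
    DROP TABLE IF EXISTS posthog.kafka_person ON CLUSTER posthog;
    DROP TABLE IF EXISTS posthog.person_mv ON CLUSTER posthog;

    -- Steps 2-3: export person (+ distinct_ids) to CSV, load into swap_person

    -- Step 4: swap the tables
    RENAME TABLE
        posthog.person      TO posthog.tmp_person,
        posthog.swap_person TO posthog.person,
        posthog.tmp_person  TO posthog.swap_person
    ON CLUSTER posthog;

    -- Step 5: recreate kafka_person and person_mv with their original
    -- definitions; step 6 is dropping posthog.swap_person once stable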

@fuziontech @tiina303 thoughts on how to do this safely on the charts? Or should we use a different strategy?

2. Updating plugin server

As mentioned above, the current implementation has some pretty bad race conditions.

I think the best option is to:

  1. Lock the row for other reads from plugin server in a transaction whenever a read would result in an UPDATE
  2. (Optional, optimization) Merge as many person updates in https://github.com/PostHog/plugin-server/blob/master/src/worker/ingestion/process-event.ts into one - we currently can do up to 4 different UPDATE/CREATE queries for the same event.

Note that this comes with a significant downside.

Fetching a person is more expensive now: every time identify/alias/$set happens we need to fetch distinct_ids to insert into ClickHouse. That said, we're already fetching all properties for every event.

This might easily become an ingestion pipeline bottleneck as the query is the most expensive one we do in plugin server.
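
To make option 1 concrete, the shape of it on the Postgres side would be roughly the following (table/column values are illustrative, not the actual plugin-server queries):

    -- Illustrative only: serialize concurrent writers to the same person by
    -- taking a row lock before deciding whether an UPDATE is needed
    BEGIN;

    SELECT id, properties
    FROM posthog_person
    WHERE team_id = 2 AND id = 42
    FOR UPDATE;  -- other transactions touching this row wait until COMMIT

    -- ...plugin server decides here whether properties/distinct_ids changed...

    UPDATE posthog_person
    SET properties = '{"plan": "scale"}'::jsonb
    WHERE team_id = 2 AND id = 42;

    COMMIT;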

3. Updating queries

The 'easy' part - once 1 and 2 are live, we can update most queries one-by-one. In most cases we'll just join events with something like:

    FROM events e
    INNER JOIN (
        SELECT id, properties, distinct_ids
        FROM (
            SELECT
                id,
                argMax(properties, person._timestamp) as properties,
                argMax(distinct_ids, person._timestamp) as distinct_ids,
                max(is_deleted) as is_deleted
            FROM
                person
            WHERE
                team_id = X
            GROUP BY
                id
            HAVING
                is_deleted = 0
        )
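        -- ARRAY JOIN unnests distinct_ids, so each person id yields one row per distinct_id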
        ARRAY JOIN distinct_ids
    ) as p ON p.distinct_ids = e.distinct_id

4. Dropping person_distinct_id

We should only do this a couple of months down the line, once we've updated all queries and everything is stable.

macobo commented 3 years ago

Q from Neil: Does this slow down queries that only join against person_distinct_id?

fuziontech commented 3 years ago

1. Migrating data

This is a pretty large-scale data migration. Complicating factors are:

  • this also needing to work on our new helm charts
  • ideally not causing any downtime (as a naive migration script could cause)

Seems like reasonable requirements

Similar to how we've "repaired" person_distinct_id tables in the past, we could:

  1. Temporarily drop kafka_person and person_mv tables
  2. Copy person table (with query that includes distinct_ids from plugin-server) into a CSV file
  3. Create a person_swap table and insert the values from the CSV file

This is a bad pattern. If you can avoid copying data out of the database, you should as much as possible. This can be done with a simple CREATE TABLE AS SELECT (https://clickhouse.tech/docs/en/sql-reference/statements/create/table/#from-select-query), avoiding copying anything into or out of the database. The nice thing is that even with a sharded table this reduces load on the cluster, since all of the copying is done within the same shard on the same node instead of distributing the reads and writes to create one giant CSV.
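
A sketch of that pattern (engine, ordering key and the distinct_ids expression are placeholders; the real statement would mirror the existing person schema):

    -- Build the replacement table inside the database instead of round-tripping
    -- through a CSV; on a sharded cluster this runs shard-locally on each node
    CREATE TABLE posthog.swap_person
    ENGINE = ReplacingMergeTree(_timestamp)
    ORDER BY (team_id, id) AS
    SELECT
        id,
        team_id,
        properties,
        is_deleted,
        _timestamp,
        CAST([] AS Array(String)) AS distinct_ids  -- to be filled from plugin-server data
    FROM posthog.person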

  1. RENAME TABLE posthog.person to posthog.tmp_person, posthog.swap_person to posthog.person, posthog.tmp_person to posthog.swap_person on cluster posthog
  2. Recreate kafka_person and person_mv tables
  3. (Once everything is stable) Drop the old person table

This is both safe and correct; however, it's a pain to do on the open-source chart version given various limitations (e.g. the danger of running out of disk, where/when to run this, how to check for failures properly).

@fuziontech @tiina303 thoughts on how to do this safely on the charts? Or should we use a different strategy?

macobo commented 3 years ago

@fuziontech thanks, this simplifies things significantly. For context, I was using your swap scripts as a basis, but what you suggest def makes sense :)

neilkakkar commented 3 years ago

In light of https://github.com/PostHog/plugin-server/pull/510: this implies we dropped all of our "huge" persons (lots of distinct_ids), which might skew results in favour of this experiment (since person.distinct_ids doesn't exist for these persons anymore, while it does (maybe, depending on the batch) exist in the person_distinct_id table).

It's probably not a big deal, but it does reduce my confidence in the results a bit, and has some implications for my question earlier (https://github.com/PostHog/posthog/issues/5167#issuecomment-887509115).

macobo commented 3 years ago

This is going on the back burner. One issue we're running into is the 1MB row limit, plus the wins are smaller than the deserialization costs.

neilkakkar commented 2 years ago

We're going a different route now, with person-on-events instead.