Thanks for this! Very well thought-out and explained.
- `DROP DICTIONARY` -> `DROP DICTIONARY IF EXISTS` (although I think this is just what lingered from your cloud testing)
- `tuple()` usage when querying dictionaries (see https://github.com/ClickHouse/ClickHouse/pull/26130), although it seems we still allow versions as old as 21.6 for self-hosted folks (side q: should we change this?)
- `OPTIMIZE TABLE` might not behave as expected even when we set `mutations_sync`, so we need to test this carefully when writing the async migration (a sketch of the statement in question follows this list). Unsure if this is fixed on more recent versions, but essentially the server doesn't send anything back to the client and the client times out the connection with an error even if we increase all relevant timeouts (but the OPTIMIZE then goes on to complete successfully). Context here.
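For context on that last point, a minimal sketch of the kind of statement involved (the table name is hypothetical):

```sql
-- Ask data-rewriting operations to wait for completion on all replicas.
SET mutations_sync = 2;

-- Hypothetical table name. Per the note above, the server can finish this
-- successfully yet never send a response, so the client times out.
OPTIMIZE TABLE sharded_events FINAL;
```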
Did a bunch of benchmarking on this today. Q/A:

TL;DR: Yes! More cache is best on cloud to speed things along.
I ran a few experiments with different dictionary caching solutions. One result that stuck out to me was how long the backfill of 10M events took:

The largest win comes from adding even minimal caching, but wins continue with larger cache sizes. There are a few other knobs which sped things up:

This achieved the following results on the test data set:
```
┌─name─────────────────────┬─query_count─┬───────────hit_rate─┬─────────found_rate─┬─element_count─┬─────────load_factor─┬─allocated─┐
│ person_dict              │  1005861016 │ 0.9613943403886726 │ 0.9999996132666503 │       6309296 │  0.7521266937255859 │ 8.22 GiB  │
│ person_distinct_id2_dict │  1508791524 │ 0.9697972739936999 │ 0.9997154537302398 │       7295240 │ 0.10870754718780518 │ 3.37 GiB  │
│ groups_dict              │  5029305080 │ 0.9976188046241967 │ 0.9999977770288694 │         30856 │ 0.02942657470703125 │ 56.62 MiB │
└──────────────────────────┴─────────────┴────────────────────┴────────────────────┴───────────────┴─────────────────────┴───────────┘
```
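For reference, stats like the ones above can be pulled from `system.dictionaries`; the `allocated` column is presumably just a formatted `bytes_allocated`:

```sql
SELECT
    name,
    query_count,
    hit_rate,
    found_rate,
    element_count,
    load_factor,
    formatReadableSize(bytes_allocated) AS allocated
FROM system.dictionaries;
```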
On cloud we should use as much memory as possible for the dictionary caches, as large memory usage can make this complete in a reasonable timeframe (estimating a few days at most).

Note: I also tried out `complex_key_ssd_cache`, but this was discarded due to segfaults and issues on self-hosted external ClickHouse providers.

For self-hosted, let's choose a conservative caching option (using less than 500MB of RAM on ClickHouse) but with the option to increase these parameters as the migration is run; a sketch follows.
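As a rough sketch of what that could look like (the dictionary name, schema, and source settings here are illustrative, not the final migration code), a `complex_key_cache` layout where `SIZE_IN_CELLS` is the knob operators can raise to trade RAM for speed:

```sql
CREATE DICTIONARY IF NOT EXISTS person_dict
(
    team_id UInt64,
    id UUID,
    properties String,
    created_at DateTime
)
PRIMARY KEY team_id, id
SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000 DB 'default' TABLE 'person' USER 'default' PASSWORD ''))
-- SIZE_IN_CELLS is the main memory/speed trade-off: start conservative
-- (well under 500MB) and raise it while the migration runs if RAM allows.
LAYOUT(COMPLEX_KEY_CACHE(SIZE_IN_CELLS 1000000))
LIFETIME(60);
```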
After filling the new columns on the test dataset, here's how column statistics looked:
```
┌─column─────────────┬─parts─┬─────rows─┬─on_disk────┬─compressed──┬─uncompressed─┬─marks_size─┬─percentage_of_total─┐
│ person_properties  │   545 │ 67653790 │ 17.97 GiB  │ 17.97 GiB   │ 95.40 GiB    │ 742.34 KiB │ 52.7673%            │
│ properties         │   545 │ 67653790 │ 10.08 GiB  │ 10.08 GiB   │ 152.85 GiB   │ 742.34 KiB │ 29.6131%            │
│ elements_chain     │   545 │ 67653790 │ 2.10 GiB   │ 2.10 GiB    │ 27.29 GiB    │ 742.34 KiB │ 6.177%              │
│ uuid               │   545 │ 67653790 │ 1.00 GiB   │ 1023.77 MiB │ 1.01 GiB     │ 742.34 KiB │ 2.9383%             │
│ $window_id         │   545 │ 67653790 │ 747.25 MiB │ 746.52 MiB  │ 2.32 GiB     │ 742.34 KiB │ 2.1431%             │
│ created_at         │   545 │ 67653790 │ 425.51 MiB │ 424.79 MiB  │ 516.16 MiB   │ 742.34 KiB │ 1.2204%             │
│ timestamp          │   545 │ 67653790 │ 425.51 MiB │ 424.79 MiB  │ 516.16 MiB   │ 742.34 KiB │ 1.2204%             │
│ $session_id        │   545 │ 67653790 │ 412.40 MiB │ 411.68 MiB  │ 2.31 GiB     │ 742.34 KiB │ 1.1828%             │
│ distinct_id        │   545 │ 67653790 │ 353.12 MiB │ 352.40 MiB  │ 2.47 GiB     │ 742.34 KiB │ 1.0128%             │
│ _timestamp         │   545 │ 67653790 │ 237.44 MiB │ 236.72 MiB  │ 258.08 MiB   │ 742.34 KiB │ 0.681%              │
│ person_id          │   545 │ 67653790 │ 150.04 MiB │ 149.31 MiB  │ 1.01 GiB     │ 742.34 KiB │ 0.4303%             │
│ group0_properties  │   545 │ 67653790 │ 71.17 MiB  │ 70.44 MiB   │ 2.46 GiB     │ 742.34 KiB │ 0.2041%             │
│ person_created_at  │   545 │ 67653790 │ 65.97 MiB  │ 65.25 MiB   │ 516.16 MiB   │ 742.34 KiB │ 0.1892%             │
│ group2_properties  │   545 │ 67653790 │ 11.72 MiB  │ 10.99 MiB   │ 334.65 MiB   │ 742.34 KiB │ 0.0336%             │
│ $group_0           │   545 │ 67653790 │ 10.42 MiB  │ 9.70 MiB    │ 150.17 MiB   │ 742.34 KiB │ 0.0299%             │
│ group1_properties  │   545 │ 67653790 │ 9.43 MiB   │ 8.70 MiB    │ 394.63 MiB   │ 742.34 KiB │ 0.027%              │
│ $group_2           │   545 │ 67653790 │ 6.53 MiB   │ 5.81 MiB    │ 136.95 MiB   │ 742.34 KiB │ 0.0187%             │
│ event              │   545 │ 67653790 │ 5.71 MiB   │ 4.99 MiB    │ 1.00 GiB     │ 742.34 KiB │ 0.0164%             │
│ group0_created_at  │   545 │ 67653790 │ 5.42 MiB   │ 4.69 MiB    │ 516.16 MiB   │ 742.34 KiB │ 0.0155%             │
│ group2_created_at  │   545 │ 67653790 │ 4.64 MiB   │ 3.92 MiB    │ 516.16 MiB   │ 742.34 KiB │ 0.0133%             │
│ group1_created_at  │   545 │ 67653790 │ 3.94 MiB   │ 3.22 MiB    │ 516.16 MiB   │ 742.34 KiB │ 0.0113%             │
│ team_id            │   545 │ 67653790 │ 3.04 MiB   │ 2.32 MiB    │ 516.16 MiB   │ 742.34 KiB │ 0.0087%             │
│ group3_created_at  │   545 │ 67653790 │ 3.00 MiB   │ 2.28 MiB    │ 516.16 MiB   │ 742.34 KiB │ 0.0086%             │
│ group4_created_at  │   545 │ 67653790 │ 3.00 MiB   │ 2.28 MiB    │ 516.16 MiB   │ 742.34 KiB │ 0.0086%             │
│ _offset            │   545 │ 67653790 │ 3.00 MiB   │ 2.28 MiB    │ 516.16 MiB   │ 742.34 KiB │ 0.0086%             │
│ $group_1           │   545 │ 67653790 │ 2.86 MiB   │ 2.14 MiB    │ 118.24 MiB   │ 742.34 KiB │ 0.0082%             │
│ group3_properties  │   545 │ 67653790 │ 1.02 MiB   │ 305.01 KiB  │ 64.52 MiB    │ 742.34 KiB │ 0.0029%             │
│ $group_3           │   545 │ 67653790 │ 1.02 MiB   │ 304.92 KiB  │ 64.52 MiB    │ 742.34 KiB │ 0.0029%             │
│ $group_4           │   545 │ 67653790 │ 1.02 MiB   │ 304.80 KiB  │ 64.52 MiB    │ 742.34 KiB │ 0.0029%             │
│ group4_properties  │   545 │ 67653790 │ 1.02 MiB   │ 304.80 KiB  │ 64.52 MiB    │ 742.34 KiB │ 0.0029%             │
└────────────────────┴───────┴──────────┴────────────┴─────────────┴──────────────┴────────────┴─────────────────────┘
```
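For reference, per-column stats like these can be gathered from `system.parts_columns` along these lines (the `sharded_events` table name is an assumption):

```sql
SELECT
    column,
    count() AS parts,
    sum(rows) AS rows,
    formatReadableSize(sum(column_bytes_on_disk)) AS on_disk,
    formatReadableSize(sum(column_data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
    formatReadableSize(sum(column_marks_bytes)) AS marks_size,
    round(100 * sum(column_bytes_on_disk) / (
        SELECT sum(column_bytes_on_disk)
        FROM system.parts_columns
        WHERE table = 'sharded_events' AND active
    ), 4) AS percentage_of_total
FROM system.parts_columns
WHERE table = 'sharded_events' AND active
GROUP BY column
ORDER BY sum(column_bytes_on_disk) DESC;
```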
~54% of the space was taken up by the new columns. The main issue seems to be that the compression ratio isn't as good as for event properties.

This is likely an effect of our sort key: `(team_id, toDate(timestamp), event, cityHash64(distinct_id), cityHash64(uuid))` places the same event from a team near each other, and those rows likely have similar payloads, while person properties change from user to user and are thus more disparate.
In an ideal world, using the new JSON data type would offset the extra space usage (thanks to storing the data more effectively), but this feature is not ready for our use yet. See https://github.com/PostHog/posthog/issues/10506 for more details.
I'll explore alternative compression options in a follow-up comment as this much space usage hurts our bottom line a bit.
On my test instance, these tables took up the following amount of space.
```
┌─database─┬─name─────────────────────────┬─total_bytes─┬─formatReadableSize(total_bytes)─┐
│ default  │ tmp_person_0006              │ 61290502818 │ 57.08 GiB                       │
│ default  │ tmp_person_distinct_id2_0006 │ 25103022126 │ 23.38 GiB                       │
│ default  │ tmp_groups_0006              │    13317282 │ 12.70 MiB                       │
└──────────┴──────────────────────────────┴─────────────┴─────────────────────────────────┘
```
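(These numbers presumably come from something like the following query against `system.tables`; the name pattern is an assumption.)

```sql
SELECT
    database,
    name,
    total_bytes,
    formatReadableSize(total_bytes)
FROM system.tables
WHERE name LIKE 'tmp_%_0006'
ORDER BY total_bytes DESC;
```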
We're (nearly) doubling the space used by persons, but it's a rounding error compared to what's needed for the events table.
The largest backfill tested (with ample memory for caching) took 1323s for ~224M events. This means the migration can be expected to run in less than 3 days if started on each shard separately (and relying on replication).
During this time, queries against machines doing the update will be slower, as the migration does a lot of I/O (I saw >100MB/s write speeds regularly). As such, it'd be ideal to run this over a weekend, or if needed over multiple weekends.
We could add a replica to each shard with a low priority for doing this migration, but given its expected runtime this is likely not needed.
Sadly measuring all of this is taking a lot longer than I was expecting. I'm currently in the middle of:
Will post these summaries once they are ready.
I've summarized learnings on different codecs in https://github.com/PostHog/posthog/issues/10616, but the TL;DR is that I think we should use ZSTD(3) which provides a ~5.3x improvement over the default LZ4 compression.
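Applied to the new columns, the codec change would look something like this (table and column names assumed; as far as I understand, ClickHouse applies a codec change lazily, recompressing existing parts only as they are rewritten, e.g. during merges):

```sql
-- ZSTD(3) per the findings in #10616; repeat for the other new columns.
ALTER TABLE sharded_events
    MODIFY COLUMN person_properties String CODEC(ZSTD(3));
```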
I ran a few queries to compare the new and old schemas.
Note all queries are run both with and without page cache to provide realistic results.
Scripts and raw results can be found at https://github.com/PostHog/scratchpad/tree/main/karl/2022-07-14-person-on-events. For each tested query, the median of 3 runs was taken.
This query should be unaffected; it's provided as a baseline.
query | no cache duration | with page cache duration |
---|---|---|
Simple trend query | 243ms | 168ms |
query | no cache duration | with page cache duration |
---|---|---|
With JOIN against person_distinct_id2 | 1674ms | 1197ms |
Using events.person_id column | 133ms | 101ms |
Conclusion: This query is sped up significantly.
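For a sense of the two query shapes being compared, roughly (simplified; the team and date range are placeholders):

```sql
-- Old shape: resolve person_id through a JOIN on person_distinct_id2.
SELECT count(DISTINCT pdi.person_id)
FROM events e
INNER JOIN (
    SELECT distinct_id, argMax(person_id, version) AS person_id
    FROM person_distinct_id2
    GROUP BY distinct_id
) pdi ON e.distinct_id = pdi.distinct_id
WHERE e.team_id = 2 AND e.timestamp > now() - INTERVAL 7 DAY;

-- New shape: read the denormalized events.person_id column directly.
SELECT count(DISTINCT person_id)
FROM events
WHERE team_id = 2 AND timestamp > now() - INTERVAL 7 DAY;
```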
Note that these measurements are not perfect, as queries below ~1 second might fluctuate a lot.

The table includes queries against the `event.properties` column for reference.
query | no cache duration | with page cache duration |
---|---|---|
Trend query filtering on event.properties | 5664ms | 2022ms |
Trend query filtering on event.properties (ZSTD(3) compression) | 3338ms | 2192ms |
Trend query filtering on a materialized event property column | 258ms | 165ms |
Trend query filtering on person property (with join against person table) * | 44072ms | 25187ms |
Trend query filtering on person property (with join against person table, materialized column) * | 44002ms | 25195ms |
Trend query filtering on event.person_properties | 420ms | 356ms |
Trend query filtering on event.person_properties (ZSTD(3) compression) | 418ms | 383ms |
Trend query filtering on a materialized property on event.person_properties column | 102ms | 61ms |
Note on *: these queries OOM-ed without significantly increasing `max_memory_usage`.
Materialized column with join did not show a significant effect due to the size of the team under test.
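For context, the `event.person_properties` variants filter the JSON payload on the events table directly, roughly like this (the property name and team are placeholders):

```sql
SELECT count()
FROM events
WHERE team_id = 2
  AND JSONExtractString(person_properties, 'email') ILIKE '%@example.com';
```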
Next, finally on to implementation!
Closing this out as `0007_persons_and_groups_on_events_backfill.py` has been in the wild for some time now and is working.

If I am mistaken, feel free to reopen 🙂
## Is your feature request related to a problem?
We're nearing completion of the ingestion- and query-related changes for the persons-on-events project. This ticket outlines the plan for migrating to that schema by default on cloud and self-hosted.
## Current state
@yakkomajuri implemented posthog/management/commands/backfill_persons_and_groups_on_events.py, however the script has various issues:

- It relies on `ON CLUSTER`, which would cause issues for us on cloud due to backup nodes and other racy behaviors.

We're in a reasonably good state on cloud, where we have ~60% of disk free to do this migration as needed.
One complicating factor is the cloud infrastructure migration: ideally we start this migration once posthog-worker has been moved to the new infrastructure.
## Describe the plan
- Use the `system.clusters` table to get a representative node for each shard and use that in queries (see the sketch below).

You can find a WIP version of the backfill SQL here:
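Picking a representative node per shard could look roughly like this (the cluster name is an assumption):

```sql
SELECT shard_num, any(host_name) AS host
FROM system.clusters
WHERE cluster = 'posthog'
GROUP BY shard_num
ORDER BY shard_num;
```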
Ideally we will be able to build, test and ship this as part of release 1.38.0 - at least as an experimental migration.
## Work left out of this async migration
- `person_id`. This is vital, but also tricky to implement at the same time.

Thank you for your feature request - we love each and every one!