Closed macobo closed 7 months ago
Yeah so there's 2 actionable points here:
ReplicatedReplacingMergeTree
when CLICKHOUSE_REPLICATION
is set. Change the default schema for the ReplicatedReplacingMergeTree when CLICKHOUSE_REPLICATION is set.
Handled under https://github.com/PostHog/posthog/issues/8652
Write a special/big/async migration to correct it - have been thinking about this
If we do this, note this only affects posthog cloud and we might get away with something simpler.
Note that TODAY plugin logs will return random data instead of complete logs due to this issue.
Note from the future. The correct schema here would be to actually be sharding these tables and use a Distributed
engine table to query them. This avoids double-storing each event.
A similar architecture is described under https://posthog.com/handbook/engineering/databases/event-ingestion under sharded events ingestion.
I don't think this is an issue anymore, but let me know otherwise @fuziontech
Bug description
Some of our table schemas are seemingly set up wrong on Cloud:
ReplicatedReplacingMergeTree('/clickhouse/tables/{shard}/posthog.events_dead_letter_queue', '{replica}', _timestamp) ORDER BY (id, event_uuid, distinct_id, team_id) TTL toDate(_timestamp) + toIntervalWeek(4) SETTINGS index_granularity = 512
ReplicatedReplacingMergeTree('/clickhouse/tables/{shard}/posthog.plugin_log_entries', '{replica}', _timestamp) PARTITION BY plugin_id ORDER BY (team_id, id) TTL toDate(timestamp) + toIntervalWeek(1) SETTINGS index_granularity = 512
This means that these tables are sharded and when we
sync_execute
a query these will return data from a random shard instead of all of them.Metabase query
Expected behavior
I believe these tables should not be sharded, so the engine call should look something like others, e.g.
Environment
Only affects PostHog Cloud
Additional context
cc @yakkomajuri @guidoiaquinti @tiina303 @fuziontech - something for your backlog
Note that this is one of the cases where CLICKHOUSE_REPLICATION sets things up in the wrong way. Luckily only used on cloud :crossed_fingers:
Why this issue matters? We've run into issues where cloud set up tables are wrong multiple times already, costing hours (days?) of work and in the case of groups, it contributed to losing a weeks worth of data.
Other oddities
These I'm not sure of:
ReplicatedReplacingMergeTree('/clickhouse/prod/tables/noshard/person_static_cohort', '{replica}', _timestamp)
ReplicatedReplacingMergeTree('/clickhouse/tables/{shard}/posthog.plugin_log_entries', '{replica}', _timestamp)
In other noshard cases the replica_key is
{replica}-{shard}
so is using{replica}
even correct?Thank you for your bug report – we love squashing them!