fdaugs opened this issue 1 year ago
I'm getting the same error. @fdaugs did you manage to figure this out by any chance?
@thelazyoxymoron No, sadly not. I posted it into the slack workspace twice, but nobody reacted. We're kinda forced to migrate to Cloud now...
Yes, I realized that as well. All the self-hosted related queries in slack/github go unanswered now.
I had the same problem, and after a few hours of debugging I got it fixed; everything on my PostHog cluster is running great again. An actual code fix from PostHog would be ideal, but in the meantime, here are my steps to fix the problem. I had to do this on two different clusters that had the same problem, and the steps were exactly the same.
My steps, however, are specifically for people still deploying with Kubernetes. If you're on Docker, I suspect you could adapt them to the Docker deployment and get the same outcome.
To start, this article was incredibly helpful: https://altinity.com/blog/fixing-the-dreaded-clickhouse-crash-loop-on-kubernetes. To fix the problem we need to force the ClickHouse pod to start without actually running clickhouse-server, and then keep it running. That's done by adding the following to the ClickHouseInstallation (read the article for more information on where it goes; a sketch of applying it with kubectl follows the snippet):
# Add command to bring up pod and stop.
command:
  - "/bin/bash"
  - "-c"
  - "sleep 9999999"
# Fix liveness probe so that we won't look for ClickHouse.
livenessProbe:
  exec:
    command:
      - ls
  initialDelaySeconds: 5
  periodSeconds: 5
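In case it helps, here's a minimal sketch of applying that override; the namespace and resource name below are assumptions, so check what your cluster actually uses:

# Namespace and ClickHouseInstallation name are assumptions -- adjust to your install.
kubectl -n posthog get chi                  # list ClickHouseInstallation resources
kubectl -n posthog edit chi posthog         # add the command/livenessProbe override shown above
kubectl -n posthog get pods -w              # wait for the clickhouse pod to restart and stay Running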
Once the clickhouse pod is running, exec into it and cd into /var/lib/clickhouse/metadata/posthog. The fix is to modify the SQL schema to remove the DEFAULT value on a column, which is the bit that the Kafka engine doesn't support. There are two files in the database metadata that have this problem: kafka_person_distinct_id2.sql and kafka_person.sql. If you print out the contents of these, you can see they both have a DEFAULT on a column in the table. Remove those defaults, and the server will work again. Here are the commands I used to replace the files, but you should obviously compare my schema with the one on your server before overwriting it.
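If it's useful, a minimal sketch of getting a shell into the pod first; the pod name and namespace are assumptions (typical Altinity operator naming), so check kubectl get pods for the real one:

# Pod name and namespace are assumptions -- look them up with `kubectl get pods -A`.
kubectl -n posthog exec -it chi-posthog-posthog-0-0-0 -- bash
cd /var/lib/clickhouse/metadata/posthog
cat kafka_person_distinct_id2.sql kafka_person.sql   # inspect the DEFAULT clauses before overwriting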
cat > kafka_person_distinct_id2.sql << EOF
ATTACH TABLE kafka_person_distinct_id2
(
\`team_id\` Int64,
\`distinct_id\` String,
\`person_id\` UUID,
\`is_deleted\` Int8,
\`version\` Int64
)
ENGINE = Kafka('posthog-posthog-kafka:9092', 'clickhouse_person_distinct_id', 'group1', 'JSONEachRow')
EOF
cat > kafka_person.sql << EOF
ATTACH TABLE kafka_person
(
\`id\` UUID,
\`created_at\` DateTime64(3),
\`team_id\` Int64,
\`properties\` String,
\`is_identified\` Int8,
\`is_deleted\` Int8,
\`version\` UInt64
)
ENGINE = Kafka('posthog-posthog-kafka:9092', 'clickhouse_person', 'group1', 'JSONEachRow')
EOF
After changing the files, run ClickHouse manually by calling clickhouse-server -C /etc/clickhouse-server/config.xml. If there are any other Kafka schema files that need to be updated, you'll see the crash happen pretty quickly and can go fix those and repeat (there's also a quick grep sketched after this paragraph to catch them up front). For me that's all it took to fix the issues and get ClickHouse running again. I left it running in this state for a few minutes to make sure everything was working, then reverted the config on the ClickHouseInstallation and redeployed the clickhouse pod to get it managed by Kubernetes and automatically running the server again.
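If you'd rather catch any stragglers before starting the server, a quick check along these lines should work, assuming all of the Kafka engine schemas live in the same metadata directory:

# Any file listed here still carries a DEFAULT clause, which the Kafka engine rejects on attach.
grep -l "DEFAULT" /var/lib/clickhouse/metadata/posthog/kafka_*.sql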
I really hope that helps someone!
I was able to fix it by connecting to the clickhouse container:
docker-compose run clickhouse bash
and then fixed everything inside.
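For anyone on the Docker path, a rough sketch of what that can look like, assuming the compose service is named clickhouse and the data volume is mounted at the default /var/lib/clickhouse:

# Service name and paths are assumptions -- check your docker-compose.yml.
docker-compose run clickhouse bash
cd /var/lib/clickhouse/metadata/posthog
grep -n "DEFAULT" kafka_*.sql    # find the offending columns
# then remove the DEFAULT clauses from those files, as in the Kubernetes steps above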
Thanks for the detailed explanation. I'm curious when this happened and why?
Bug description
When upgrading self-hosted to latest, ClickHouse does not start anymore.
How to reproduce
Use latest as the app tag.
Environment
docker compose, commit: 5f936b45f0719ac0ce59fb8310a5a9810d2f1781
Additional context
This is the error in the ClickHouse service: