fdaugs opened this issue 1 year ago
I'm getting the same error. @fdaugs did you manage to figure this out by any chance?
@thelazyoxymoron No, sadly not. I posted it into the slack workspace twice, but nobody reacted. We're kinda forced to migrate to Cloud now...
Yes, I realized that as well. All the self-hosted related queries in slack/github go unanswered now.
I had the same problem, and after a few hours of debugging I got it fixed; everything on my PostHog cluster is running great again. An actual code fix from PostHog would be ideal, but in the meantime, here are my steps to fix the problem. I had to do this on two different clusters that had the same problem, and the steps were exactly the same.
My steps, however, are specifically for people still deploying with Kubernetes. If you're on Docker, I suspect you could adapt them to the Docker deployment and get the same outcome.
To start, this article was incredibly helpful: https://altinity.com/blog/fixing-the-dreaded-clickhouse-crash-loop-on-kubernetes. To fix the problem we need to force the ClickHouse pod to start without actually running clickhouse-server, and then keep it running. That's done by adding the following to the ClickHouseInstallation (read the article for more information on where it goes; a sketch of applying it with kubectl follows the snippet):
# Add command to bring up pod and stop.
command:
  - "/bin/bash"
  - "-c"
  - "sleep 9999999"
# Fix liveness probe so that we won't look for ClickHouse.
livenessProbe:
  exec:
    command:
      - ls
  initialDelaySeconds: 5
  periodSeconds: 5
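In case it helps, here's a minimal sketch of applying that override; the namespace and resource name below are assumptions, so check what your cluster actually uses:

# Namespace and ClickHouseInstallation name are assumptions -- adjust to your install.
kubectl -n posthog get chi                  # list ClickHouseInstallation resources
kubectl -n posthog edit chi posthog         # add the command/livenessProbe override shown above
kubectl -n posthog get pods -w              # wait for the clickhouse pod to restart and stay Running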
Once the clickhouse pod is running, exec into it and cd into /var/lib/clickhouse/metadata/posthog. The fix is to modify the SQL schema to remove the DEFAULT value on a column, which is the bit that the Kafka engine doesn't support. There are two files in the database metadata that have this problem: kafka_person_distinct_id2.sql and kafka_person.sql. If you print out the contents of these, you can see they both have a DEFAULT on a column in the table. Remove those defaults, and the server will work again. Here are the commands I used to replace the files, but you should obviously compare my schema with the one on your server before overwriting it.
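If it's useful, a minimal sketch of getting a shell into the pod first; the pod name and namespace are assumptions (typical Altinity operator naming), so check kubectl get pods for the real one:

# Pod name and namespace are assumptions -- look them up with `kubectl get pods -A`.
kubectl -n posthog exec -it chi-posthog-posthog-0-0-0 -- bash
cd /var/lib/clickhouse/metadata/posthog
cat kafka_person_distinct_id2.sql kafka_person.sql   # inspect the DEFAULT clauses before overwriting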
cat > kafka_person_distinct_id2.sql << EOF
ATTACH TABLE kafka_person_distinct_id2
(
\`team_id\` Int64,
\`distinct_id\` String,
\`person_id\` UUID,
\`is_deleted\` Int8,
\`version\` Int64
)
ENGINE = Kafka('posthog-posthog-kafka:9092', 'clickhouse_person_distinct_id', 'group1', 'JSONEachRow')
EOF
cat > kafka_person.sql << EOF
ATTACH TABLE kafka_person
(
\`id\` UUID,
\`created_at\` DateTime64(3),
\`team_id\` Int64,
\`properties\` String,
\`is_identified\` Int8,
\`is_deleted\` Int8,
\`version\` UInt64
)
ENGINE = Kafka('posthog-posthog-kafka:9092', 'clickhouse_person', 'group1', 'JSONEachRow')
EOF
After changing the files, run ClickHouse manually by calling clickhouse-server -C /etc/clickhouse-server/config.xml. If there are any other Kafka schema files that need to be updated, you'll see the crash happen pretty quickly and can go fix those and repeat (there's also a quick grep sketched after this paragraph to catch them up front). For me that's all it took to fix the issues and get ClickHouse running again. I left it running in this state for a few minutes to make sure everything was working, then reverted the config on the ClickHouseInstallation and redeployed the clickhouse pod to get it managed by Kubernetes and automatically running the server again.
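If you'd rather catch any stragglers before starting the server, a quick check along these lines should work, assuming all of the Kafka engine schemas live in the same metadata directory:

# Any file listed here still carries a DEFAULT clause, which the Kafka engine rejects on attach.
grep -l "DEFAULT" /var/lib/clickhouse/metadata/posthog/kafka_*.sql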
I really hope that helps someone!
I was able to fix it by connecting to the clickhouse container:
docker-compose run clickhouse bash
and then fixed everything inside.
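For anyone on the Docker path, a rough sketch of what that can look like, assuming the compose service is named clickhouse and the data volume is mounted at the default /var/lib/clickhouse:

# Service name and paths are assumptions -- check your docker-compose.yml.
docker-compose run clickhouse bash
cd /var/lib/clickhouse/metadata/posthog
grep -n "DEFAULT" kafka_*.sql    # find the offending columns
# then remove the DEFAULT clauses from those files, as in the Kubernetes steps above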
Thanks for the detailed explanation. I'm curious when this happened and why?
Bug description
When upgrading self-hosted to latest, ClickHouse does not start anymore.
How to reproduce
Use latest as the app tag.
Environment
docker compose, commit: 5f936b45f0719ac0ce59fb8310a5a9810d2f1781
Additional context
This is the error in the ClickHouse service: