PostHog / charts-clickhouse

Helm chart for deploying PostHog with ClickHouse on your K8s infrastructure
MIT License
56 stars 74 forks source link

Migration job fails with Table posthog.infi_clickhouse_orm_migrations doesn't exist #502

Open charlrvd opened 2 years ago

charlrvd commented 2 years ago

Bug description

New installation of the chart fails on migrations. Therefore the deployment is not usable as the plugins, web, worker and events are not passing the init phase.

The SQL migrations run fine but soon after the following error happens

πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»
Skipping async migrations setup. This is unsafe in production!
πŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”Ί

AXES: BEGIN LOG
AXES: BEGIN LOG
AXES: Using django-axes version 5.9.0
AXES: Using django-axes version 5.9.0
AXES: blocking by IP only.
AXES: blocking by IP only.
Report for e1561f5b65f841e5cd9c5ccbee3d1a1f:
{'chart_version': '26.2.0',
 'cloud': 'aws',
 'deployment': 'helm_aws_ha',
 'deployment_type': 'helm',
 'hostname': 'posthog.domain',
 'ingress_type': 'nginx',
 'kube_version': 'v1.22.10-eks-84b4fe6',
 'operation': 'install',
 'posthog_version': '1.38.0',
 'release_name': 'posthog',
 'release_revision': 1}

πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»
Skipping async migrations setup. This is unsafe in production!
πŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”Ί

AXES: BEGIN LOG
AXES: BEGIN LOG
AXES: Using django-axes version 5.9.0
AXES: Using django-axes version 5.9.0
AXES: blocking by IP only.
AXES: blocking by IP only.
Operations to perform:
  Apply all migrations: admin, auth, axes, contenttypes, ee, posthog, rest_hooks, sessions, social_django
Running migrations:
  No migrations to apply.

πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»πŸ”»
Skipping async migrations setup. This is unsafe in production!
πŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”ΊπŸ”Ί

AXES: BEGIN LOG
AXES: BEGIN LOG
AXES: Using django-axes version 5.9.0
AXES: Using django-axes version 5.9.0
AXES: blocking by IP only.
AXES: blocking by IP only.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/infi/clickhouse_orm/database.py", line 362, in _get_applied_migrations_and_create_tables
    return self._get_applied_migrations(migrations_package_name, replicated)
  File "/usr/local/lib/python3.8/site-packages/infi/clickhouse_orm/database.py", line 379, in _get_applied_migrations
    return set(obj.module_name for obj in self.select(query))
  File "/usr/local/lib/python3.8/site-packages/infi/clickhouse_orm/database.py", line 379, in <genexpr>
    return set(obj.module_name for obj in self.select(query))
  File "/usr/local/lib/python3.8/site-packages/infi/clickhouse_orm/database.py", line 276, in select
    r = self._send(query, settings, True)
  File "/usr/local/lib/python3.8/site-packages/infi/clickhouse_orm/database.py", line 389, in _send
    raise ServerError(r.text)
infi.clickhouse_orm.database.ServerError: Code: 60. DB::Exception: Table posthog.infi_clickhouse_orm_migrations doesn't exist. (UNKNOWN_TABLE) (version 22.3.6.5 (official build))
 (0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "manage.py", line 21, in <module>
    main()
  File "manage.py", line 17, in main
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python3.8/site-packages/django/core/management/__init__.py", line 419, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python3.8/site-packages/django/core/management/__init__.py", line 413, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python3.8/site-packages/django/core/management/base.py", line 354, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/local/lib/python3.8/site-packages/django/core/management/base.py", line 398, in execute
    output = self.handle(*args, **options)
  File "/code/posthog/management/commands/migrate_clickhouse.py", line 42, in handle
    self.migrate(CLICKHOUSE_HTTP_URL, options)
  File "/code/posthog/management/commands/migrate_clickhouse.py", line 80, in migrate
    database.migrate(MIGRATIONS_PACKAGE_NAME, options["upto"], replicated=CLICKHOUSE_REPLICATION)
  File "/usr/local/lib/python3.8/site-packages/infi/clickhouse_orm/database.py", line 346, in migrate
    applied_migrations = self._get_applied_migrations_and_create_tables(migrations_package_name, replicated=replicated)
  File "/usr/local/lib/python3.8/site-packages/infi/clickhouse_orm/database.py", line 371, in _get_applied_migrations_and_create_tables
    return self._get_applied_migrations_and_create_tables(migrations_package_name, replicated, allow_missing_tables=False)
  File "/usr/local/lib/python3.8/site-packages/infi/clickhouse_orm/database.py", line 359, in _get_applied_migrations_and_create_tables
    return self._get_applied_migrations(migrations_package_name, replicated)
  File "/usr/local/lib/python3.8/site-packages/infi/clickhouse_orm/database.py", line 379, in _get_applied_migrations
    return set(obj.module_name for obj in self.select(query))
  File "/usr/local/lib/python3.8/site-packages/infi/clickhouse_orm/database.py", line 379, in <genexpr>
    return set(obj.module_name for obj in self.select(query))
  File "/usr/local/lib/python3.8/site-packages/infi/clickhouse_orm/database.py", line 276, in select
    r = self._send(query, settings, True)
  File "/usr/local/lib/python3.8/site-packages/infi/clickhouse_orm/database.py", line 389, in _send
    raise ServerError(r.text)
infi.clickhouse_orm.database.ServerError: Code: 60. DB::Exception: Table posthog.infi_clickhouse_orm_migrations doesn't exist. (UNKNOWN_TABLE) (version 22.3.6.5 (official build))
 (0)

Expected behavior

The installation completes after the migrations all runs to completion

Actual behavior

Migration fails to run on new deployment.

How to reproduce

Deploying the helm chart with

helm upgrade -install \
  -f values.yaml \
  --timeout 20m \
  --namespace posthog \
  posthog posthog/posthog \
  --wait --wait-for-jobs --debug

tried both in upgrade --install and just install but if fails with the same issues.

Environment

Additional context

guidoiaquinti commented 2 years ago

πŸ‘‹ Hey @charlrvd, can you please share your values.yaml (redacting any sensitive information)?

charlrvd commented 2 years ago

Sure, here it is

cloud: "aws"
ingress:
  hostname: posthog.domain
  nginx:
    enabled: true
  letsencrypt: false
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt
  secretName: posthog-tls-cert
cert-manager:
  enabled: false

clickhouse:
  resources:
    limits:
      cpu: 400m
      memory: 6Gi
    requests:
      cpu: 400m
      memory: 6Gi
  persistence:
    enabled: true
    storageClass: standard
    size: 100Gi

kafka:
  replicaCount: 3
  resources:
    requests:
      cpu: 100m
      memory: 2Gi
    limits:
      cpu: 250m
      memory: 2Gi
  livenessProbe:
    enabled: true
    initialDelaySeconds: 30
    timeoutSeconds: 15
    failureThreshold: 3
    periodSeconds: 30
    successThreshold: 1
  readinessProbe:
    enabled: true
    initialDelaySeconds: 15
    failureThreshold: 6
    timeoutSeconds: 15
    periodSeconds: 30
    successThreshold: 1
  persistence:
    storageClass: standard
    size: 50Gi
  metrics:
    kafka:
      enabled: true

redis:
  architecture: replication
  replica:
    replicaCount: 3
    resources:
      limits:
        cpu: 250m
        memory: 256Mi
      requests:
        cpu: 250m
        memory: 256Mi
  master:
    persistence:
      storageClass: standard
      size: 20Gi

zookeeper:
  persistence:
    storageClass: standard
    size: 20Gi

postgresql:
  enabled: false
externalPostgresql:
  postgresqlHost: posthog.rds.amazonaws.com
  postgresqlPort: 5432
  postgresqlDatabase: posthog
  postgresqlUsername: admin
  existingSecret: postgres-password
  existingSecretPasswordKey: password

grafana:
  enabled: true
prometheus:
  enabled: true
  alertmanager:
    persistentVolume:
      storageClass: standard
      size: 10Gi
  server:
    persistentVolume:
      storageClass: standard
      size: 100Gi

prometheus-kafka-exporter:
  enabled: true
prometheus-postgres-exporter:
  enabled: true
  config:
    datasource:
      host: posthog.rds.amazonaws.com
      user: admin
      database: posthog
      passwordSecret:
        name: postgres-password
        key: password
prometheus-redis-exporter:
  enabled: true
charlrvd commented 2 years ago

Mh, I realized if I run the install on a brand new namespace, with no pre-existing PVC, then the install goes all right. I think my issue is that I have pre-existing PVC from an install that went nuts after an upgrade. I upgraded, with the same process as usual, from 26.0.1 to 26.0.9. This upgrade worked but for the first time in 6 months of running posthog, it created a new PV and PVC for clickhouse, so all the old data was missing. I created a PVC manually from the existing PV and set the value clickhouse.persistence.existingClaim and did the exact same command to upgrade the to version it's already at. But that failed with this error

Set configuration api_version=(2, 5, 0) to skip auto check_version requests on startup
2022-08-01T23:37:58.098465Z [info     ] Schema out of sync on some clickhouse nodes! [ee.management.commands.sync_replicated_schema] out_of_sync_hosts={'chi-posthog-posthog-0-0-0': {'person_distinct_id2_mv', 'groups_mv', 'events_dead_letter_queue', 'events_dead_letter_queue_mv', 'sharded_events', 'person', 'sharded_session_recording_events', 'events_json_mv', 'plugin_log_entries', 'groups', 'cohortpeople', 'person_static_cohort', 'session_recording_events_mv', 'plugin_log_entries_mv', 'person_distinct_id_mv', 'person_distinct_id2', 'person_mv'}}
2022-08-01T23:37:58.098828Z [info     ] Creating missing tables        [ee.management.commands.sync_replicated_schema] missing_tables={'person_distinct_id2_mv', 'groups_mv', 'events_dead_letter_queue', 'events_dead_letter_queue_mv', 'sharded_events', 'person', 'sharded_session_recording_events', 'events_json_mv', 'plugin_log_entries', 'groups', 'cohortpeople', 'person_static_cohort', 'session_recording_events_mv', 'person_mv', 'person_distinct_id_mv', 'person_distinct_id2', 'plugin_log_entries_mv'}
Traceback (most recent call last):
  File "manage.py", line 21, in <module>
    main()
  File "manage.py", line 17, in main
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python3.8/site-packages/django/core/management/__init__.py", line 419, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python3.8/site-packages/django/core/management/__init__.py", line 413, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python3.8/site-packages/django/core/management/base.py", line 354, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/local/lib/python3.8/site-packages/django/core/management/base.py", line 398, in execute
    output = self.handle(*args, **options)
  File "/home/posthog/code/ee/management/commands/sync_replicated_schema.py", line 52, in handle
    self.create_missing_tables(out_of_sync_hosts, create_table_queries)
  File "/home/posthog/code/ee/management/commands/sync_replicated_schema.py", line 99, in create_missing_tables
    query = create_table_queries[table]
KeyError: 'person_distinct_id2_mv'
Closing the Kafka producer with 0 secs timeout.
Proceeding to force close the producer since pending requests could not be completed within timeout 0.

And after that I decided since I can't seem to get it to run on the old data because the migrations are messed up I'll just reuse an empty disk. I ended up deleting the helm release to "start fresh" and left the undelete PVC since I thought they'd be ok. And that's when I got the error in the original post.

Because it actually works in a blank namespace I assume kafka or maybe redis data would be having something to do with migrations and causing my issues.

hazzadous commented 2 years ago

created a new PV and PVC for clickhouse

Do you happen to have output from the helm command on when this happened, I would be interested to see why it didn't use the existing claim. I don't think there were any changes from 26.0.1 to 26.0.9 regarding PVCs πŸ€”

I'll try to set aside some time to reproduce the issue on Monday if that's ok? It's not the first time this has been an issue, and it's pretty a critical issue.

charlrvd commented 2 years ago

Unfortunately I don't have the outputs anymore. I am not sure if that would be a Chart issue or maybe a Helm blip. It's the only chart I have that I deploy via helm. The rest is handled by ArgoCD and when it's using helm it always templates | apply so the behavior might be different. But I had run upgrades on that chart fine more than a dozen times before, the only difference is that I change the --version X.Y.X so that I don't upgrade while missing a needed version if it's been a while.

Anytime is fine with me. That bug is more informative now. I screwed up by having waited too long to use real backups and I hadn't dig enough in the way the data is stored. So even with my clickhouse disk it seems like I am missing data that must have been in the kafka disks which I cleaned up recently.

tushar5526 commented 1 year ago

Had the same issue while I was deploying to docker swarm. Fixed it by using the right hostname for zookeeper in clickhouse's config. All the migrations were then applied correctly.

Specifically here -> https://github.com/PostHog/posthog/blob/master/docker/clickhouse/config.xml#L695

h1manshu98 commented 1 year ago

Can I get update on this issue ? I'm getting the same error while updating image (upgrading application) Is there any other way than deleting PVC or doing fresh installation ?

can somebody tell me, by which Dockerfile this image is created https://hub.docker.com/r/posthog/posthog-foss ? As we are trying it to be self-hosted. Thanks!

stevenbressey commented 1 year ago

Hello. I also ran on the same error while deploying this chart with an external clickhouse from official Helm chart . This is a fresh installation with no data. Every time the migrate job runs I get the following error: infi.clickhouse_orm.database.ServerError: Code: 60. DB::Exception: Table posthog.infi_clickhouse_orm_migrations doesn't exist

Any idea how to fix this ?

ZILosoft commented 2 months ago

any update?