visla-xugeng opened this issue 2 years ago
@guidoiaquinti Hi Guido, after further review and testing on my side, I think the issue is more related to ClickHouse (not to the external Redis or the external PostgreSQL). The migrate pod cannot connect to the ClickHouse cluster, and as a result the wait-for-service-dependencies container fails to complete.
Do you have any idea here? Thanks
👋 Hi @visla-xugeng! I'm sorry to hear about this issue. Can you please check both CH pods to see if there's anything out of the ordinary in the logs?
We've been tracking a possible regression in the upstream Altinity/clickhouse-operator where a healthy CH pod doesn't get marked as healthy by the operator pod. It happens when the clickhouse.altinity.com/ready=yes label doesn't get created and attached to the CH pod by the operator, which leaves the CH service with no pods associated and makes PostHog unhappy.
Can you please verify the above? The current workaround is to manually kill the operator pod.
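For reference, a minimal version of that check and workaround might look like the sketch below (the namespace matches this thread; the app=clickhouse-operator selector is an assumption, so adjust it to however your operator deployment is labelled):
# does the CH pod carry the ready label the operator is supposed to attach?
kubectl -n posthog get pods -l clickhouse.altinity.com/ready=yes --show-labels
# workaround: delete the operator pod so it gets recreated and re-reconciles the labels
kubectl -n posthog delete pod -l app=clickhouse-operator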
@guidoiaquinti Thanks for your response. I checked the logs and did not see what you mentioned. You can see that my chi-posthog-posthog-0-0-0 pod has clickhouse.altinity.com/ready=yes attached.
kubectl -n posthog describe pod chi-posthog-posthog-0-0-0
Name: chi-posthog-posthog-0-0-0
Namespace: posthog
Priority: 0
Node: ip-10-192-83-121.cn-northwest-1.compute.internal/10.192.83.121
Start Time: Tue, 10 May 2022 10:43:56 -0700
Labels: app.kubernetes.io/managed-by=Helm
clickhouse.altinity.com/app=chop
clickhouse.altinity.com/chi=posthog
clickhouse.altinity.com/cluster=posthog
clickhouse.altinity.com/namespace=posthog
clickhouse.altinity.com/ready=yes
clickhouse.altinity.com/replica=0
clickhouse.altinity.com/settings-version=208b41823c98541c9c4f31abd14664612d884c1a
clickhouse.altinity.com/shard=0
clickhouse.altinity.com/zookeeper-version=006911d48c8f8eb94431829be3125b34ad63f661
controller-revision-hash=chi-posthog-posthog-0-0-78bdf996fb
statefulset.kubernetes.io/pod-name=chi-posthog-posthog-0-0-0
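Since the ready label is present here, the next thing worth confirming is that the ClickHouse service actually lists this pod as an endpoint; a quick check might look like this (the clickhouse-posthog service name is taken from the chart defaults quoted later in this thread):
kubectl -n posthog get svc clickhouse-posthog
kubectl -n posthog get endpoints clickhouse-posthog -o wide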
From the log of chi-posthog-posthog-0-0-0, I see one error about ZooKeeper. However, I did not see any errors in the ZooKeeper pod itself.
2022.05.10 16:26:32.352281 [ 7 ] {} <Information> Application: Ready for connections.
2022.05.10 16:26:35.367684 [ 45 ] {} <Error> virtual bool DB::DDLWorker::initializeMainThread(): Code: 999, e.displayText() = Coordination::Exception: All connection tries failed while connecting to ZooKeeper. nodes: 172.20.202.144:2181
Poco::Exception. Code: 1000, e.code() = 0, e.displayText() = Timeout: connect timed out: 172.20.202.144:2181 (version 21.6.5.37 (official build)), 172.20.202.144:2181
Poco::Exception. Code: 1000, e.code() = 0, e.displayText() = Timeout: connect timed out: 172.20.202.144:2181 (version 21.6.5.37 (official build)), 172.20.202.144:2181
Poco::Exception. Code: 1000, e.code() = 0, e.displayText() = Timeout: connect timed out: 172.20.202.144:2181 (version 21.6.5.37 (official build)), 172.20.202.144:2181
(Connection loss), Stack trace (when copying this message, always include the lines below):
0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0x8b6cbba in /usr/bin/clickhouse
1. Coordination::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, Coordination::Error, int) @ 0x107c1635 in /usr/bin/clickhouse
2. Coordination::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, Coordination::Error) @ 0x107c18c2 in /usr/bin/clickhouse
3. Coordination::ZooKeeper::connect(std::__1::vector<Coordination::ZooKeeper::Node, std::__1::allocator<Coordination::ZooKeeper::Node> > const&, Poco::Timespan) @ 0x1080132b in /usr/bin/clickhouse
4. Coordination::ZooKeeper::ZooKeeper(std::__1::vector<Coordination::ZooKeeper::Node, std::__1::allocator<Coordination::ZooKeeper::Node> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, Poco::Timespan, Poco::Timespan, Poco::Timespan) @ 0x107ffb7b in /usr/bin/clickhouse
5. zkutil::ZooKeeper::init(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 0x107c403e in /usr/bin/clickhouse
6. zkutil::ZooKeeper::ZooKeeper(Poco::Util::AbstractConfiguration const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 0x107c5db6 in /usr/bin/clickhouse
7. void std::__1::allocator_traits<std::__1::allocator<zkutil::ZooKeeper> >::__construct<zkutil::ZooKeeper, Poco::Util::AbstractConfiguration const&, char const (&) [10]>(std::__1::integral_constant<bool, true>, std::__1::allocator<zkutil::ZooKeeper>&, zkutil::ZooKeeper*, Poco::Util::AbstractConfiguration const&, char const (&) [10]) @ 0xf54ac37 in /usr/bin/clickhouse
8. DB::Context::getZooKeeper() const @ 0xf5267a5 in /usr/bin/clickhouse
9. DB::DDLWorker::getAndSetZooKeeper() @ 0xf56b3d6 in /usr/bin/clickhouse
10. DB::DDLWorker::initializeMainThread() @ 0xf580155 in /usr/bin/clickhouse
11. DB::DDLWorker::runMainThread() @ 0xf569091 in /usr/bin/clickhouse
12. ThreadFromGlobalPool::ThreadFromGlobalPool<void (DB::DDLWorker::*)(), DB::DDLWorker*>(void (DB::DDLWorker::*&&)(), DB::DDLWorker*&&)::'lambda'()::operator()() @ 0xf581131 in /usr/bin/clickhouse
13. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0x8bacedf in /usr/bin/clickhouse
14. ? @ 0x8bb0403 in /usr/bin/clickhouse
15. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
16. __clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
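To rule out plain network or DNS problems between ClickHouse and ZooKeeper, a quick connectivity test from inside the CH pod might look like the sketch below (the address is the one reported in the error above; bash's /dev/tcp is used so the check works even if nc is not installed in the image, and the timeout binary is assumed to be available):
kubectl -n posthog exec -it chi-posthog-posthog-0-0-0 -- \
  timeout 3 bash -c 'cat < /dev/null > /dev/tcp/172.20.202.144/2181 && echo "zookeeper port reachable"'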
Log from ZooKeeper (I assume ZooKeeper is up and ready):
zookeeper 16:26:24.97
zookeeper 16:26:24.97 Welcome to the Bitnami zookeeper container
zookeeper 16:26:24.97 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-zookeeper
zookeeper 16:26:24.97 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-zookeeper/issues
zookeeper 16:26:24.97
zookeeper 16:26:24.98 INFO ==> ** Starting ZooKeeper setup **
zookeeper 16:26:25.00 WARN ==> You have set the environment variable ALLOW_ANONYMOUS_LOGIN=yes. For safety reasons, do not use this flag in a production environment.
zookeeper 16:26:25.04 INFO ==> Initializing ZooKeeper...
zookeeper 16:26:25.04 INFO ==> No injected configuration file found, creating default config files...
zookeeper 16:26:25.10 INFO ==> No additional servers were specified. ZooKeeper will run in standalone mode...
zookeeper 16:26:25.11 INFO ==> Deploying ZooKeeper from scratch...
zookeeper 16:26:25.12 INFO ==> ** ZooKeeper setup finished! **
zookeeper 16:26:25.14 INFO ==> ** Starting ZooKeeper **
/opt/bitnami/java/bin/java
ZooKeeper JMX enabled by default
Using config: /opt/bitnami/zookeeper/bin/../conf/zoo.cfg
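To double-check the ZooKeeper side beyond its startup log, something along these lines should work with the Bitnami image shown above (<zookeeper-pod> is a placeholder for the actual pod name):
kubectl -n posthog get pods | grep -i zookeeper
kubectl -n posthog exec -it <zookeeper-pod> -- /opt/bitnami/zookeeper/bin/zkServer.sh status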
Update: the error in the chi-posthog-posthog-0-0-0 log mentioned above was gone after I rebooted the pod. No more connection timeout issues in this pod.
Troubleshooting came back to the original point, so let me re-organize my findings here.
1: PgBouncer still logs lots of errors. Based on https://github.com/pgbouncer/pgbouncer/issues/323, it looks like this error message does not matter, but I am not sure. Several pods (events, plugins, web and worker) still use PgBouncer as a proxy to connect to the external PostgreSQL. Please help verify whether this error message has any impact here.
2022-05-10 21:41:04.348 UTC [1] LOG C-0x7f23577a6510: (nodb)/(nouser)@10.192.158.50:36584 closing because: client unexpected eof (age=0s)
2022-05-10 21:41:14.346 UTC [1] LOG C-0x7f23577a6510: (nodb)/(nouser)@10.192.158.50:36738 closing because: client unexpected eof (age=0s)
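For what it's worth, PgBouncer usually logs this line whenever a client opens a TCP connection and closes it again without sending a PostgreSQL startup packet, for example a kubelet tcpSocket probe or a plain port check, so it may well be harmless here. The same log line can be provoked deliberately with the kind of check the chart's init containers already use:
# opening and closing the port with no handshake should produce
# "closing because: client unexpected eof (age=0s)" on the PgBouncer side
nc -vz posthog-pgbouncer.posthog.svc.cluster.local 6543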
2: The job posthog-migrate-2022-05-10-09-24-27--1-zdlwx is still in Running status and never completes. It has two containers. The first, wait-for-service-dependencies, is terminated and completed, which I think is good. The second, migrate-job, is still running but produces no logs. I think something is stuck, but I could not figure out what.
kubectl -n posthog logs posthog-migrate-2022-05-10-09-24-27--1-zdlwx migrate-job -f
###nothing is here
kubectl -n posthog logs posthog-migrate-2022-05-10-09-24-27--1-zdlwx wait-for-service-dependencies -f
1
posthog-pgbouncer.posthog.svc.cluster.local (172.20.168.143:6543) open
posthog-posthog-kafka.posthog.svc.cluster.local (172.20.126.179:9092) open
3: The other pods (events, plugins, web, worker) are all in Init:1/2 status. Each has three containers: only wait-for-service-dependencies is completed and terminated, while the other two, wait-for-migrations (running) and posthog-events (waiting), are not. I think this is related to the posthog-migrate-2022-05-10-09-24-27--1-zdlwx pod: since it cannot finish its job, all the other pods keep waiting and never come up.
In general, based on the analysis above, I think the issue is in the posthog-migrate-2022-05-10-09-24-27--1-zdlwx pod, but I cannot dig any deeper. Do you have any idea?
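A few commands that may help narrow down where the job is stuck, using the names from above (the Job name is assumed to be the pod name without its --1-zdlwx suffix):
kubectl -n posthog describe job posthog-migrate-2022-05-10-09-24-27
kubectl -n posthog get events --sort-by=.lastTimestamp | grep -i migrate
# see what the migrate-job container is actually doing right now
kubectl -n posthog exec -it posthog-migrate-2022-05-10-09-24-27--1-zdlwx -c migrate-job -- ps -ef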
Can you share a kubectl describe of the migration pod, and the full YAML manifest? You can also try kubectl editing the container's command definition and adding set -x to get the lines it is running printed to the output. Then we can see where it is getting stuck.
describe of migrate pod
kubectl -n posthog describe pod posthog-migrate-2022-05-12-01-07-44--1-6qjr7
Name: posthog-migrate-2022-05-12-01-07-44--1-6qjr7
Namespace: posthog
Priority: 0
Node: ip-10-192-189-39.xxxxxxxxxxxxxxxxxxxxxxx/10.192.189.39
Start Time: Thu, 12 May 2022 01:11:46 -0700
Labels: app=posthog
controller-uid=5687728c-d1b7-4ebe-b1c6-2d9c4ab39e4c
job-name=posthog-migrate-2022-05-12-01-07-44
release=posthog
Annotations: checksum/secrets.yaml: 6f81c380a81222648dd3b0b9c26b8c298d7de9c7023ddb7d5729638b16f68eba
kubernetes.io/psp: eks.privileged
Status: Running
IP: 10.192.172.153
IPs:
IP: 10.192.172.153
Controlled By: Job/posthog-migrate-2022-05-12-01-07-44
Init Containers:
wait-for-service-dependencies:
Container ID: docker://90a494c4bffd7ba13470509b5b3a74daa1d6e4d065c581b5c616f7b7eeeb337e
Image: busybox:1.34
Image ID: docker-pullable://busybox@sha256:d2b53584f580310186df7a2055ce3ff83cc0df6caacf1e3489bff8cf5d0af5d8
Port: <none>
Host Port: <none>
Command:
/bin/sh
-c
until (
wget -qO- \
"http://$CLICKHOUSE_USER:$CLICKHOUSE_PASSWORD@clickhouse-posthog.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local:8123" \
--post-data "SELECT count() FROM clusterAllReplicas('posthog', system, one)"
); do
echo "waiting for ClickHouse cluster to become available"; sleep 1;
done
until (nc -vz "posthog-pgbouncer.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local" 6543); do
echo "waiting for PgBouncer"; sleep 1;
done
KAFKA_BROKERS="posthog-posthog-kafka:9092"
KAFKA_HOST=$(echo $KAFKA_BROKERS | cut -f1 -d:) KAFKA_PORT=$(echo $KAFKA_BROKERS | cut -f2 -d:)
until (nc -vz "$KAFKA_HOST.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local" $KAFKA_PORT); do
echo "waiting for Kafka"; sleep 1;
done
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 12 May 2022 01:11:52 -0700
Finished: Thu, 12 May 2022 01:13:07 -0700
Ready: True
Restart Count: 0
Environment:
CLICKHOUSE_HOST: clickhouse-posthog
CLICKHOUSE_CLUSTER: posthog
CLICKHOUSE_DATABASE: posthog
CLICKHOUSE_USER: admin
CLICKHOUSE_PASSWORD: YYYYYYYYYYYYYYYYYYYYYYYYYY
CLICKHOUSE_SECURE: false
CLICKHOUSE_VERIFY: false
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8gfjm (ro)
Containers:
migrate-job:
Container ID: docker://2be41a6ec2018c58e42c58d9516a5a072e8b1d7643a7454d77a5e88c138fd218
Image: posthog/posthog:release-1.35.0
Image ID: docker-pullable://posthog/posthog@sha256:b0f3cfa8e259fbd98a9d219f3470888763dac157f636af7b541120adc70ab378
Port: <none>
Host Port: <none>
Command:
/bin/sh
-c
set -e
python manage.py notify_helm_install || true
./bin/migrate
State: Running
Started: Thu, 12 May 2022 01:14:18 -0700
Ready: True
Restart Count: 0
Environment:
KAFKA_ENABLED: true
KAFKA_HOSTS: posthog-posthog-kafka:9092
KAFKA_URL: kafka://posthog-posthog-kafka:9092
POSTHOG_REDIS_HOST: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (I am using an external Redis)
POSTHOG_REDIS_PORT: 6379
POSTHOG_REDIS_PASSWORD: <set to the key 'redis-password' in secret 'posthog-posthog-redis-external'> Optional: false
SENTRY_DSN:
SITE_URL: https://posthog.XXXXXXXXX
DEPLOYMENT: helm_aws_ha
SECRET_KEY: <set to the key 'posthog-secret' in secret 'posthog'> Optional: false
PRIMARY_DB: clickhouse
POSTHOG_POSTGRES_HOST: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (I am using an external Postgresql)
POSTHOG_POSTGRES_PORT: 5432
POSTHOG_DB_USER: posthog_service
POSTHOG_DB_NAME: posthog
POSTHOG_DB_PASSWORD: <set to the key 'postgresql-password' in secret 'posthog-external'> Optional: false
USING_PGBOUNCER: false
CLICKHOUSE_HOST: clickhouse-posthog
CLICKHOUSE_CLUSTER: posthog
CLICKHOUSE_DATABASE: posthog
CLICKHOUSE_USER: admin
CLICKHOUSE_PASSWORD: YYYYYYYYYYYYYYYYY
CLICKHOUSE_SECURE: false
CLICKHOUSE_VERIFY: false
EMAIL_HOST: XXXXXXXXXXXXX (AWS EMAIL Service)
EMAIL_PORT: 465
EMAIL_HOST_USER: XXXXXXXXXXXXX
EMAIL_HOST_PASSWORD: <set to the key 'smtp-password' in secret 'posthog'> Optional: false
EMAIL_USE_TLS: false
EMAIL_USE_SSL: true
DEFAULT_FROM_EMAIL: no-reply@visla.us
CAPTURE_INTERNAL_METRICS: true
HELM_INSTALL_INFO: {"chart_version":"18.3.1","cloud":"aws","deployment_type":"helm","hostname":"posthog.XXXXXXXXXXXXX","ingress_type":"","kube_version":"v1.22.6-eks-7d68063","operation":"install","release_name":"posthog","release_revision":1}
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8gfjm (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-8gfjm:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
values.yaml: I am using Terraform (similar to the helm install command) to install this chart. I tested both methods and got the same result, so I am putting the values.yaml file here. I only changed the external Redis, the external PostgreSQL and the node affinity; all other settings are left at their defaults.
fullnameOverride: ${local.posthog_full_name}
cloud: "aws"
# we will create our own ALB instead
ingress:
enabled: false
# for SITE_URL
hostname: posthog.${var.dns_base_domain}
cert-manager:
enabled: true
email:
host: XXXXXXXXXXXXXXXX
port: 465
user: XXXXXXXXXXXXXXXX
password: XXXXXXXXXXXXXXXX
use_tls: false
use_ssl: true
from_email: XXXXXXXXXXXXXXXX
web:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: XXXXXXXXXXXXXXXX/node-group
operator: In
values:
- ${local.node_group_name}
worker:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: XXXXXXXXXXXXXXXX/node-group
operator: In
values:
- ${local.node_group_name}
plugins:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: XXXXXXXXXXXXXXXX/node-group
operator: In
values:
- ${local.node_group_name}
postgresql:
enabled: false
externalPostgresql:
postgresqlHost: XXXXXXXXXXXXXXXX
postgresqlPort: 5432
postgresqlDatabase: posthog
pgbouncer:
enabled: true
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: XXXXXXXXXXXXXXXX/node-group
operator: In
values:
- ${local.node_group_name}
redis:
enabled: false
externalRedis:
host: XXXXXXXXXXXXXXXX
port: 6379
password: XXXXXXXXXXXXXXXX
kafka:
persistence:
size: 40Gi
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: XXXXXXXXXXXXXXXX/node-group
operator: In
values:
- ${local.node_group_name}
zookeeper:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: XXXXXXXXXXXXXXXX/node-group
operator: In
values:
- ${local.node_group_name}
clickhouse:
namespace: posthog
persistence:
size: 40Gi
password: ${random_password.clickhouse_password.result}
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: XXXXXXXXXXXXXXXX/node-group
operator: In
values:
- ${local.node_group_name}
hooks:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: XXXXXXXXXXXXXXXX/node-group
operator: In
values:
- ${local.node_group_name}
@hazzadous Regarding editing the command section to add set -x: I tried several times and all attempts failed. When I logged in to the migrate pod, I did not see any logs under the /var/log directory. I found one interesting thing about the migrate file: it does not contain notify_helm_install, and I am not sure whether that is critical. For details, please check the output below.
bash-5.1$ pwd
/home/posthog/code
bash-5.1$ ps -ef|grep posthog
1 posthog 0:00 /bin/sh -c set -e python manage.py notify_helm_install || true ./bin/migrate
7 posthog 0:00 python manage.py notify_helm_install
9 posthog 0:00 sh -c clear; (bash || ash || sh)
16 posthog 0:00 bash
38 posthog 0:00 ps -ef
39 posthog 0:00 grep posthog
# under /home/posthog/code
bash-5.1$ ls
babel.config.js frontend package.json posthog staticfiles yarn.lock
bin gunicorn.config.py plugin-server requirements-dev.txt tsconfig.json
ee manage.py postcss.config.js requirements.txt webpack.config.js
# under /home/posthog/code/bin
bash-5.1$ ls |grep migrate
migrate
docker-migrate
migrate-check
bash-5.1$ cat migrate
#!/bin/bash
set -e
python manage.py migrate
python manage.py migrate_clickhouse
python manage.py run_async_migrations --check
python manage.py sync_replicated_schema
Thanks 🙏
Re editing: indeed, you'd need to apply this as a new manifest, as command is immutable.
It’s late here in London, I’ll have a look in the morning.
One thing that is obviously interesting is that you have both postgresql enabled and externalPostgresql settings, which isn't the typical use, and I'm not sure about the behaviour there.
Oops no that’s pgbouncer! I’ll look in more detail in the morning!
One last thing that may not be relevant, but I'll mention it anyway: the migrations do not run via PgBouncer IIRC, so you'll need to make sure security groups are set up such that Postgres is directly open to the migration pods.
Although it looks like you are using the same node groups?
@hazzadous Thanks a lot. I will update more details later. (Have a good night.)
😊
@hazzadous Some updates:
1: I set postgresql.enabled to false and then enabled the external PostgreSQL:
postgresql:
enabled: false
2: I double-checked the security group of my external PostgreSQL; it allows traffic from the whole pod subnet, so the migration pod should be able to connect to the external PostgreSQL directly (a quick way to confirm this from inside the cluster is sketched below).
3: Yes, all of the PostHog pods are in the same node group.
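A possible check, under the assumption that the host, user and database shown in the migrate pod description above are correct (the pg-test pod name and the placeholder host are purely illustrative):
kubectl -n posthog run pg-test --rm -it --restart=Never --image=postgres:14 -- \
  psql "host=<your-postgres-host> port=5432 user=posthog_service dbname=posthog" -c 'SELECT 1;'
# psql will prompt for the password interactively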
@visla-xugeng Interestingly, from your ps output, python manage.py notify_helm_install is currently running. IIRC this will be trying to phone home to app.posthog.com with self-hosting version details. I'm not sure what the failure state is if it can't.
I would expect this to run very quickly. You can verify this by updating the job definition (I'm not sure if you can edit it in place or if you'd need to create a new one) and removing the notify line.
If that works then we need to figure out a way to make it not hang. It could be security groups, my go-to whenever something is hanging!
(You could also, while in the migrate pod, run the migration part manually.)
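In case it helps, a minimal version of that manual test, using only the paths and commands already shown earlier in this thread:
# from a shell inside the migrate pod
cd /home/posthog/code
# if this hangs, the phone-home step is the culprit; interrupt it with Ctrl-C
python manage.py notify_helm_install
# then run the actual migrations directly, bypassing the notify step
./bin/migrate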
@hazzadous Thanks for your update. Does the migrate pod also need to access the Postgres DB directly? I logged in to the migrate pod's shell and tried to use psql, but the command is not found in this pod. If you can provide more thoughts on the security groups, that would be very helpful. I checked the DB's security group and it allows my whole private subnets (all my pods and worker nodes are in the private subnets). I will check this part further.
@visla-xugeng Yes, it accesses PostgreSQL directly. It doesn't have psql installed; you should be able to install it, but you might need to use an image that allows you to install things on it. But you could just try running the ./bin/migrate command. If that manages to do something useful then we just need to get rid of that notify_helm_install line.
@hazzadous I found one interesting thing. Remember that I am using an external Redis and an external PostgreSQL when I install PostHog, and that is when I get the issue. Today I tried skipping the external Redis and staying with the internal one (while keeping the external PostgreSQL), and the installation went through smoothly without any problem. The migration pod ran as expected, and all pods are now up and running.
##ONLY enable internal redis
redis:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: xxxxxx/node-group
operator: In
values:
- ${local.node_group_name}
## redis setting in the describe of migration pod:
POSTHOG_REDIS_HOST: posthog-posthog-redis-master
POSTHOG_REDIS_PORT: 6379
##disable the internal redis and enable the external redis
redis:
enabled: false
externalRedis:
host: master.xxxxxxxxxx-redis-mr-cluster.xxxxxxx.cache.amazonaws.com
password: XXXXXXX
##redis setting in the describe of migrate pod:
POSTHOG_REDIS_HOST: master.xxxxxxxxxx-redis-mr-cluster.xxxxxxx.cache.amazonaws.com
POSTHOG_REDIS_PORT: 6379
I tested this several times: whenever I ONLY use the internal Redis, the installation finishes without any problem; if I enable the external Redis and disable the internal one, the installation fails. Any thoughts here? Thanks.
@visla-xugeng OK, it sounds like the thing to do now is to debug the connectivity between the migration pod and the external Redis. Can you spin up a pod in local.node_group_name and try to connect? If not, then make sure security groups are set up and that you're able to route to the Redis address.
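A disposable client pod is probably the quickest way to do that; a sketch, assuming the official redis image (whose redis-cli supports --tls) and with placeholders for the endpoint and password:
kubectl -n posthog run redis-test --rm -it --restart=Never --image=redis:6.2 -- \
  redis-cli -h <your-elasticache-endpoint> -p 6379 -a <password> --tls ping
# a successful reply is PONG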
@hazzadous I checked the security group and it looks good. I logged in to a node in the same node group as the migrate pod, and I can access the external Redis without a problem.
### on the node, I followed the AWS instructions
sudo amazon-linux-extras install epel -y
sudo yum install gcc jemalloc-devel openssl-devel tcl tcl-devel -y
sudo wget http://download.redis.io/redis-stable.tar.gz
sudo tar xvzf redis-stable.tar.gz
cd redis-stable
sudo make BUILD_TLS=yes
### then ran command below
src/redis-cli -h master.xxxxxxxxx.cache.amazonaws.com --tls -a yyyyyyyyyyyy -p 6379
When I build a new Redis without a password, the PostHog chart installs without a problem.
So how does the migrate pod handle the password of the external Redis (and how is the TLS setting handled)?
##my external redis config
redis:
enabled: false
externalRedis:
host: XXXXXXXXXXXXXXXX
port: 6379
password: XXXXXXXXXXXXXXXX
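One hedged observation: if the ElastiCache cluster has in-transit encryption enabled, a client that does not speak TLS will typically hang or fail on connect, which looks a lot like what the migrate pod is doing. The contrast is easy to see with the redis-cli built on the node earlier (same placeholders as above):
# without --tls against a TLS-only endpoint this usually hangs or errors out
src/redis-cli -h master.xxxxxxxxx.cache.amazonaws.com -a yyyyyyyyyyyy -p 6379 ping
# with --tls it succeeds, as shown above
src/redis-cli -h master.xxxxxxxxx.cache.amazonaws.com -a yyyyyyyyyyyy -p 6379 --tls ping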
I don't think we have any setting for Redis TLS, AFAIK. That would need to be added to the chart/application, although I'd need to look closer to verify that.
@hazzadous I attached a screenshot of my Redis, which has the password disabled.
@visla-xugeng I know it's not an ideal solution, but would using the provided in-cluster Redis be acceptable, at least for now? There shouldn't be anything in there that requires durability, so moving to ElastiCache later would be relatively straightforward.
Having said that, it's still very annoying that it's not working. I have this working on our cluster 🤔
@hazzadous Thanks for the quick update. I will switch to the internal Redis. Hope you can figure out why the external Redis did not work as expected.
I am getting this issue as well when trying a fresh install of PostHog via the Helm chart...
Bug description
I tried to install PostHog in a brand-new EKS cluster on AWS using Helm commands, but several pods remain in Init:1/2 status.
Expected behavior
All pods should be up and running.
Actual behavior
Several pods are in Init:1/2 status and show some errors.
How to reproduce
Follow the instructions and you can reproduce these errors.
Environment
I deployed the chart in EKS on AWS
Additional context
Logs from pod posthog-migrate, container wait-for-service-dependencies
Logs from pod posthog-events, container wait-for-service-dependencies