visla-xugeng opened this issue 2 years ago
@guidoiaquinti Hi Guido, after further review and testing on my side, I think the issue is more related to ClickHouse (not to the external Redis or the external PostgreSQL). The migrate pod cannot connect to the ClickHouse cluster, and as a result the wait-for-service-dependencies container fails to complete.
Do you have any idea here? Thanks
👋 Hi @visla-xugeng! I'm sorry to hear about this issue. Can you please check both CH pods to see if there's anything out of the ordinary in the logs?
We've been tracking a possible regression in the upstream Altinity/clickhouse-operator where a healthy CH pod doesn't get marked as healthy by the operator pod. It happens when the clickhouse.altinity.com/ready=yes label doesn't get created and attached to the CH pod by the operator, which leaves the CH service with no pods associated and makes PostHog unhappy.
Can you please verify the above? The current workaround is to manually kill the operator pod.
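For reference, a minimal version of that check and workaround might look like the sketch below (the namespace matches this thread; the app=clickhouse-operator selector is an assumption, so adjust it to however your operator deployment is labelled):
# does the CH pod carry the ready label the operator is supposed to attach?
kubectl -n posthog get pods -l clickhouse.altinity.com/ready=yes --show-labels
# workaround: delete the operator pod so it gets recreated and re-reconciles the labels
kubectl -n posthog delete pod -l app=clickhouse-operator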
@guidoiaquinti Thanks for your response. I checked the logs and did not see what you mentioned. You can see that my chi-posthog-posthog-0-0-0 pod has clickhouse.altinity.com/ready=yes attached.
kubectl -n posthog describe pod chi-posthog-posthog-0-0-0
Name: chi-posthog-posthog-0-0-0
Namespace: posthog
Priority: 0
Node: ip-10-192-83-121.cn-northwest-1.compute.internal/10.192.83.121
Start Time: Tue, 10 May 2022 10:43:56 -0700
Labels: app.kubernetes.io/managed-by=Helm
clickhouse.altinity.com/app=chop
clickhouse.altinity.com/chi=posthog
clickhouse.altinity.com/cluster=posthog
clickhouse.altinity.com/namespace=posthog
clickhouse.altinity.com/ready=yes
clickhouse.altinity.com/replica=0
clickhouse.altinity.com/settings-version=208b41823c98541c9c4f31abd14664612d884c1a
clickhouse.altinity.com/shard=0
clickhouse.altinity.com/zookeeper-version=006911d48c8f8eb94431829be3125b34ad63f661
controller-revision-hash=chi-posthog-posthog-0-0-78bdf996fb
statefulset.kubernetes.io/pod-name=chi-posthog-posthog-0-0-0
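Since the ready label is present here, the next thing worth confirming is that the ClickHouse service actually lists this pod as an endpoint; a quick check might look like this (the clickhouse-posthog service name is taken from the chart defaults quoted later in this thread):
kubectl -n posthog get svc clickhouse-posthog
kubectl -n posthog get endpoints clickhouse-posthog -o wide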
From the log of chi-posthog-posthog-0-0-0, I see one error about ZooKeeper. However, I did not see any errors in the ZooKeeper pod itself.
2022.05.10 16:26:32.352281 [ 7 ] {} <Information> Application: Ready for connections.
2022.05.10 16:26:35.367684 [ 45 ] {} <Error> virtual bool DB::DDLWorker::initializeMainThread(): Code: 999, e.displayText() = Coordination::Exception: All connection tries failed while connecting to ZooKeeper. nodes: 172.20.202.144:2181
Poco::Exception. Code: 1000, e.code() = 0, e.displayText() = Timeout: connect timed out: 172.20.202.144:2181 (version 21.6.5.37 (official build)), 172.20.202.144:2181
Poco::Exception. Code: 1000, e.code() = 0, e.displayText() = Timeout: connect timed out: 172.20.202.144:2181 (version 21.6.5.37 (official build)), 172.20.202.144:2181
Poco::Exception. Code: 1000, e.code() = 0, e.displayText() = Timeout: connect timed out: 172.20.202.144:2181 (version 21.6.5.37 (official build)), 172.20.202.144:2181
(Connection loss), Stack trace (when copying this message, always include the lines below):
0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0x8b6cbba in /usr/bin/clickhouse
1. Coordination::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, Coordination::Error, int) @ 0x107c1635 in /usr/bin/clickhouse
2. Coordination::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, Coordination::Error) @ 0x107c18c2 in /usr/bin/clickhouse
3. Coordination::ZooKeeper::connect(std::__1::vector<Coordination::ZooKeeper::Node, std::__1::allocator<Coordination::ZooKeeper::Node> > const&, Poco::Timespan) @ 0x1080132b in /usr/bin/clickhouse
4. Coordination::ZooKeeper::ZooKeeper(std::__1::vector<Coordination::ZooKeeper::Node, std::__1::allocator<Coordination::ZooKeeper::Node> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, Poco::Timespan, Poco::Timespan, Poco::Timespan) @ 0x107ffb7b in /usr/bin/clickhouse
5. zkutil::ZooKeeper::init(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 0x107c403e in /usr/bin/clickhouse
6. zkutil::ZooKeeper::ZooKeeper(Poco::Util::AbstractConfiguration const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 0x107c5db6 in /usr/bin/clickhouse
7. void std::__1::allocator_traits<std::__1::allocator<zkutil::ZooKeeper> >::__construct<zkutil::ZooKeeper, Poco::Util::AbstractConfiguration const&, char const (&) [10]>(std::__1::integral_constant<bool, true>, std::__1::allocator<zkutil::ZooKeeper>&, zkutil::ZooKeeper*, Poco::Util::AbstractConfiguration const&, char const (&) [10]) @ 0xf54ac37 in /usr/bin/clickhouse
8. DB::Context::getZooKeeper() const @ 0xf5267a5 in /usr/bin/clickhouse
9. DB::DDLWorker::getAndSetZooKeeper() @ 0xf56b3d6 in /usr/bin/clickhouse
10. DB::DDLWorker::initializeMainThread() @ 0xf580155 in /usr/bin/clickhouse
11. DB::DDLWorker::runMainThread() @ 0xf569091 in /usr/bin/clickhouse
12. ThreadFromGlobalPool::ThreadFromGlobalPool<void (DB::DDLWorker::*)(), DB::DDLWorker*>(void (DB::DDLWorker::*&&)(), DB::DDLWorker*&&)::'lambda'()::operator()() @ 0xf581131 in /usr/bin/clickhouse
13. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0x8bacedf in /usr/bin/clickhouse
14. ? @ 0x8bb0403 in /usr/bin/clickhouse
15. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
16. __clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
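To rule out plain network or DNS problems between ClickHouse and ZooKeeper, a quick connectivity test from inside the CH pod might look like the sketch below (the address is the one reported in the error above; bash's /dev/tcp is used so the check works even if nc is not installed in the image, and the timeout binary is assumed to be available):
kubectl -n posthog exec -it chi-posthog-posthog-0-0-0 -- \
  timeout 3 bash -c 'cat < /dev/null > /dev/tcp/172.20.202.144/2181 && echo "zookeeper port reachable"'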
Log from ZooKeeper (I assume ZooKeeper is up and ready):
zookeeper 16:26:24.97
zookeeper 16:26:24.97 Welcome to the Bitnami zookeeper container
zookeeper 16:26:24.97 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-zookeeper
zookeeper 16:26:24.97 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-zookeeper/issues
zookeeper 16:26:24.97
zookeeper 16:26:24.98 INFO ==> ** Starting ZooKeeper setup **
zookeeper 16:26:25.00 WARN ==> You have set the environment variable ALLOW_ANONYMOUS_LOGIN=yes. For safety reasons, do not use this flag in a production environment.
zookeeper 16:26:25.04 INFO ==> Initializing ZooKeeper...
zookeeper 16:26:25.04 INFO ==> No injected configuration file found, creating default config files...
zookeeper 16:26:25.10 INFO ==> No additional servers were specified. ZooKeeper will run in standalone mode...
zookeeper 16:26:25.11 INFO ==> Deploying ZooKeeper from scratch...
zookeeper 16:26:25.12 INFO ==> ** ZooKeeper setup finished! **
zookeeper 16:26:25.14 INFO ==> ** Starting ZooKeeper **
/opt/bitnami/java/bin/java
ZooKeeper JMX enabled by default
Using config: /opt/bitnami/zookeeper/bin/../conf/zoo.cfg
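To double-check the ZooKeeper side beyond its startup log, something along these lines should work with the Bitnami image shown above (<zookeeper-pod> is a placeholder for the actual pod name):
kubectl -n posthog get pods | grep -i zookeeper
kubectl -n posthog exec -it <zookeeper-pod> -- /opt/bitnami/zookeeper/bin/zkServer.sh status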
Update: the error in the chi-posthog-posthog-0-0-0 log mentioned above was gone after I rebooted the pod. No more connection timeout issues in this pod.
Troubleshooting came back to the original point, so let me re-organize my findings here.
1: PgBouncer still logs lots of errors. Based on https://github.com/pgbouncer/pgbouncer/issues/323, it looks like this error message does not matter, but I am not sure. Several pods (events, plugins, web and worker) still use PgBouncer as a proxy to connect to the external PostgreSQL. Please help verify whether this error message has any impact here.
2022-05-10 21:41:04.348 UTC [1] LOG C-0x7f23577a6510: (nodb)/(nouser)@10.192.158.50:36584 closing because: client unexpected eof (age=0s)
2022-05-10 21:41:14.346 UTC [1] LOG C-0x7f23577a6510: (nodb)/(nouser)@10.192.158.50:36738 closing because: client unexpected eof (age=0s)
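For what it's worth, PgBouncer usually logs this line whenever a client opens a TCP connection and closes it again without sending a PostgreSQL startup packet, for example a kubelet tcpSocket probe or a plain port check, so it may well be harmless here. The same log line can be provoked deliberately with the kind of check the chart's init containers already use:
# opening and closing the port with no handshake should produce
# "closing because: client unexpected eof (age=0s)" on the PgBouncer side
nc -vz posthog-pgbouncer.posthog.svc.cluster.local 6543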
2: The job posthog-migrate-2022-05-10-09-24-27--1-zdlwx is still in Running status and never completes. It has two containers. The first, wait-for-service-dependencies, is terminated and completed, which I think is good. The second, migrate-job, is still running but produces no logs. I think something is stuck, but I could not figure out what.
kubectl -n posthog logs posthog-migrate-2022-05-10-09-24-27--1-zdlwx migrate-job -f
###nothing is here
kubectl -n posthog logs posthog-migrate-2022-05-10-09-24-27--1-zdlwx wait-for-service-dependencies -f
1
posthog-pgbouncer.posthog.svc.cluster.local (172.20.168.143:6543) open
posthog-posthog-kafka.posthog.svc.cluster.local (172.20.126.179:9092) open
3: The other pods (events, plugins, web, worker) are all in Init:1/2 status. Each has three containers: only wait-for-service-dependencies is completed and terminated, while the other two, wait-for-migrations (running) and posthog-events (waiting), are not. I think this is related to the posthog-migrate-2022-05-10-09-24-27--1-zdlwx pod: since it cannot finish its job, all the other pods keep waiting and never come up.
In general, based on the analysis above, I think the issue is in the posthog-migrate-2022-05-10-09-24-27--1-zdlwx pod, but I cannot dig any deeper. Do you have any idea?
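A few commands that may help narrow down where the job is stuck, using the names from above (the Job name is assumed to be the pod name without its --1-zdlwx suffix):
kubectl -n posthog describe job posthog-migrate-2022-05-10-09-24-27
kubectl -n posthog get events --sort-by=.lastTimestamp | grep -i migrate
# see what the migrate-job container is actually doing right now
kubectl -n posthog exec -it posthog-migrate-2022-05-10-09-24-27--1-zdlwx -c migrate-job -- ps -ef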
Can you share a kubectl describe of the migration pod, and the full YAML manifest? You can also try kubectl editing the container's command definition and adding set -x to get the lines it is running printed to the output. Then we can see where it is getting stuck.
describe of migrate pod
kubectl -n posthog describe pod posthog-migrate-2022-05-12-01-07-44--1-6qjr7
Name: posthog-migrate-2022-05-12-01-07-44--1-6qjr7
Namespace: posthog
Priority: 0
Node: ip-10-192-189-39.xxxxxxxxxxxxxxxxxxxxxxx/10.192.189.39
Start Time: Thu, 12 May 2022 01:11:46 -0700
Labels: app=posthog
controller-uid=5687728c-d1b7-4ebe-b1c6-2d9c4ab39e4c
job-name=posthog-migrate-2022-05-12-01-07-44
release=posthog
Annotations: checksum/secrets.yaml: 6f81c380a81222648dd3b0b9c26b8c298d7de9c7023ddb7d5729638b16f68eba
kubernetes.io/psp: eks.privileged
Status: Running
IP: 10.192.172.153
IPs:
IP: 10.192.172.153
Controlled By: Job/posthog-migrate-2022-05-12-01-07-44
Init Containers:
wait-for-service-dependencies:
Container ID: docker://90a494c4bffd7ba13470509b5b3a74daa1d6e4d065c581b5c616f7b7eeeb337e
Image: busybox:1.34
Image ID: docker-pullable://busybox@sha256:d2b53584f580310186df7a2055ce3ff83cc0df6caacf1e3489bff8cf5d0af5d8
Port: <none>
Host Port: <none>
Command:
/bin/sh
-c
until (
wget -qO- \
"http://$CLICKHOUSE_USER:$CLICKHOUSE_PASSWORD@clickhouse-posthog.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local:8123" \
--post-data "SELECT count() FROM clusterAllReplicas('posthog', system, one)"
); do
echo "waiting for ClickHouse cluster to become available"; sleep 1;
done
until (nc -vz "posthog-pgbouncer.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local" 6543); do
echo "waiting for PgBouncer"; sleep 1;
done
KAFKA_BROKERS="posthog-posthog-kafka:9092"
KAFKA_HOST=$(echo $KAFKA_BROKERS | cut -f1 -d:) KAFKA_PORT=$(echo $KAFKA_BROKERS | cut -f2 -d:)
until (nc -vz "$KAFKA_HOST.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local" $KAFKA_PORT); do
echo "waiting for Kafka"; sleep 1;
done
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 12 May 2022 01:11:52 -0700
Finished: Thu, 12 May 2022 01:13:07 -0700
Ready: True
Restart Count: 0
Environment:
CLICKHOUSE_HOST: clickhouse-posthog
CLICKHOUSE_CLUSTER: posthog
CLICKHOUSE_DATABASE: posthog
CLICKHOUSE_USER: admin
CLICKHOUSE_PASSWORD: YYYYYYYYYYYYYYYYYYYYYYYYYY
CLICKHOUSE_SECURE: false
CLICKHOUSE_VERIFY: false
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8gfjm (ro)
Containers:
migrate-job:
Container ID: docker://2be41a6ec2018c58e42c58d9516a5a072e8b1d7643a7454d77a5e88c138fd218
Image: posthog/posthog:release-1.35.0
Image ID: docker-pullable://posthog/posthog@sha256:b0f3cfa8e259fbd98a9d219f3470888763dac157f636af7b541120adc70ab378
Port: <none>
Host Port: <none>
Command:
/bin/sh
-c
set -e
python manage.py notify_helm_install || true
./bin/migrate
State: Running
Started: Thu, 12 May 2022 01:14:18 -0700
Ready: True
Restart Count: 0
Environment:
KAFKA_ENABLED: true
KAFKA_HOSTS: posthog-posthog-kafka:9092
KAFKA_URL: kafka://posthog-posthog-kafka:9092
POSTHOG_REDIS_HOST: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (I am using an external Redis)
POSTHOG_REDIS_PORT: 6379
POSTHOG_REDIS_PASSWORD: <set to the key 'redis-password' in secret 'posthog-posthog-redis-external'> Optional: false
SENTRY_DSN:
SITE_URL: https://posthog.XXXXXXXXX
DEPLOYMENT: helm_aws_ha
SECRET_KEY: <set to the key 'posthog-secret' in secret 'posthog'> Optional: false
PRIMARY_DB: clickhouse
POSTHOG_POSTGRES_HOST: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (I am using an external Postgresql)
POSTHOG_POSTGRES_PORT: 5432
POSTHOG_DB_USER: posthog_service
POSTHOG_DB_NAME: posthog
POSTHOG_DB_PASSWORD: <set to the key 'postgresql-password' in secret 'posthog-external'> Optional: false
USING_PGBOUNCER: false
CLICKHOUSE_HOST: clickhouse-posthog
CLICKHOUSE_CLUSTER: posthog
CLICKHOUSE_DATABASE: posthog
CLICKHOUSE_USER: admin
CLICKHOUSE_PASSWORD: YYYYYYYYYYYYYYYYY
CLICKHOUSE_SECURE: false
CLICKHOUSE_VERIFY: false
EMAIL_HOST: XXXXXXXXXXXXX (AWS EMAIL Service)
EMAIL_PORT: 465
EMAIL_HOST_USER: XXXXXXXXXXXXX
EMAIL_HOST_PASSWORD: <set to the key 'smtp-password' in secret 'posthog'> Optional: false
EMAIL_USE_TLS: false
EMAIL_USE_SSL: true
DEFAULT_FROM_EMAIL: no-reply@visla.us
CAPTURE_INTERNAL_METRICS: true
HELM_INSTALL_INFO: {"chart_version":"18.3.1","cloud":"aws","deployment_type":"helm","hostname":"posthog.XXXXXXXXXXXXX","ingress_type":"","kube_version":"v1.22.6-eks-7d68063","operation":"install","release_name":"posthog","release_revision":1}
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8gfjm (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-8gfjm:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
values.yaml: I am using Terraform (similar to the helm install command) to install this chart. I tested both methods and got the same result, so I am putting the values.yaml file here. I only changed the external Redis, the external PostgreSQL and the node affinity; all other settings are left at their defaults.
fullnameOverride: ${local.posthog_full_name}
cloud: "aws"
# we will create our own ALB instead
ingress:
enabled: false
# for SITE_URL
hostname: posthog.${var.dns_base_domain}
cert-manager:
enabled: true
email:
host: XXXXXXXXXXXXXXXX
port: 465
user: XXXXXXXXXXXXXXXX
password: XXXXXXXXXXXXXXXX
use_tls: false
use_ssl: true
from_email: XXXXXXXXXXXXXXXX
web:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: XXXXXXXXXXXXXXXX/node-group
operator: In
values:
- ${local.node_group_name}
worker:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: XXXXXXXXXXXXXXXX/node-group
operator: In
values:
- ${local.node_group_name}
plugins:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: XXXXXXXXXXXXXXXX/node-group
operator: In
values:
- ${local.node_group_name}
postgresql:
enabled: false
externalPostgresql:
postgresqlHost: XXXXXXXXXXXXXXXX
postgresqlPort: 5432
postgresqlDatabase: posthog
pgbouncer:
enabled: true
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: XXXXXXXXXXXXXXXX/node-group
operator: In
values:
- ${local.node_group_name}
redis:
enabled: false
externalRedis:
host: XXXXXXXXXXXXXXXX
port: 6379
password: XXXXXXXXXXXXXXXX
kafka:
persistence:
size: 40Gi
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: XXXXXXXXXXXXXXXX/node-group
operator: In
values:
- ${local.node_group_name}
zookeeper:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: XXXXXXXXXXXXXXXX/node-group
operator: In
values:
- ${local.node_group_name}
clickhouse:
namespace: posthog
persistence:
size: 40Gi
password: ${random_password.clickhouse_password.result}
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: XXXXXXXXXXXXXXXX/node-group
operator: In
values:
- ${local.node_group_name}
hooks:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: XXXXXXXXXXXXXXXX/node-group
operator: In
values:
- ${local.node_group_name}
@hazzadous Regarding editing the command section to add set -x: I tried several times and all attempts failed. When I logged in to the migrate pod, I did not see any logs under the /var/log directory. I found one interesting thing about the migrate file: it does not contain notify_helm_install, and I am not sure whether that is critical. For details, please check the output below.
bash-5.1$ pwd
/home/posthog/code
bash-5.1$ ps -ef|grep posthog
1 posthog 0:00 /bin/sh -c set -e python manage.py notify_helm_install || true ./bin/migrate
7 posthog 0:00 python manage.py notify_helm_install
9 posthog 0:00 sh -c clear; (bash || ash || sh)
16 posthog 0:00 bash
38 posthog 0:00 ps -ef
39 posthog 0:00 grep posthog
# under /home/posthog/code
bash-5.1$ ls
babel.config.js frontend package.json posthog staticfiles yarn.lock
bin gunicorn.config.py plugin-server requirements-dev.txt tsconfig.json
ee manage.py postcss.config.js requirements.txt webpack.config.js
# under /home/posthog/code/bin
bash-5.1$ ls |grep migrate
migrate
docker-migrate
migrate-check
bash-5.1$ cat migrate
#!/bin/bash
set -e
python manage.py migrate
python manage.py migrate_clickhouse
python manage.py run_async_migrations --check
python manage.py sync_replicated_schema
Thanks 🙏
Re editing: indeed, you'd need to apply this as a new manifest, as command is immutable.
It’s late here in London, I’ll have a look in the morning.
One thing that is obviously interesting is that you have both postgresql enabled and externalPostgresql settings, which isn't the typical use, and I'm not sure about the behaviour there.
Oops no that’s pgbouncer! I’ll look in more detail in the morning!
One last thing that may not be relevant, but I'll mention it anyway: the migrations do not run via PgBouncer IIRC, so you'll need to make sure security groups are set up such that Postgres is directly open to the migration pods.
Although it looks like you are using the same node groups?
@hazzadous Thanks a lot. I will update more details later. (Have a good night.)
😊
@hazzadous Some updates:
1: I set postgresql.enabled to false and then enabled the external PostgreSQL:
postgresql:
enabled: false
2: I double-checked the security group of my external PostgreSQL; it allows traffic from the whole pod subnet, so the migration pod should be able to connect to the external PostgreSQL directly (a quick way to confirm this from inside the cluster is sketched below).
3: Yes, all of the PostHog pods are in the same node group.
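A possible check, under the assumption that the host, user and database shown in the migrate pod description above are correct (the pg-test pod name and the placeholder host are purely illustrative):
kubectl -n posthog run pg-test --rm -it --restart=Never --image=postgres:14 -- \
  psql "host=<your-postgres-host> port=5432 user=posthog_service dbname=posthog" -c 'SELECT 1;'
# psql will prompt for the password interactively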
@visla-xugeng Interestingly, from your ps output, python manage.py notify_helm_install is currently running. IIRC this will be trying to phone home to app.posthog.com with self-hosting version details. I'm not sure what the failure state is if it can't.
I would expect this to run very quickly. You can verify this by updating the job definition (I'm not sure if you can edit it in place or if you'd need to create a new one) and removing the notify line.
If that works then we need to figure out a way to make it not hang. It could be security groups, my go-to whenever something is hanging!
(You could also, while in the migrate pod, run the migration part manually.)
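In case it helps, a minimal version of that manual test, using only the paths and commands already shown earlier in this thread:
# from a shell inside the migrate pod
cd /home/posthog/code
# if this hangs, the phone-home step is the culprit; interrupt it with Ctrl-C
python manage.py notify_helm_install
# then run the actual migrations directly, bypassing the notify step
./bin/migrate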
@hazzadous Thanks for your update. Does the migrate pod also need to access the Postgres DB directly? I logged in to the migrate pod's shell and tried to use psql, but the command is not found in this pod. If you can provide more thoughts on the security groups, that would be very helpful. I checked the DB's security group and it allows my whole private subnets (all my pods and worker nodes are in the private subnets). I will check this part further.
@visla-xugeng Yes, it accesses PostgreSQL directly. It doesn't have psql installed; you should be able to install it, but you might need to use an image that allows you to install things on it. But you could just try running the ./bin/migrate command. If that manages to do something useful then we just need to get rid of that notify_helm_install line.
@hazzadous I found one interesting thing. Remember that I am using an external Redis and an external PostgreSQL when I install PostHog, and that is when I get the issue. Today I tried skipping the external Redis and staying with the internal one (while keeping the external PostgreSQL), and the installation went through smoothly without any problem. The migration pod ran as expected, and all pods are now up and running.
##ONLY enable internal redis
redis:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: xxxxxx/node-group
operator: In
values:
- ${local.node_group_name}
## redis setting in the describe of migration pod:
POSTHOG_REDIS_HOST: posthog-posthog-redis-master
POSTHOG_REDIS_PORT: 6379
##disable the internal redis and enable the external redis
redis:
enabled: false
externalRedis:
host: master.xxxxxxxxxx-redis-mr-cluster.xxxxxxx.cache.amazonaws.com
password: XXXXXXX
##redis setting in the describe of migrate pod:
POSTHOG_REDIS_HOST: master.xxxxxxxxxx-redis-mr-cluster.xxxxxxx.cache.amazonaws.com
POSTHOG_REDIS_PORT: 6379
I tested this several times: whenever I ONLY use the internal Redis, the installation finishes without any problem; if I enable the external Redis and disable the internal one, the installation fails. Any thoughts here? Thanks.
@visla-xugeng OK, it sounds like the thing to do now is to debug the connectivity between the migration pod and the external Redis. Can you spin up a pod in local.node_group_name and try to connect? If not, then make sure security groups are set up and that you're able to route to the Redis address.
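A disposable client pod is probably the quickest way to do that; a sketch, assuming the official redis image (whose redis-cli supports --tls) and with placeholders for the endpoint and password:
kubectl -n posthog run redis-test --rm -it --restart=Never --image=redis:6.2 -- \
  redis-cli -h <your-elasticache-endpoint> -p 6379 -a <password> --tls ping
# a successful reply is PONG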
@hazzadous I checked the security group and it looks good. I logged in to a node in the same node group as the migrate pod, and I can access the external Redis without a problem.
### on the node, I followed the AWS instructions
sudo amazon-linux-extras install epel -y
sudo yum install gcc jemalloc-devel openssl-devel tcl tcl-devel -y
sudo wget http://download.redis.io/redis-stable.tar.gz
sudo tar xvzf redis-stable.tar.gz
cd redis-stable
sudo make BUILD_TLS=yes
### then ran command below
src/redis-cli -h master.xxxxxxxxx.cache.amazonaws.com --tls -a yyyyyyyyyyyy -p 6379
When I build a new Redis without a password, the PostHog chart installs without a problem.
So how does the migrate pod handle the password of the external Redis (and how is the TLS setting handled)?
##my external redis config
redis:
enabled: false
externalRedis:
host: XXXXXXXXXXXXXXXX
port: 6379
password: XXXXXXXXXXXXXXXX
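One hedged observation: if the ElastiCache cluster has in-transit encryption enabled, a client that does not speak TLS will typically hang or fail on connect, which looks a lot like what the migrate pod is doing. The contrast is easy to see with the redis-cli built on the node earlier (same placeholders as above):
# without --tls against a TLS-only endpoint this usually hangs or errors out
src/redis-cli -h master.xxxxxxxxx.cache.amazonaws.com -a yyyyyyyyyyyy -p 6379 ping
# with --tls it succeeds, as shown above
src/redis-cli -h master.xxxxxxxxx.cache.amazonaws.com -a yyyyyyyyyyyy -p 6379 --tls ping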
I don't think we have any setting for Redis TLS, AFAIK. That would need to be added to the chart/application, although I'd need to look closer to verify that.
@hazzadous I attached a screenshot of my Redis, which has the password disabled.
@visla-xugeng I know it's not an ideal solution, but would using the provided in-cluster Redis be acceptable, at least for now? There shouldn't be anything in there that requires durability, so moving to ElastiCache later would be relatively straightforward.
Having said that, it's still very annoying that it's not working. I have this working on our cluster 🤔
@hazzadous Thanks for the quick update. I will switch to the internal Redis. Hope you can figure out why the external Redis did not work as expected.
I am getting this issue as well when trying a fresh install of PostHog via the Helm chart...
Bug description
I tried to install PostHog in a brand-new EKS cluster on AWS using Helm commands, but several pods remain in Init:1/2 status.
Expected behavior
All pods should be up and running.
Actual behavior
Several pods are in Init:1/2 status and show some errors.
How to reproduce
Follow the instructions and you can reproduce these errors.
Environment
I deployed the chart in EKS on AWS
Additional context
Logs from pod posthog-migrate, container wait-for-service-dependencies
Logs from pod posthog-events, container wait-for-service-dependencies