Could you be so kind as to include the steps you follow to get to the issue in one coherent and straightforward listing (including which node runs which command and its env)?
Are you running this locally against VMs or where are those IPs coming from? If it's VMs, can you include a vagrant setup for us to reproduce this?
OK, will do. I don't know how to do the vagrant thing, though.
How do you currently run this? Just spin up some linux VM and run commands?
These are EC2 instances running ubuntu 20.04. These are the IPs of interest:
DEVUDB01 devudb01.novcds.io 172.21.61.224
DEVUDB02 devudb02.novcds.io 172.21.67.12
DEVEWT01 devewt01.novcds.io 172.21.58.245
DEVEWT02 devewt02.novcds.io 172.21.69.253
Then let's check the basics first. Are the ports open in the security groups between the instances? Ports 2379 and 2380 are most relevant here.
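A quick way to verify from one of the instances, assuming netcat is installed (addresses taken from the list above; repeat for each peer):
# client and peer ports, from devewt02 to devewt01
nc -vz 172.21.58.245 2379
nc -vz 172.21.58.245 2380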
Yes, this entire system was fully functional before upgrading etcd to 3.5.7. It was also functioning before I sent you these initial bug reports, albeit with those intermittent errors.
First member on devewt02:
root@ip-172-21-66-252:~# eti
DEVUDB01 ip-172-21-61-154 UniDB 01 DEV devudb01.novcds.io 172.21.61.224
DEVUDB02 ip-172-21-66-206 UniDB 02 DEV devudb02.novcds.io 172.21.67.12
DEVEWT01 ip-172-21-60-149 Etcd Witness 01 DEV devewt01.novcds.io 172.21.58.245
DEVEWT02 ip-172-21-66-252 Etcd Witness 02 DEV devewt02.novcds.io 172.21.69.253
root@ip-172-21-66-252:~# pga
root@ip-172-21-66-252:/postgres/admin# cd etcd/
root@ip-172-21-66-252:/postgres/admin/etcd# grep initial devewt02-etcd.yaml
# initial-advertise-peer-urls: http://localhost:2380
initial-advertise-peer-urls: http://172.21.69.253:2380
# DNS domain used to bootstrap initial cluster.
initial-cluster: 'etcd-devewt02=http://172.21.69.253:2380'
#initial-cluster: 'etcd-devewt01=http://172.21.58.245:2380,etcd-devewt02=http://172.21.69.253:2380'
#initial-cluster: 'etcd-devewt01=http://172.21.58.245:2380,etcd-devudb01=http://172.21.61.224:2380'
#initial-cluster: 'etcd-devewt01=http://172.21.58.245:2380,etcd-devudb01=http://172.21.61.224:2380,etcd-devewt02=http://172.21.69.253:2380'
#initial-cluster: 'etcd-devewt02=http://172.21.69.253:2380,etcd-devudb01=http://172.21.61.224:2380'
#initial-cluster: 'etcd-devewt02=http://172.21.69.253:2380,etcd-devudb01=http://172.21.61.224:2380,etcd-devewt01=http://172.21.58.245:2380,etcd-devudb02=http://172.21.67.12:2380'
initial-cluster-token: 'etcd-devewt02'
initial-cluster-state: 'new'
#initial-cluster-state: 'existing'
root@ip-172-21-66-252:/postgres/admin/etcd# env | grep -i endpoint
ENDPOINTS=172.21.58.245:2379,172.21.61.224:2379,172.21.67.12:2379,172.21.69.253:2379
root@ip-172-21-66-252:/postgres/admin/etcd# rm -rf /dcs/*
rm: cannot remove '/dcs/etcd': Device or resource busy
root@ip-172-21-66-252:/postgres/admin/etcd# ssc start etcd
root@ip-172-21-66-252:/postgres/admin/etcd# export ENDPOINTS=172.21.69.253:2379
root@ip-172-21-66-252:/postgres/admin/etcd# psh
postgres 64412 64411 0 13:53 pts/0 00:00:00 -bash
root@ip-172-21-66-252:/postgres/admin/etcd# ssc start etcd
root@ip-172-21-66-252:/postgres/admin/etcd# eee mem list
+-----------------+---------+---------------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+-----------------+---------+---------------+---------------------------+---------------------------+------------+
| ca20b6d1b2f7a1b | started | etcd-devewt02 | http://172.21.69.253:2380 | http://172.21.69.253:2380 | false |
+-----------------+---------+---------------+---------------------------+---------------------------+------------+
root@ip-172-21-66-252:/postgres/admin/etcd# eew mem add etcd-devudb01 --peer-urls=http://172.21.61.224:2380
Member 77437ec2a02c648 added to cluster c0c86d0a394a66f4
ETCD_NAME="etcd-devudb01"
ETCD_INITIAL_CLUSTER="etcd-devudb01=http://172.21.61.224:2380,etcd-devewt02=http://172.21.69.253:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://172.21.61.224:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
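(For reference, etcd also reads that printed block directly as ETCD_* environment variables, so a minimal sketch of bringing the new member up on devudb01 without a config file, assuming the default data dir and the addresses above, would be:
export ETCD_NAME="etcd-devudb01"
export ETCD_INITIAL_CLUSTER="etcd-devudb01=http://172.21.61.224:2380,etcd-devewt02=http://172.21.69.253:2380"
export ETCD_INITIAL_ADVERTISE_PEER_URLS="http://172.21.61.224:2380"
export ETCD_INITIAL_CLUSTER_STATE="existing"
# listen/advertise addresses for this node; note etcd ignores env vars when --config-file is used
etcd --listen-peer-urls http://172.21.61.224:2380 --listen-client-urls http://172.21.61.224:2379 --advertise-client-urls http://172.21.61.224:2379
The same values can instead be translated into the initial-* keys of the yaml, which is what the devudb01 config below does.)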
What is in "/dcs/"?
root@ip-172-21-66-252:/postgres/admin/etcd# rm -rf /dcs/*
rm: cannot remove '/dcs/etcd': Device or resource busy
That certainly doesn't look very good.
Next member on devudb01:
postgres@devudb01:~=> pga
postgres@devudb01:/postgres/admin=> cd etcd/
postgres@devudb01:/postgres/admin/etcd=> grep initial devudb01-etcd.yaml | grep -v ^#
initial-advertise-peer-urls: http://172.21.61.224:2380
initial-cluster: 'etcd-devudb01=http://172.21.61.224:2380,etcd-devewt02=http://172.21.69.253:2380'
initial-cluster-token: 'etcd-devudb01'
initial-cluster-state: 'existing'
postgres@devudb01:/postgres/admin/etcd=> rm -rf /dcs/*
rm: cannot remove '/dcs/etcd': Device or resource busy
postgres@devudb01:/postgres/admin/etcd=> ssc start etcd
Job for etcd.service failed because a timeout was exceeded.
See "systemctl status etcd.service" and "journalctl -xe" for details.
postgres@devudb01:/postgres/admin/etcd=> systemctl status etcd.service
● etcd.service - etcd service
Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: enabled)
Active: failed (Result: timeout) since Wed 2023-04-12 10:26:33 CDT; 15s ago
Docs: https://github.com/etcd-io/etcd
Process: 1248552 ExecStart=/usr/local/bin/etcd --config-file /postgres/admin/etcd/devudb01-etcd.yaml (code=killed, signal=TERM)
Main PID: 1248552 (code=killed, signal=TERM)
postgres@devudb01:/postgres/admin/etcd=> journalctl -xe -uetcd
Hint: You are currently not seeing messages from other users and the system.
Users in groups 'adm', 'systemd-journal' can see all messages.
Pass -q to turn off this notice.
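(That notice just means the postgres user can't read the system journal; the actual etcd failure reason should be visible as root, e.g.:
sudo journalctl -u etcd -n 50 --no-pager
or after adding the user to the systemd-journal group.)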
This has always worked for me in the past, even with 3.5.7. I'm reaching out to our cloud engineers to see if something changed on the backend since yesterday.
It does not appear that anything in the backend has changed.
I tried starting with 3.5.2, the previously installed version - now it's hanging too. That kind of tells me it's not 3.5.7 but something backend-wise. Let me keep trying from a different angle. Any ideas would be appreciated...
rm: cannot remove '/dcs/etcd': Device or resource busy
maybe check what device/disk is mounted under that path and why it's not able to remove anything.
/dcs/etcd is a mount point, so the message is normal
a mount point to what device? is this an EBS block volume? maybe check the syslog on why it's considering itself busy.
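Something along these lines would answer both questions (standard Ubuntu tooling, paths from the output above):
# show which device /dcs/etcd is mounted from
findmnt /dcs/etcd
# clear the data dir contents without trying to remove the mount point itself
find /dcs/etcd -mindepth 1 -delete
The 'Device or resource busy' only means rm tried to remove the mount point directory itself; its contents are still deleted.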
OK, I got it started. I started etcd on the new member (devudb01) with just: initial-advertise-peer-urls: http://172.21.61.224:2380
Then I stopped it, added the other member:
initial-cluster: 'etcd-devewt02=http://172.21.69.253:2380,etcd-devudb01=http://172.21.61.224:2380'
and set the state to: initial-cluster-state: 'existing'
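Worth noting: in the usual member-add flow, the new node is started directly with initial-cluster-state 'existing' and the full member list, i.e. roughly:
initial-advertise-peer-urls: http://172.21.61.224:2380
initial-cluster: 'etcd-devewt02=http://172.21.69.253:2380,etcd-devudb01=http://172.21.61.224:2380'
initial-cluster-state: 'existing'
Bootstrapping it as 'new' first creates an independent single-member cluster in the data dir, which then has to be wiped before the node can join the real cluster.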
now the member list:
postgres@devudb01:/postgres/admin/etcd=> eee mem list
+-----------------+---------+---------------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+-----------------+---------+---------------+---------------------------+---------------------------+------------+
| 77437ec2a02c648 | started | etcd-devudb01 | http://172.21.61.224:2380 | http://172.21.61.224:2379 | false |
| ca20b6d1b2f7a1b | started | etcd-devewt02 | http://172.21.69.253:2380 | http://172.21.69.253:2380 | false |
+-----------------+---------+---------------+---------------------------+---------------------------+------------+
and patroni started:
postgres 1276898 1 0 11:25 ? 00:00:02 /usr/local/bin/etcd --config-file /postgres/admin/etcd/devudb01-etcd.yaml
postgres 1277520 1 0 11:26 ? 00:00:00 /usr/bin/python3 /usr/bin/patroni /postgres/admin/patroni/devudb01-patroni.yaml
postgres 1277544 1 0 11:26 ? 00:00:00 /usr/lib/postgresql/15/bin/postgres -D /postgres/udb01/data --config-file=/postgres/udb01/data/postgresql.conf --listen_addresses=0.0.0.0 --port=5432 --cluster_name=unidb01 --wal_level=replica --hot_standby=on --max_connections=150 --max_wal_senders=10 --max_prepared_transactions=500 --max_locks_per_transaction=64 --track_commit_timestamp=on --max_replication_slots=10 --max_worker_processes=2048 --wal_log_hints=on
Well - I've been able to start some of the system, and dev is now available, using the procedure above. 3 out of 4 worked this way; still trying to get the 4th one up.
this is weird:
postgres@devudb02:/postgres/admin/etcd=> eee mem list
+------------------+-----------+---------------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+-----------+---------------+---------------------------+---------------------------+------------+
| 77437ec2a02c648 | started | etcd-devudb01 | http://172.21.61.224:2380 | http://172.21.61.224:2379 | false |
| ca20b6d1b2f7a1b | started | etcd-devewt02 | http://172.21.69.253:2380 | http://172.21.69.253:2380 | false |
| add745bd637ed2e4 | unstarted | | http://172.21.58.245:2380 | | false |
+------------------+-----------+---------------+---------------------------+---------------------------+------------+
postgres@devudb02:/postgres/admin/etcd=> eee mem list
+------------------+-----------+---------------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+-----------+---------------+---------------------------+---------------------------+------------+
| 77437ec2a02c648 | started | etcd-devudb01 | http://172.21.61.224:2380 | http://172.21.61.224:2379 | false |
| ca20b6d1b2f7a1b | started | etcd-devewt02 | http://172.21.69.253:2380 | http://172.21.69.253:2380 | false |
| add745bd637ed2e4 | unstarted | | http://172.21.58.245:2380 | | false |
+------------------+-----------+---------------+---------------------------+---------------------------+------------+
postgres@devudb02:/postgres/admin/etcd=> eee mem list
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
| 711a89204e99bb72 | started | etcd-devudb02 | http://devudb02.novcds.io:2380 | http://172.21.67.12:2379 | false |
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
postgres@devudb02:/postgres/admin/etcd=> eee mem list
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
| 711a89204e99bb72 | started | etcd-devudb02 | http://devudb02.novcds.io:2380 | http://172.21.67.12:2379 | false |
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
postgres@devudb02:/postgres/admin/etcd=> eee mem list
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
| 711a89204e99bb72 | started | etcd-devudb02 | http://devudb02.novcds.io:2380 | http://172.21.67.12:2379 | false |
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
postgres@devudb02:/postgres/admin/etcd=> eee mem list
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
| 711a89204e99bb72 | started | etcd-devudb02 | http://devudb02.novcds.io:2380 | http://172.21.67.12:2379 | false |
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
postgres@devudb02:/postgres/admin/etcd=> eee mem list
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
| 711a89204e99bb72 | started | etcd-devudb02 | http://devudb02.novcds.io:2380 | http://172.21.67.12:2379 | false |
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
postgres@devudb02:/postgres/admin/etcd=> eee mem list
+------------------+-----------+---------------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+-----------+---------------+---------------------------+---------------------------+------------+
| 77437ec2a02c648 | started | etcd-devudb01 | http://172.21.61.224:2380 | http://172.21.61.224:2379 | false |
| ca20b6d1b2f7a1b | started | etcd-devewt02 | http://172.21.69.253:2380 | http://172.21.69.253:2380 | false |
| add745bd637ed2e4 | unstarted | | http://172.21.58.245:2380 | | false |
+------------------+-----------+---------------+---------------------------+---------------------------+------------+
The same command repeated returns different results.
Well, this has been totally bizarre. Still having issues; the DCS is simply not being built correctly even though parts of the system are starting up. But only after the second time I issue the command.
I've noticed that you've used export ENDPOINTS=172.21.69.253:2379. Are you using it all the time?
Try passing the --debug flag to etcdctl; it will print the endpoint it's connecting to.
Also, I've noticed that the 2380 peer port is sometimes used as a client port.
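For example, querying each endpoint individually (assuming the eee alias wraps etcdctl with $ENDPOINTS) would show which member holds which view:
# one member list per endpoint, instead of whichever endpoint the client happens to pick
for ep in 172.21.58.245:2379 172.21.61.224:2379 172.21.67.12:2379 172.21.69.253:2379; do
  etcdctl --endpoints=$ep member list
done
If the answers differ per endpoint, the nodes have bootstrapped as separate clusters rather than one four-member cluster, which would match the flapping output above.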
Wow - just noticed:
root@ip-172-21-70-177:~# eee mem list
+------------------+---------+---------------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+---------------+---------------------------+---------------------------+------------+
| ca20b6d1b2f7a1b | started | etcd-devewt02 | http://172.21.69.253:2380 | http://172.21.69.253:2380 | false | <===== ??
| 1ae253c946376707 | started | etcd-devewt01 | http://172.21.58.245:2380 | http://172.21.58.245:2379 | false |
| 558e299499e7710b | started | etcd-devudb02 | http://172.21.67.12:2380 | http://172.21.67.12:2379 | false |
| e1903acbcc924eab | started | etcd-devudb01 | http://172.21.61.224:2380 | http://172.21.61.224:2379 | false |
+------------------+---------+---------------+---------------------------+---------------------------+------------+
Belay this: "I don't understand how that happened, I will try to fix that. It's not in the yaml file like that..." It is a typo in the yaml file. I wonder if this had an impact on the 3.5.7 issues. Everything is working fine with 3.5.2. (BTW, I reverted to etcd 3.5.2; I still had some weirdness getting things started again, but I think that was pilot error.)
fixed:
root@ip-172-21-66-252:/postgres/admin/etcd# eee mem list
+------------------+---------+---------------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+---------------+---------------------------+---------------------------+------------+
| ca20b6d1b2f7a1b | started | etcd-devewt02 | http://172.21.69.253:2380 | http://172.21.69.253:2379 | false |
| 1ae253c946376707 | started | etcd-devewt01 | http://172.21.58.245:2380 | http://172.21.58.245:2379 | false |
| 558e299499e7710b | started | etcd-devudb02 | http://172.21.67.12:2380 | http://172.21.67.12:2379 | false |
| e1903acbcc924eab | started | etcd-devudb01 | http://172.21.61.224:2380 | http://172.21.61.224:2379 | false |
+------------------+---------+---------------+---------------------------+---------------------------+------------+
@pgodfrin-nov what was the fix? Port typo?
Can we close the issue?
I think the port typo was an inconsequential error, as the issues persisted whether or not that particular member was configured. The solution was to revert etcd to v3.5.2. Oddly enough, at v3.5.2 everything worked even with the port typo; I never noticed it.
I think the appropriate course of action is to review patroni 3.0.1 and its behavior with etcd v3.5.7, even though the patroni folks have already weighed in (https://github.com/zalando/patroni/issues/2641).
For the record, the etcd and patroni config files (yaml) had zero changes (port typo and all), so of course the gRPC gateway was on. I'm not sure what https://github.com/CyberDem0n was trying to accomplish.
I think patroni doesn't use etcd in exactly the way etcd expects, which might explain why the port typo made no difference. Nevertheless, etcd 3.5.7 was NOT initializing in an expected manner, and in fact I couldn't get it to run properly at all. So perhaps there is actually an issue with 3.5.7 and NOT (just) patroni... I don't know. But I would recommend someone follow up on that.
You may close this issue as far as I'm concerned. I will not be upgrading etcd beyond 3.5.2 for the time being. Regards.
Using 3.5.2 isn't recommended because there was a data inconsistency bug.
"reconfigure the cluster to be 'new' and only with the local instance" - are you changing the --initial-cluster param?
"reconfigure the cluster to 'existing' and added the other members" - are you using etcdctl member add?
"etcd 3.5.7 was NOT initializing in an expected manner and in fact I couldn't get it run properly at all"
I've tried replicating this scenario locally and was able to start a 4-node cluster with the latest 3.5.8:
bin/etcd --name infra1 --listen-client-urls http://127.0.0.1:2379 --advertise-client-urls http://127.0.0.1:2379 --listen-peer-urls http://127.0.0.1:12380 --initial-advertise-peer-urls http://127.0.0.1:12380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380' --initial-cluster-state new --enable-pprof --logger=zap --log-outputs=stderr
./bin/etcdctl member add infra2 --peer-urls=http://127.0.0.1:22380 --endpoints=http://127.0.0.1:2379
bin/etcd --name infra2 --listen-client-urls http://127.0.0.1:22379 --advertise-client-urls http://127.0.0.1:22379 --listen-peer-urls http://127.0.0.1:22380 --initial-advertise-peer-urls http://127.0.0.1:22380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380' --initial-cluster-state existing --enable-pprof --logger=zap --log-outputs=stderr
./bin/etcdctl member add infra3 --peer-urls=http://127.0.0.1:32380 --endpoints=http://127.0.0.1:2379
bin/etcd --name infra3 --listen-client-urls http://127.0.0.1:32379 --advertise-client-urls http://127.0.0.1:32379 --listen-peer-urls http://127.0.0.1:32380 --initial-advertise-peer-urls http://127.0.0.1:32380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state existing --enable-pprof --logger=zap --log-outputs=stderr
./bin/etcdctl member add infra4 --peer-urls=http://127.0.0.1:42380 --endpoints=http://127.0.0.1:2379
bin/etcd --name infra4 --listen-client-urls http://127.0.0.1:42379 --advertise-client-urls http://127.0.0.1:42379 --listen-peer-urls http://127.0.0.1:42380 --initial-advertise-peer-urls http://127.0.0.1:42380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380,infra4=http://127.0.0.1:42380' --initial-cluster-state existing --enable-pprof --logger=zap --log-outputs=stderr
But if you know all your cluster peer urls upfront, it's easier to start with the same --initial-cluster and avoid adding each member.
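In that case every node gets the identical cluster definition and state 'new'; for the infra example above, infra1 would start as follows (and analogously for infra2-4, changing only --name and the ports):
bin/etcd --name infra1 --listen-client-urls http://127.0.0.1:2379 --advertise-client-urls http://127.0.0.1:2379 --listen-peer-urls http://127.0.0.1:12380 --initial-advertise-peer-urls http://127.0.0.1:12380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380,infra4=http://127.0.0.1:42380' --initial-cluster-state new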
I'd recommend experimenting with an etcd cluster locally. Etcd is somewhat strict about peer-urls and the state of the data dir on startup, so it can take a couple of tries to get the config right.
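One common gotcha when retrying: each member keeps its bootstrap identity in its data dir (by default ./<name>.etcd), and on restart etcd reuses that stored configuration and ignores changed --initial-cluster flags, so the stale dirs need removing between attempts:
rm -rf infra1.etcd infra2.etcd infra3.etcd infra4.etcd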
What happened?
Please review comments made in https://github.com/etcd-io/etcd/issues/15700
etcd refuses to initialize, and I don't understand why. My entire DEV system is down and unavailable because I cannot start etcd. Please help.
What did you expect to happen?
etcd to start
How can we reproduce it (as minimally and precisely as possible)?
see notes in https://github.com/etcd-io/etcd/issues/15700
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output