Could you be so kind as to include the steps you follow to get to the issue in one coherent and straightforward listing (including which node runs which command and its env)?
Are you running this locally against VMs or where are those IPs coming from? If it's VMs, can you include a vagrant setup for us to reproduce this?
OK, will do. I don't know how to do the vagrant thing, though.
How do you currently run this? Just spin up some linux VM and run commands?
These are EC2 instances running ubuntu 20.04. These are the IPs of interest:
DEVUDB01 devudb01.novcds.io 172.21.61.224
DEVUDB02 devudb02.novcds.io 172.21.67.12
DEVEWT01 devewt01.novcds.io 172.21.58.245
DEVEWT02 devewt02.novcds.io 172.21.69.253
Then let's check the basics first. Are the ports open in the security groups between the instances? Ports 2379 and 2380 are most relevant here.
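A quick way to verify from one of the instances, assuming netcat is installed (addresses taken from the list above; repeat for each peer):
# client and peer ports, from devewt02 to devewt01
nc -vz 172.21.58.245 2379
nc -vz 172.21.58.245 2380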
Yes, this entire system was fully functional before upgrading etcd to 3.5.7. It was also functioning before I sent you these initial bug reports, albeit with those intermittent errors.
First member on devewt02:
root@ip-172-21-66-252:~# eti
DEVUDB01 ip-172-21-61-154 UniDB 01 DEV devudb01.novcds.io 172.21.61.224
DEVUDB02 ip-172-21-66-206 UniDB 02 DEV devudb02.novcds.io 172.21.67.12
DEVEWT01 ip-172-21-60-149 Etcd Witness 01 DEV devewt01.novcds.io 172.21.58.245
DEVEWT02 ip-172-21-66-252 Etcd Witness 02 DEV devewt02.novcds.io 172.21.69.253
root@ip-172-21-66-252:~# pga
root@ip-172-21-66-252:/postgres/admin# cd etcd/
root@ip-172-21-66-252:/postgres/admin/etcd# grep initial devewt02-etcd.yaml
# initial-advertise-peer-urls: http://localhost:2380
initial-advertise-peer-urls: http://172.21.69.253:2380
# DNS domain used to bootstrap initial cluster.
initial-cluster: 'etcd-devewt02=http://172.21.69.253:2380'
#initial-cluster: 'etcd-devewt01=http://172.21.58.245:2380,etcd-devewt02=http://172.21.69.253:2380'
#initial-cluster: 'etcd-devewt01=http://172.21.58.245:2380,etcd-devudb01=http://172.21.61.224:2380'
#initial-cluster: 'etcd-devewt01=http://172.21.58.245:2380,etcd-devudb01=http://172.21.61.224:2380,etcd-devewt02=http://172.21.69.253:2380'
#initial-cluster: 'etcd-devewt02=http://172.21.69.253:2380,etcd-devudb01=http://172.21.61.224:2380'
#initial-cluster: 'etcd-devewt02=http://172.21.69.253:2380,etcd-devudb01=http://172.21.61.224:2380,etcd-devewt01=http://172.21.58.245:2380,etcd-devudb02=http://172.21.67.12:2380'
initial-cluster-token: 'etcd-devewt02'
initial-cluster-state: 'new'
#initial-cluster-state: 'existing'
root@ip-172-21-66-252:/postgres/admin/etcd# env | grep -i endpoint
ENDPOINTS=172.21.58.245:2379,172.21.61.224:2379,172.21.67.12:2379,172.21.69.253:2379
root@ip-172-21-66-252:/postgres/admin/etcd# rm -rf /dcs/*
rm: cannot remove '/dcs/etcd': Device or resource busy
root@ip-172-21-66-252:/postgres/admin/etcd# ssc start etcd
root@ip-172-21-66-252:/postgres/admin/etcd# export ENDPOINTS=172.21.69.253:2379
root@ip-172-21-66-252:/postgres/admin/etcd# psh
postgres 64412 64411 0 13:53 pts/0 00:00:00 -bash
root@ip-172-21-66-252:/postgres/admin/etcd# ssc start etcd
root@ip-172-21-66-252:/postgres/admin/etcd# eee mem list
+-----------------+---------+---------------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+-----------------+---------+---------------+---------------------------+---------------------------+------------+
| ca20b6d1b2f7a1b | started | etcd-devewt02 | http://172.21.69.253:2380 | http://172.21.69.253:2380 | false |
+-----------------+---------+---------------+---------------------------+---------------------------+------------+
root@ip-172-21-66-252:/postgres/admin/etcd# eew mem add etcd-devudb01 --peer-urls=http://172.21.61.224:2380
Member 77437ec2a02c648 added to cluster c0c86d0a394a66f4
ETCD_NAME="etcd-devudb01"
ETCD_INITIAL_CLUSTER="etcd-devudb01=http://172.21.61.224:2380,etcd-devewt02=http://172.21.69.253:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://172.21.61.224:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
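(For reference, etcd also reads that printed block directly as ETCD_* environment variables, so a minimal sketch of bringing the new member up on devudb01 without a config file, assuming the default data dir and the addresses above, would be:
export ETCD_NAME="etcd-devudb01"
export ETCD_INITIAL_CLUSTER="etcd-devudb01=http://172.21.61.224:2380,etcd-devewt02=http://172.21.69.253:2380"
export ETCD_INITIAL_ADVERTISE_PEER_URLS="http://172.21.61.224:2380"
export ETCD_INITIAL_CLUSTER_STATE="existing"
# listen/advertise addresses for this node; note etcd ignores env vars when --config-file is used
etcd --listen-peer-urls http://172.21.61.224:2380 --listen-client-urls http://172.21.61.224:2379 --advertise-client-urls http://172.21.61.224:2379
The same values can instead be translated into the initial-* keys of the yaml, which is what the devudb01 config below does.)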
What is in "/dcs/"?
root@ip-172-21-66-252:/postgres/admin/etcd# rm -rf /dcs/*
rm: cannot remove '/dcs/etcd': Device or resource busy
That certainly doesn't look very good.
Next member on devudb01:
postgres@devudb01:~=> pga
postgres@devudb01:/postgres/admin=> cd etcd/
postgres@devudb01:/postgres/admin/etcd=> grep initial devudb01-etcd.yaml | grep -v ^#
initial-advertise-peer-urls: http://172.21.61.224:2380
initial-cluster: 'etcd-devudb01=http://172.21.61.224:2380,etcd-devewt02=http://172.21.69.253:2380'
initial-cluster-token: 'etcd-devudb01'
initial-cluster-state: 'existing'
postgres@devudb01:/postgres/admin/etcd=> rm -rf /dcs/*
rm: cannot remove '/dcs/etcd': Device or resource busy
postgres@devudb01:/postgres/admin/etcd=> ssc start etcd
Job for etcd.service failed because a timeout was exceeded.
See "systemctl status etcd.service" and "journalctl -xe" for details.
postgres@devudb01:/postgres/admin/etcd=> systemctl status etcd.service
● etcd.service - etcd service
Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: enabled)
Active: failed (Result: timeout) since Wed 2023-04-12 10:26:33 CDT; 15s ago
Docs: https://github.com/etcd-io/etcd
Process: 1248552 ExecStart=/usr/local/bin/etcd --config-file /postgres/admin/etcd/devudb01-etcd.yaml (code=killed, signal=TERM)
Main PID: 1248552 (code=killed, signal=TERM)
postgres@devudb01:/postgres/admin/etcd=> journalctl -xe -uetcd
Hint: You are currently not seeing messages from other users and the system.
Users in groups 'adm', 'systemd-journal' can see all messages.
Pass -q to turn off this notice.
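(That notice just means the postgres user can't read the system journal; the actual etcd failure reason should be visible as root, e.g.:
sudo journalctl -u etcd -n 50 --no-pager
or after adding the user to the systemd-journal group.)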
This has always worked for me in the past, even with 3.5.7. I'm reaching out to our cloud engineers to see if something changed on the backend since yesterday.
It does not appear that anything in the backend has changed.
I tried starting with 3.5.2, the previously installed version - now it's hanging too. That kind of tells me it's not 3.5.7 but something backend-wise. Let me keep trying from a different angle. Any ideas would be appreciated...
rm: cannot remove '/dcs/etcd': Device or resource busy
maybe check what device/disk is mounted under that path and why it's not able to remove anything.
/dcs/etcd is a mount point, so the message is normal
a mount point to what device? is this an EBS block volume? maybe check the syslog on why it's considering itself busy.
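Something along these lines would answer both questions (standard Ubuntu tooling, paths from the output above):
# show which device /dcs/etcd is mounted from
findmnt /dcs/etcd
# clear the data dir contents without trying to remove the mount point itself
find /dcs/etcd -mindepth 1 -delete
The 'Device or resource busy' only means rm tried to remove the mount point directory itself; its contents are still deleted.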
OK, I got it started. I started etcd on the new member (devudb01) with just: initial-advertise-peer-urls: http://172.21.61.224:2380
Then I stopped it, added the other member:
initial-cluster: 'etcd-devewt02=http://172.21.69.253:2380,etcd-devudb01=http://172.21.61.224:2380'
and set the state to: initial-cluster-state: 'existing'
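Worth noting: in the usual member-add flow, the new node is started directly with initial-cluster-state 'existing' and the full member list, i.e. roughly:
initial-advertise-peer-urls: http://172.21.61.224:2380
initial-cluster: 'etcd-devewt02=http://172.21.69.253:2380,etcd-devudb01=http://172.21.61.224:2380'
initial-cluster-state: 'existing'
Bootstrapping it as 'new' first creates an independent single-member cluster in the data dir, which then has to be wiped before the node can join the real cluster.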
now the member list:
postgres@devudb01:/postgres/admin/etcd=> eee mem list
+-----------------+---------+---------------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+-----------------+---------+---------------+---------------------------+---------------------------+------------+
| 77437ec2a02c648 | started | etcd-devudb01 | http://172.21.61.224:2380 | http://172.21.61.224:2379 | false |
| ca20b6d1b2f7a1b | started | etcd-devewt02 | http://172.21.69.253:2380 | http://172.21.69.253:2380 | false |
+-----------------+---------+---------------+---------------------------+---------------------------+------------+
and patroni started:
postgres 1276898 1 0 11:25 ? 00:00:02 /usr/local/bin/etcd --config-file /postgres/admin/etcd/devudb01-etcd.yaml
postgres 1277520 1 0 11:26 ? 00:00:00 /usr/bin/python3 /usr/bin/patroni /postgres/admin/patroni/devudb01-patroni.yaml
postgres 1277544 1 0 11:26 ? 00:00:00 /usr/lib/postgresql/15/bin/postgres -D /postgres/udb01/data --config-file=/postgres/udb01/data/postgresql.conf --listen_addresses=0.0.0.0 --port=5432 --cluster_name=unidb01 --wal_level=replica --hot_standby=on --max_connections=150 --max_wal_senders=10 --max_prepared_transactions=500 --max_locks_per_transaction=64 --track_commit_timestamp=on --max_replication_slots=10 --max_worker_processes=2048 --wal_log_hints=on
Well - I've been able to start some of the system, and dev is now available, using the procedure above. 3 out of 4 worked this way; still trying to get the 4th one up.
this is weird:
postgres@devudb02:/postgres/admin/etcd=> eee mem list
+------------------+-----------+---------------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+-----------+---------------+---------------------------+---------------------------+------------+
| 77437ec2a02c648 | started | etcd-devudb01 | http://172.21.61.224:2380 | http://172.21.61.224:2379 | false |
| ca20b6d1b2f7a1b | started | etcd-devewt02 | http://172.21.69.253:2380 | http://172.21.69.253:2380 | false |
| add745bd637ed2e4 | unstarted | | http://172.21.58.245:2380 | | false |
+------------------+-----------+---------------+---------------------------+---------------------------+------------+
postgres@devudb02:/postgres/admin/etcd=> eee mem list
+------------------+-----------+---------------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+-----------+---------------+---------------------------+---------------------------+------------+
| 77437ec2a02c648 | started | etcd-devudb01 | http://172.21.61.224:2380 | http://172.21.61.224:2379 | false |
| ca20b6d1b2f7a1b | started | etcd-devewt02 | http://172.21.69.253:2380 | http://172.21.69.253:2380 | false |
| add745bd637ed2e4 | unstarted | | http://172.21.58.245:2380 | | false |
+------------------+-----------+---------------+---------------------------+---------------------------+------------+
postgres@devudb02:/postgres/admin/etcd=> eee mem list
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
| 711a89204e99bb72 | started | etcd-devudb02 | http://devudb02.novcds.io:2380 | http://172.21.67.12:2379 | false |
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
postgres@devudb02:/postgres/admin/etcd=> eee mem list
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
| 711a89204e99bb72 | started | etcd-devudb02 | http://devudb02.novcds.io:2380 | http://172.21.67.12:2379 | false |
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
postgres@devudb02:/postgres/admin/etcd=> eee mem list
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
| 711a89204e99bb72 | started | etcd-devudb02 | http://devudb02.novcds.io:2380 | http://172.21.67.12:2379 | false |
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
postgres@devudb02:/postgres/admin/etcd=> eee mem list
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
| 711a89204e99bb72 | started | etcd-devudb02 | http://devudb02.novcds.io:2380 | http://172.21.67.12:2379 | false |
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
postgres@devudb02:/postgres/admin/etcd=> eee mem list
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
| 711a89204e99bb72 | started | etcd-devudb02 | http://devudb02.novcds.io:2380 | http://172.21.67.12:2379 | false |
+------------------+---------+---------------+--------------------------------+--------------------------+------------+
postgres@devudb02:/postgres/admin/etcd=> eee mem list
+------------------+-----------+---------------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+-----------+---------------+---------------------------+---------------------------+------------+
| 77437ec2a02c648 | started | etcd-devudb01 | http://172.21.61.224:2380 | http://172.21.61.224:2379 | false |
| ca20b6d1b2f7a1b | started | etcd-devewt02 | http://172.21.69.253:2380 | http://172.21.69.253:2380 | false |
| add745bd637ed2e4 | unstarted | | http://172.21.58.245:2380 | | false |
+------------------+-----------+---------------+---------------------------+---------------------------+------------+
The same command repeated returns different results.
Well, this has been totally bizarre. Still having issues; the DCS is simply not being built correctly even though parts of the system are starting up. But only after the second time I issue the command.
I've noticed that you've used export ENDPOINTS=172.21.69.253:2379. Are you using it all the time?
Try passing the --debug flag to etcdctl; it will print the endpoint it's connecting to.
Also, I've noticed that the 2380 peer port is sometimes used as a client port.
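For example, querying each endpoint individually (assuming the eee alias wraps etcdctl with $ENDPOINTS) would show which member holds which view:
# one member list per endpoint, instead of whichever endpoint the client happens to pick
for ep in 172.21.58.245:2379 172.21.61.224:2379 172.21.67.12:2379 172.21.69.253:2379; do
  etcdctl --endpoints=$ep member list
done
If the answers differ per endpoint, the nodes have bootstrapped as separate clusters rather than one four-member cluster, which would match the flapping output above.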
Wow - just noticed:
root@ip-172-21-70-177:~# eee mem list
+------------------+---------+---------------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+---------------+---------------------------+---------------------------+------------+
| ca20b6d1b2f7a1b | started | etcd-devewt02 | http://172.21.69.253:2380 | http://172.21.69.253:2380 | false | <===== ??
| 1ae253c946376707 | started | etcd-devewt01 | http://172.21.58.245:2380 | http://172.21.58.245:2379 | false |
| 558e299499e7710b | started | etcd-devudb02 | http://172.21.67.12:2380 | http://172.21.67.12:2379 | false |
| e1903acbcc924eab | started | etcd-devudb01 | http://172.21.61.224:2380 | http://172.21.61.224:2379 | false |
+------------------+---------+---------------+---------------------------+---------------------------+------------+
Belay this: "I don't understand how that happened, I will try to fix that. It's not in the yaml file like that..." It is a typo in the yaml file. I wonder if this had an impact on the 3.5.7 issues. Everything is working fine with 3.5.2. (BTW, I reverted to etcd 3.5.2; I still had some weirdness getting things started again, but I think that was pilot error.)
fixed:
root@ip-172-21-66-252:/postgres/admin/etcd# eee mem list
+------------------+---------+---------------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+---------------+---------------------------+---------------------------+------------+
| ca20b6d1b2f7a1b | started | etcd-devewt02 | http://172.21.69.253:2380 | http://172.21.69.253:2379 | false |
| 1ae253c946376707 | started | etcd-devewt01 | http://172.21.58.245:2380 | http://172.21.58.245:2379 | false |
| 558e299499e7710b | started | etcd-devudb02 | http://172.21.67.12:2380 | http://172.21.67.12:2379 | false |
| e1903acbcc924eab | started | etcd-devudb01 | http://172.21.61.224:2380 | http://172.21.61.224:2379 | false |
+------------------+---------+---------------+---------------------------+---------------------------+------------+
@pgodfrin-nov what was the fix? Port typo?
Can we close the issue?
I think the port typo was an inconsequential error, as the issues persisted whether or not that particular member was configured. The solution was to revert etcd to v3.5.2. Oddly enough, at v3.5.2 everything worked even with the port typo; I never noticed it.
I think the appropriate course of action is to review patroni 3.0.1 and its behavior with etcd v3.5.7, even though the patroni folks have already weighed in (https://github.com/zalando/patroni/issues/2641).
For the record, the etcd and patroni config files (yaml) had zero changes (port typo and all), so of course the gRPC gateway was on. I'm not sure what https://github.com/CyberDem0n was trying to accomplish.
I think patroni doesn't use etcd in exactly the way etcd expects, which might explain why the port typo made no difference. Nevertheless, etcd 3.5.7 was NOT initializing in an expected manner, and in fact I couldn't get it to run properly at all. So perhaps there is actually an issue with 3.5.7 and NOT (just) patroni... I don't know. But I would recommend someone follow up on that.
You may close this issue as far as I'm concerned. I will not be upgrading etcd beyond 3.5.2 for the time being. Regards.
Using 3.5.2 isn't recommended because there was a data inconsistency bug.
"reconfigure the cluster to be 'new' and only with the local instance" - are you changing the --initial-cluster param?
"reconfigure the cluster to 'existing' and added the other members" - are you using etcdctl member add?
"etcd 3.5.7 was NOT initializing in an expected manner and in fact I couldn't get it run properly at all"
I've tried replicating this scenario locally and was able to start a 4-node cluster with the latest 3.5.8:
bin/etcd --name infra1 --listen-client-urls http://127.0.0.1:2379 --advertise-client-urls http://127.0.0.1:2379 --listen-peer-urls http://127.0.0.1:12380 --initial-advertise-peer-urls http://127.0.0.1:12380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380' --initial-cluster-state new --enable-pprof --logger=zap --log-outputs=stderr
./bin/etcdctl member add infra2 --peer-urls=http://127.0.0.1:22380 --endpoints=http://127.0.0.1:2379
bin/etcd --name infra2 --listen-client-urls http://127.0.0.1:22379 --advertise-client-urls http://127.0.0.1:22379 --listen-peer-urls http://127.0.0.1:22380 --initial-advertise-peer-urls http://127.0.0.1:22380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380' --initial-cluster-state existing --enable-pprof --logger=zap --log-outputs=stderr
./bin/etcdctl member add infra3 --peer-urls=http://127.0.0.1:32380 --endpoints=http://127.0.0.1:2379
bin/etcd --name infra3 --listen-client-urls http://127.0.0.1:32379 --advertise-client-urls http://127.0.0.1:32379 --listen-peer-urls http://127.0.0.1:32380 --initial-advertise-peer-urls http://127.0.0.1:32380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state existing --enable-pprof --logger=zap --log-outputs=stderr
./bin/etcdctl member add infra4 --peer-urls=http://127.0.0.1:42380 --endpoints=http://127.0.0.1:2379
bin/etcd --name infra4 --listen-client-urls http://127.0.0.1:42379 --advertise-client-urls http://127.0.0.1:42379 --listen-peer-urls http://127.0.0.1:42380 --initial-advertise-peer-urls http://127.0.0.1:42380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380,infra4=http://127.0.0.1:42380' --initial-cluster-state existing --enable-pprof --logger=zap --log-outputs=stderr
But if you know all your cluster peer urls upfront, it's easier to start with the same --initial-cluster and avoid adding each member.
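In that case every node gets the identical cluster definition and state 'new'; for the infra example above, infra1 would start as follows (and analogously for infra2-4, changing only --name and the ports):
bin/etcd --name infra1 --listen-client-urls http://127.0.0.1:2379 --advertise-client-urls http://127.0.0.1:2379 --listen-peer-urls http://127.0.0.1:12380 --initial-advertise-peer-urls http://127.0.0.1:12380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380,infra4=http://127.0.0.1:42380' --initial-cluster-state new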
I'd recommend experimenting with an etcd cluster locally. Etcd is somewhat strict about peer-urls and the state of the data dir on startup, so it can take a couple of tries to get the config right.
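One common gotcha when retrying: each member keeps its bootstrap identity in its data dir (by default ./<name>.etcd), and on restart etcd reuses that stored configuration and ignores changed --initial-cluster flags, so the stale dirs need removing between attempts:
rm -rf infra1.etcd infra2.etcd infra3.etcd infra4.etcd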
What happened?
Please review comments made in https://github.com/etcd-io/etcd/issues/15700
etcd refuses to initialize, and I don't understand why. My entire DEV system is down and unavailable because I cannot start etcd. Please help.
What did you expect to happen?
etcd to start
How can we reproduce it (as minimally and precisely as possible)?
see notes in https://github.com/etcd-io/etcd/issues/15700
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output