elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
69.42k stars 24.57k forks source link

Drop default port-scanning discovery behaviour #84563

Open DaveCTurner opened 2 years ago

DaveCTurner commented 2 years ago

Description

Today discovery.seed_hosts defaults to something like 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305 (depending on other config). We default to trying these nearby ports to make it easy to spin up an N-node cluster by just running Elasticsearch N times on a single machine without any further config. In fact this just doesn't work today for a couple of reasons:

  1. by default each node will consume 50% of available RAM (up to 31GiB) so you can't run two of them side-by-side.
  2. by default each node runs with security enabled so even if you do have the memory to run nodes side-by-side the nodes won't trust each other and will form independent clusters.

It's also a bit confusing in a couple of ways:

  1. if you are using a different seed hosts provider: the SettingsBasedSeedHostsProvider is always consulted so its defaults will appear in the discovery will continue using [...] from hosts providers list unless suppressed by setting discovery.seed_hosts: [].
  2. it's only a very limited scan, and only works if it finds master nodes, so even in cases where it does appear to work it won't be reliable.

In practice we believe that folks spinning up multi-node test clusters will mostly be using docker compose these days, or else doing the necessary configuration ceremonies anyway. I think we should therefore change the default to an empty list and require users to be specific about which ports correspond with which nodes.

Relates https://github.com/elastic/elasticsearch/issues/54354

Relates https://github.com/elastic/elasticsearch/issues/34809#issuecomment-1056739248

elasticmachine commented 2 years ago

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner commented 2 years ago

We (the @elastic/es-distributed team) discussed this today and agreed to proceed, although it's a breaking change so needs all the usual due diligence.

DaveCTurner commented 2 years ago

The specific thing I'm suggesting deprecating is as follows:

diff --git a/server/src/main/java/org/elasticsearch/transport/TcpTransport.java b/server/src/main/java/org/elasticsearch/transport/TcpTransport.java
index 884bef3a811..3aa48e71754 100644
--- a/server/src/main/java/org/elasticsearch/transport/TcpTransport.java
+++ b/server/src/main/java/org/elasticsearch/transport/TcpTransport.java
@@ -421,8 +421,17 @@ public abstract class TcpTransport extends AbstractLifecycleComponent implements
         if (NetworkUtils.SUPPORTS_V6) {
             local.add("[::1]"); // may get ports appended!
         }
+        final var defaultPortRange = defaultPortRange();
+        if (defaultPortRange.length > 1) {
+            deprecationLogger.warn(DeprecationCategory.SETTINGS, "discovery_seed_hosts_default", """
+                The default discovery behaviour is deprecated and will change in a future version. Please set [{}] to the addresses of \
+                your master-eligible nodes to prepare for the change in behaviour. If you are not using settings-based discovery then set \
+                [{}] to [], the empty list.""",
+                SettingsBasedSeedHostsProvider.DISCOVERY_SEED_HOSTS_SETTING.getKey(),
+                SettingsBasedSeedHostsProvider.DISCOVERY_SEED_HOSTS_SETTING.getKey());
+        }
         return local.stream()
-            .flatMap(address -> Arrays.stream(defaultPortRange()).limit(LIMIT_LOCAL_PORTS_COUNT).mapToObj(port -> address + ":" + port))
+            .flatMap(address -> Arrays.stream(defaultPortRange).limit(LIMIT_LOCAL_PORTS_COUNT).mapToObj(port -> address + ":" + port))
             .toList();
     }

This means that users who set discovery.seed_hosts would be unaffected, as would users who set transport.port to a specific port. This latter group is unaffected because the port scan only tries ports to which we might bind, but if there's only one port then we must have bound to it so there's no need to scan it. In particular this includes deployments on Elastic Cloud.

For the sake of completeness, this isn't quite right because we scan ports on localhost but might bind to a port with the same number on a different interface. I'm very doubtful that there are any users out there that have multiple nodes on a single host, all binding to different interfaces on the same-numbered port and relying on the localhost-binding node to do all the discovery work. We could detect this with telemetry I guess if we felt it might happen.