elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Numerous openjdk22 failures around thirdPartyAudit or connectivity or elasticsearch-keystore issues #103963

Closed kingherc closed 7 months ago

kingherc commented 9 months ago

CI Link

https://gradle-enterprise.elastic.co/s/iamuhpqjsu2mc

Repro line

N/A

Does it reproduce?

Didn't try

Applicable branches

main, 8.12 etc.

Failure history

No response

Failure excerpt

Since yesterday, there seem to be a lot of build scan failures around openjdk22. I am opening a single ticket in the hope there's some single cause for all of them.

Here are some examples of the broad types I've observed.

elasticsearch / periodic / main / openjdk22 / checkPart1 / java-matrix

WARNING: While scanning classes to check, the following referenced classes were not found on classpath (this may miss some violations):
WARNING:   jdk.incubator.vector.ByteVector, jdk.incubator.vector.FloatVector, jdk.incubator.vector.IntVector,... (and 5 more).
==end of forbidden APIs==
Missing classes:
  * jdk.incubator.vector.ByteVector
  * jdk.incubator.vector.FloatVector
  * jdk.incubator.vector.IntVector
  * jdk.incubator.vector.ShortVector
  * jdk.incubator.vector.Vector
  * jdk.incubator.vector.VectorOperators
  * jdk.incubator.vector.VectorShape
  * jdk.incubator.vector.VectorSpecies

elasticsearch / periodic / main / openjdk22 / checkRestCompat / java-matrix

        Caused by:  
        java.io.IOException: 408 Request Timeout

elasticsearch / periodic / main / openjdk22 / bwcTestSnapshots / java-matrix

java.io.IOException: 503 Service Unavailable

elasticsearch / periodic / 8.12 / openjdk22 / checkRestCompat / java-matrix

        Caused by:  
        java.lang.RuntimeException: Execution of elasticsearch-keystore failed with exit code 1

elasticsearch / periodic / 8.11 / openjdk22 / checkPart3 / java-matrix

    org.elasticsearch.discovery.MasterNotDiscoveredException    
        at app//org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:317) 
        at app//org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:355)   
        at app//org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:293)    
        at app//org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:645)   
        at app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:916)    
        at java.base@22-ea/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)  
        at java.base@22-ea/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)  
        at java.base@22-ea/java.lang.Thread.run(Thread.java:1570)
        Caused by:  
        org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];   
            at app//org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:188)    
...

And more like the above.

elasticsearchmachine commented 9 months ago

Pinging @elastic/es-delivery (Team:Delivery)

elasticsearchmachine commented 9 months ago

Pinging @elastic/es-core-infra (Team:Core/Infra)

mark-vieira commented 9 months ago

@rjernst Looks like we have a number of issues on Java 22 EA builds.

mark-vieira commented 9 months ago

The third party audit issue should be resolved by https://github.com/elastic/elasticsearch/pull/104010.
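
For background, the jdk.incubator.vector classes listed above live in an incubating module that the JVM only resolves when it is requested explicitly, which is presumably why a plain classpath scan cannot find them. A minimal usage sketch, illustrative only and not taken from the Elasticsearch or Lucene sources:

// Compile and run with: --add-modules jdk.incubator.vector
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

public class VectorAddDemo {
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f};
        float[] b = {8f, 7f, 6f, 5f, 4f, 3f, 2f, 1f};
        float[] sum = new float[a.length];
        // Process the arrays one SIMD lane-group at a time, masking the tail.
        for (int i = 0; i < a.length; i += SPECIES.length()) {
            VectorMask<Float> mask = SPECIES.indexInRange(i, a.length);
            FloatVector va = FloatVector.fromArray(SPECIES, a, i, mask);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i, mask);
            va.add(vb).intoArray(sum, i, mask);
        }
        System.out.println(java.util.Arrays.toString(sum));
    }
}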

mark-vieira commented 9 months ago

The cluster timeouts are definitely odd. For some reason the cluster never reports as yellow/green, and the logs don't show much either. The _health_report endpoint returned the following.

The _cluster/health endpoint returned a status of red, yet below we see a status of unknown. Not sure what that means.

{
  "status": "unknown",
  "cluster_name": "yamlRestTest",
  "indicators": {
    "master_is_stable": {
      "status": "green",
      "symptom": "The cluster has a stable master node",
      "details": {
        "current_master": {
          "node_id": "LQw9CYzFSv6iqnLqK_1xzw",
          "name": "yamlRestTest-0"
        },
        "recent_masters": [
          {
            "node_id": "LQw9CYzFSv6iqnLqK_1xzw",
            "name": "yamlRestTest-0"
          }
        ]
      }
    },
    "repository_integrity": {
      "status": "green",
      "symptom": "No snapshot repositories configured."
    },
    "disk": {
      "status": "unknown",
      "symptom": "No disk usage data."
    },
    "shards_capacity": {
      "status": "unknown",
      "symptom": "Unable to determine shard capacity status."
    },
    "shards_availability": {
      "status": "green",
      "symptom": "This cluster has all shards available.",
      "details": {
        "unassigned_primaries": 0,
        "restarting_replicas": 0,
        "creating_primaries": 0,
        "initializing_replicas": 0,
        "unassigned_replicas": 0,
        "started_primaries": 0,
        "restarting_primaries": 0,
        "initializing_primaries": 0,
        "creating_replicas": 0,
        "started_replicas": 0
      }
    },
    "data_stream_lifecycle": {
      "status": "green",
      "symptom": "Data streams are executing their lifecycles without issues",
      "details": {
        "stagnating_backing_indices_count": 0,
        "total_backing_indices_in_error": 0
      }
    },
    "slm": {
      "status": "green",
      "symptom": "No Snapshot Lifecycle Management policies configured",
      "details": {
        "slm_status": "RUNNING",
        "policies": 0
      }
    },
    "ilm": {
      "status": "green",
      "symptom": "No Index Lifecycle Management policies configured",
      "details": {
        "policies": 0,
        "stagnating_indices": 0,
        "ilm_status": "RUNNING"
      }
    }
  }
}
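
For anyone poking at a similarly stuck cluster, both endpoints can be queried directly over HTTP; a minimal sketch using the JDK HTTP client (the http://localhost:9200 address is an assumption for a locally started test cluster, not taken from the CI setup):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ClusterDiagnostics {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Plain GETs against the cluster health and health-report APIs.
        for (String path : new String[] { "/_cluster/health", "/_health_report" }) {
            HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:9200" + path)).GET().build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(path + " -> HTTP " + response.statusCode());
            System.out.println(response.body());
        }
    }
}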

mark-vieira commented 9 months ago

We see these messages but I presume they are benign. Or perhaps Lucene doesn't actually play nicely with Java 22 at the moment?

[2024-01-05T15:16:36,770][WARN ][o.a.l.s.MMapDirectory    ] [yamlRestTest-0] You are running with Java 22 or later. To make full use of MMapDirectory, please update Apache Lucene.

mark-vieira commented 9 months ago

The keystore issues should be resolved by https://github.com/elastic/elasticsearch/pull/104102. It's possible that's the root cause of many other failures as well.

mark-vieira commented 9 months ago

Ok, the keystore issue is not what I thought it was. The timeout issues are still the main problem here, and that's because the cluster never becomes yellow. I really need someone else to investigate this. There's nothing in the logs, and the _health_report isn't giving me useful information.

mark-vieira commented 9 months ago

I'm seeing a few suspect pending tasks on the cluster:

{
  "tasks": [
    {
      "insert_order": 6,
      "priority": "URGENT",
      "source": "reconfigure",
      "executing": false,
      "time_in_queue_millis": 339916,
      "time_in_queue": "5.6m"
    },
    {
      "insert_order": 2,
      "priority": "HIGH",
      "source": "cluster_reroute(post-join reroute)",
      "executing": false,
      "time_in_queue_millis": 339916,
      "time_in_queue": "5.6m"
    },
    {
      "insert_order": 3,
      "priority": "NORMAL",
      "source": "update snapshot after shards changed [false] or node configuration changed [true]",
      "executing": false,
      "time_in_queue_millis": 339916,
      "time_in_queue": "5.6m"
    },
    {
      "insert_order": 4,
      "priority": "NORMAL",
      "source": "SnapshotsService#updateNodeIdsToRemove",
      "executing": false,
      "time_in_queue_millis": 339916,
      "time_in_queue": "5.6m"
    },
    {
      "insert_order": 5,
      "priority": "NORMAL",
      "source": "local-gateway-elected-state",
      "executing": false,
      "time_in_queue_millis": 339916,
      "time_in_queue": "5.6m"
    }
  ]
}

rjernst commented 9 months ago

I found this reproduces simply with ./gradlew run. I ran it with RUNTIME_JAVA_HOME pointing to jdk 22 and got similar pending tasks that never execute. Tracing this with the debugger, it seems like the tasks are submitted to the executor but then never execute.

@DaveCTurner do you have any ideas on debugging further? I don't see why these queues would suddenly stop processing tasks, since they processed tasks earlier in startup (I think?).

DaveCTurner commented 9 months ago

We are getting stuck because we fork a task to MasterService#threadPoolExecutor and that task simply vanishes without either executing, failing, or being rejected. Try adding this logging:

diff --git a/server/src/main/java/org/elasticsearch/cluster/service/MasterService.java b/server/src/main/java/org/elasticsearch/cluster/service/MasterService.java
index f037f4780b28..b3a2ae37f4a5 100644
--- a/server/src/main/java/org/elasticsearch/cluster/service/MasterService.java
+++ b/server/src/main/java/org/elasticsearch/cluster/service/MasterService.java
@@ -1209,6 +1209,7 @@ public class MasterService extends AbstractLifecycleComponent {
     private final Runnable queuesProcessor = new AbstractRunnable() {
         @Override
         public void doRun() {
+            logger.info("--> executing queuesProcessor");
             assert threadPool.getThreadContext().isSystemContext();
             assert totalQueueSize.get() > 0;
             assert currentlyExecutingBatch == null;
@@ -1261,6 +1262,7 @@ public class MasterService extends AbstractLifecycleComponent {

         @Override
         public void onRejection(Exception e) {
+            logger.info("--> rejected queuesProcessor", e);
             assert e instanceof EsRejectedExecutionException esre && esre.isExecutorShutdown() : e;
             drainQueueOnRejection(new FailedToCommitClusterStateException("node closed", e));
         }
@@ -1297,7 +1299,9 @@ public class MasterService extends AbstractLifecycleComponent {
         final var threadContext = threadPool.getThreadContext();
         try (var ignored = threadContext.stashContext()) {
             threadContext.markAsSystemContext();
+            logger.info("--> forking queuesProcessor");
             threadPoolExecutor.execute(queuesProcessor);
+            logger.info("--> forked queuesProcessor");
         }
     }

The first task from the master service logs what you'd expect:

[2024-01-11T09:35:03,610][INFO ][o.e.c.s.MasterService    ] [runTask-0] --> forking queuesProcessor
[2024-01-11T09:35:03,610][INFO ][o.e.c.s.MasterService    ] [runTask-0] --> forked queuesProcessor
[2024-01-11T09:35:03,610][INFO ][o.e.c.s.MasterService    ] [runTask-0] --> executing queuesProcessor

The second one is the problem: the logs just finish here:

[2024-01-11T09:35:03,705][INFO ][o.e.c.s.MasterService    ] [runTask-0] --> forking queuesProcessor
[2024-01-11T09:35:03,705][INFO ][o.e.c.s.MasterService    ] [runTask-0] --> forked queuesProcessor

DaveCTurner commented 9 months ago

Isolating just the problem with the executor service, the following test passes on JDK21 and fails on JDK22-ea+30-2287:

diff --git a/server/src/test/java/org/elasticsearch/common/util/concurrent/EsExecutorsTests.java b/server/src/test/java/org/elasticsearch/common/util/concurrent/EsExecutorsTests.java
index fb9bde31e8fc..2d6259b97bca 100644
--- a/server/src/test/java/org/elasticsearch/common/util/concurrent/EsExecutorsTests.java
+++ b/server/src/test/java/org/elasticsearch/common/util/concurrent/EsExecutorsTests.java
@@ -12,6 +12,7 @@ import org.elasticsearch.common.settings.Setting;
 import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.common.unit.Processors;
 import org.elasticsearch.test.ESTestCase;
+import org.elasticsearch.threadpool.ThreadPool;
 import org.hamcrest.Matcher;

 import java.util.Locale;
@@ -501,4 +502,39 @@ public class EsExecutorsTests extends ESTestCase {
         }
     }

+    public void testWtf() {
+        final var executorService = EsExecutors.newScaling(
+            "test",
+            0,
+            1,
+            60,
+            TimeUnit.SECONDS,
+            true,
+            EsExecutors.daemonThreadFactory("test"),
+            new ThreadContext(Settings.EMPTY)
+        );
+        try {
+            final var countDownLatch = new CountDownLatch(10);
+            class TestTask extends AbstractRunnable {
+                @Override
+                protected void doRun() {
+                    countDownLatch.countDown();
+                    if (countDownLatch.getCount() > 0) {
+                        executorService.execute(TestTask.this);
+                    }
+                }
+
+                @Override
+                public void onFailure(Exception e) {
+                    fail(e);
+                }
+            }
+
+            executorService.execute(new TestTask());
+            safeAwait(countDownLatch);
+        } finally {
+            ThreadPool.terminate(executorService, 10, TimeUnit.SECONDS);
+        }
+    }
+
 }

rjernst commented 9 months ago

With @ChrisHegarty's help we bisected JDK builds and tracked the change in behavior down to this change to LinkedTransferQueue: https://github.com/openjdk/jdk/commit/8d1ab57065c7ebcc650b5fb4ae098f8b0a35f112

ChrisHegarty commented 9 months ago

I'm going to distill this down to a JDK-only test case and report it to OpenJDK.

ChrisHegarty commented 9 months ago

JDK bug report: https://bugs.openjdk.org/browse/JDK-8323659
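
For anyone following along, my reading of that change is that LinkedTransferQueue.put/add now delegate to the overridable offer, which breaks subclasses that override offer to return false so the thread pool grows instead of queueing (which is, if I'm reading EsExecutors correctly, what Elasticsearch's scaling-executor queue does). A rough JDK-only sketch of the behavioral difference, illustrative only and not the test case filed with OpenJDK:

import java.util.concurrent.LinkedTransferQueue;

public class LtqPutDemo {
    // Hypothetical subclass standing in for a queue that rejects offer()
    // so a ThreadPoolExecutor is forced to spawn extra workers.
    static class RejectingQueue<E> extends LinkedTransferQueue<E> {
        @Override
        public boolean offer(E e) {
            return false;
        }
    }

    public static void main(String[] args) {
        RejectingQueue<String> queue = new RejectingQueue<>();
        // Older JDKs enqueue unconditionally here; if put() now routes
        // through the overridden offer(), the element is silently dropped.
        queue.put("task");
        System.out.println("size after put(): " + queue.size());
        // Expected on e.g. JDK 21.0.1: 1. On affected builds (21.0.2 / 22-ea pre-fix) this would print 0.
    }
}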

ChrisHegarty commented 9 months ago

Note: the underlying change that has caused this problem for Elasticsearch is also ported to JDK 21.0.2. I'm trying to determine if we can squeeze a fix into 21.0.2, but I'm not sure yet given it's due to release soon (Jan 16th) and I've not yet got a review for the fix in 22. We may need to start considering how to work around this in ES code :-( (I'll give it some thought.)
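
One shape such a workaround could take, purely as a sketch under the same assumption about the JDK change and not necessarily what any Elasticsearch PR ends up doing: have the subclass override put/add as well, so force-enqueueing bypasses the overridden offer:

import java.util.concurrent.LinkedTransferQueue;

// Hypothetical queue that rejects offer() to make a ThreadPoolExecutor grow,
// while still allowing a rejection handler to force-enqueue via put()/add().
class ForceableQueue<E> extends LinkedTransferQueue<E> {
    @Override
    public boolean offer(E e) {
        return false; // signal "no capacity" so the executor adds a worker
    }

    @Override
    public void put(E e) {
        // Bypass the overridden offer(): the superclass offer() always
        // succeeds because LinkedTransferQueue is unbounded.
        super.offer(e);
    }

    @Override
    public boolean add(E e) {
        return super.offer(e);
    }
}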

mark-vieira commented 9 months ago

Great work tracking this down!

ChrisHegarty commented 9 months ago

A change has been merged to Elasticsearch (https://github.com/elastic/elasticsearch/pull/104347) that works around this JDK bug. The change has been backported to various older ES release branches so that they will not run into issues when upgrading to JDK 21.0.2.

Additionally, the JDK bug has been fixed in JDK 22-EA and will be in the next JDK build promotion. The fix will be backported to JDK 21.0.3 (too late for 21.0.2 :-( ).

mark-vieira commented 9 months ago

> Note: the underlying change that has caused this problem for Elasticsearch is also ported to JDK 21.0.2.

And sure enough when our CI machines updated to this version we started seeing the same failures: https://gradle-enterprise.elastic.co/s/j6caqj7gvr63q

Those builds did not include the changes in https://github.com/elastic/elasticsearch/pull/104347, so we should be able to ignore them.

mark-vieira commented 9 months ago

So the openjdk22 jobs are looking a lot better, though we're still seeing the weird keystore CLI issue with reading from stdin. @rjernst are you still investigating this?

Also seeing some mockito issues which I presume just require an upgrade: https://gradle-enterprise.elastic.co/s/773dki5avib2a/tests/task/:modules:reindex:test/details/org.elasticsearch.reindex.TransportRethrottleActionTests/testRethrottleCatastrophicFailures?top-execution=1

mark-vieira commented 9 months ago

I've opened a PR for the mockito upgrade: https://github.com/elastic/elasticsearch/pull/104494

rjernst commented 7 months ago

AFAICT, we've been passing tests since Feb 16 (this is just part 1, but I've checked most of the openjdk22 jobs since Feb 16): https://buildkite.com/elastic/elasticsearch-periodic/builds/1756#018db04a-58cc-427c-9af3-15c2e8b5f80b

I'm not sure what changed around then, but I think this issue can be closed. @mark-vieira wdyt?

mark-vieira commented 7 months ago

I've re-enabled Java 22 testing on 8.12 and 7.17 branches. I had disabled them due to the noise. There may be fixes we've made that need to be backported there.

ChrisHegarty commented 7 months ago

Looking at the recent 8.13 jobs [1], there are some completely successful ones, but several with individual failures. Of the failed jobs, I either found existing issues or filed new ones. While there are some issues, the number seems relatively small, and they do not appear to be JDK 22 related.

[1] https://buildkite.com/elastic/elasticsearch-periodic/builds?branch=8.13

mark-vieira commented 7 months ago

> While there are some issues, the number seems relatively small, and they do not appear to be JDK 22 related.

This is my interpretation as well.

rjernst commented 7 months ago

Since we have addressed the major issues causing failures with Java 22, I'm going to close this issue. We can open new specific issues for any future jdk22 failures.