Pinging @elastic/es-delivery (Team:Delivery)
Pinging @elastic/es-core-infra (Team:Core/Infra)
@rjernst Looks like we have a number of issues on Java 22 EA builds.
The third party audit issue should be resolved by https://github.com/elastic/elasticsearch/pull/104010.
The cluster timeouts are definitely odd. For some reason the cluster never reports as yellow/green. The _health_report endpoint returned the following, and the logs don't show much either. The _cluster/health endpoint returned a status of red, while below we see a status of unknown. Not sure what that means.
{
"status": "unknown",
"cluster_name": "yamlRestTest",
"indicators": {
"master_is_stable": {
"status": "green",
"symptom": "The cluster has a stable master node",
"details": {
"current_master": {
"node_id": "LQw9CYzFSv6iqnLqK_1xzw",
"name": "yamlRestTest-0"
},
"recent_masters": [
{
"node_id": "LQw9CYzFSv6iqnLqK_1xzw",
"name": "yamlRestTest-0"
}
]
}
},
"repository_integrity": {
"status": "green",
"symptom": "No snapshot repositories configured."
},
"disk": {
"status": "unknown",
"symptom": "No disk usage data."
},
"shards_capacity": {
"status": "unknown",
"symptom": "Unable to determine shard capacity status."
},
"shards_availability": {
"status": "green",
"symptom": "This cluster has all shards available.",
"details": {
"unassigned_primaries": 0,
"restarting_replicas": 0,
"creating_primaries": 0,
"initializing_replicas": 0,
"unassigned_replicas": 0,
"started_primaries": 0,
"restarting_primaries": 0,
"initializing_primaries": 0,
"creating_replicas": 0,
"started_replicas": 0
}
},
"data_stream_lifecycle": {
"status": "green",
"symptom": "Data streams are executing their lifecycles without issues",
"details": {
"stagnating_backing_indices_count": 0,
"total_backing_indices_in_error": 0
}
},
"slm": {
"status": "green",
"symptom": "No Snapshot Lifecycle Management policies configured",
"details": {
"slm_status": "RUNNING",
"policies": 0
}
},
"ilm": {
"status": "green",
"symptom": "No Index Lifecycle Management policies configured",
"details": {
"policies": 0,
"stagnating_indices": 0,
"ilm_status": "RUNNING"
}
}
}
}
We see messages like the following, but I presume they are benign. Or perhaps Lucene doesn't actually play nicely with Java 22 at the moment?
[2024-01-05T15:16:36,770][WARN ][o.a.l.s.MMapDirectory ] [yamlRestTest-0] You are running with Java 22 or later. To make full use of MMapDirectory, please update Apache Lucene.
The keystore issues should be resolved by https://github.com/elastic/elasticsearch/pull/104102. It's possible that's the root cause of many other failures as well.
Ok, the keystore issue is not what I thought it was. The timeout issues are still the main problem here, and that is because the cluster never becomes yellow. I really need someone else to investigate this. There's nothing in the logs, and the _health_report isn't giving me useful information.
I'm seeing a few suspect pending tasks on the cluster:
{
"tasks": [
{
"insert_order": 6,
"priority": "URGENT",
"source": "reconfigure",
"executing": false,
"time_in_queue_millis": 339916,
"time_in_queue": "5.6m"
},
{
"insert_order": 2,
"priority": "HIGH",
"source": "cluster_reroute(post-join reroute)",
"executing": false,
"time_in_queue_millis": 339916,
"time_in_queue": "5.6m"
},
{
"insert_order": 3,
"priority": "NORMAL",
"source": "update snapshot after shards changed [false] or node configuration changed [true]",
"executing": false,
"time_in_queue_millis": 339916,
"time_in_queue": "5.6m"
},
{
"insert_order": 4,
"priority": "NORMAL",
"source": "SnapshotsService#updateNodeIdsToRemove",
"executing": false,
"time_in_queue_millis": 339916,
"time_in_queue": "5.6m"
},
{
"insert_order": 5,
"priority": "NORMAL",
"source": "local-gateway-elected-state",
"executing": false,
"time_in_queue_millis": 339916,
"time_in_queue": "5.6m"
}
]
}
I found this reproduces simply with ./gradlew run. I ran it with RUNTIME_JAVA_HOME pointing to JDK 22 and got similar pending tasks which never execute. Tracing this with the debugger, it seems the tasks are submitted to the executor but then never execute.
@DaveCTurner do you have any ideas on debugging further? I don't see why these queues would suddenly stop processing tasks, since they processed tasks earlier in startup (I think?).
We are getting stuck because we fork a task to MasterService#threadPoolExecutor which simply vanishes without either executing, failing, or being rejected. Try adding this logging:
diff --git a/server/src/main/java/org/elasticsearch/cluster/service/MasterService.java b/server/src/main/java/org/elasticsearch/cluster/service/MasterService.java
index f037f4780b28..b3a2ae37f4a5 100644
--- a/server/src/main/java/org/elasticsearch/cluster/service/MasterService.java
+++ b/server/src/main/java/org/elasticsearch/cluster/service/MasterService.java
@@ -1209,6 +1209,7 @@ public class MasterService extends AbstractLifecycleComponent {
private final Runnable queuesProcessor = new AbstractRunnable() {
@Override
public void doRun() {
+ logger.info("--> executing queuesProcessor");
assert threadPool.getThreadContext().isSystemContext();
assert totalQueueSize.get() > 0;
assert currentlyExecutingBatch == null;
@@ -1261,6 +1262,7 @@ public class MasterService extends AbstractLifecycleComponent {
@Override
public void onRejection(Exception e) {
+ logger.info("--> rejected queuesProcessor", e);
assert e instanceof EsRejectedExecutionException esre && esre.isExecutorShutdown() : e;
drainQueueOnRejection(new FailedToCommitClusterStateException("node closed", e));
}
@@ -1297,7 +1299,9 @@ public class MasterService extends AbstractLifecycleComponent {
final var threadContext = threadPool.getThreadContext();
try (var ignored = threadContext.stashContext()) {
threadContext.markAsSystemContext();
+ logger.info("--> forking queuesProcessor");
threadPoolExecutor.execute(queuesProcessor);
+ logger.info("--> forked queuesProcessor");
}
}
The first task from the master service logs what you'd expect:
[2024-01-11T09:35:03,610][INFO ][o.e.c.s.MasterService ] [runTask-0] --> forking queuesProcessor
[2024-01-11T09:35:03,610][INFO ][o.e.c.s.MasterService ] [runTask-0] --> forked queuesProcessor
[2024-01-11T09:35:03,610][INFO ][o.e.c.s.MasterService ] [runTask-0] --> executing queuesProcessor
The second one is the problem: the logs just finish here:
[2024-01-11T09:35:03,705][INFO ][o.e.c.s.MasterService ] [runTask-0] --> forking queuesProcessor
[2024-01-11T09:35:03,705][INFO ][o.e.c.s.MasterService ] [runTask-0] --> forked queuesProcessor
Isolating just the problem with the executor service, the following test passes on JDK21 and fails on JDK22-ea+30-2287:
diff --git a/server/src/test/java/org/elasticsearch/common/util/concurrent/EsExecutorsTests.java b/server/src/test/java/org/elasticsearch/common/util/concurrent/EsExecutorsTests.java
index fb9bde31e8fc..2d6259b97bca 100644
--- a/server/src/test/java/org/elasticsearch/common/util/concurrent/EsExecutorsTests.java
+++ b/server/src/test/java/org/elasticsearch/common/util/concurrent/EsExecutorsTests.java
@@ -12,6 +12,7 @@ import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.unit.Processors;
import org.elasticsearch.test.ESTestCase;
+import org.elasticsearch.threadpool.ThreadPool;
import org.hamcrest.Matcher;
import java.util.Locale;
@@ -501,4 +502,39 @@ public class EsExecutorsTests extends ESTestCase {
}
}
+ public void testWtf() {
+ final var executorService = EsExecutors.newScaling(
+ "test",
+ 0,
+ 1,
+ 60,
+ TimeUnit.SECONDS,
+ true,
+ EsExecutors.daemonThreadFactory("test"),
+ new ThreadContext(Settings.EMPTY)
+ );
+ try {
+ final var countDownLatch = new CountDownLatch(10);
+ class TestTask extends AbstractRunnable {
+ @Override
+ protected void doRun() {
+ countDownLatch.countDown();
+ if (countDownLatch.getCount() > 0) {
+ executorService.execute(TestTask.this);
+ }
+ }
+
+ @Override
+ public void onFailure(Exception e) {
+ fail(e);
+ }
+ }
+
+ executorService.execute(new TestTask());
+ safeAwait(countDownLatch);
+ } finally {
+ ThreadPool.terminate(executorService, 10, TimeUnit.SECONDS);
+ }
+ }
+
}
With @ChrisHegarty's help we bisected JDK builds, and tracked the change in behavior down to this change to LinkedTransferQueue: https://github.com/openjdk/jdk/commit/8d1ab57065c7ebcc650b5fb4ae098f8b0a35f112
I'm going to distill this down to a JDK-only test case and report it to OpenJDK.
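For anyone following along, here's a rough sketch of what such a JDK-only reproducer might look like, under the assumption that the linked change made LinkedTransferQueue's put()/add() delegate to the overridable offer() (the class below is hypothetical; it just mimics the offer() override in the EsExecutors scaling queue):

import java.util.concurrent.LinkedTransferQueue;

public class LtqPutReproducer {

    // Mimics a work queue that overrides offer() to hand tasks directly to an
    // idle consumer and decline otherwise, so the pool is forced to scale up.
    static class DecliningQueue<E> extends LinkedTransferQueue<E> {
        @Override
        public boolean offer(E e) {
            return tryTransfer(e); // false when no consumer is waiting
        }
    }

    public static void main(String[] args) {
        DecliningQueue<String> queue = new DecliningQueue<>();
        // put() on an unbounded queue must always enqueue. If put() delegates
        // to the overridden offer() above, the element is silently dropped.
        queue.put("task");
        System.out.println("size after put(): " + queue.size()); // expect 1; 0 on affected builds
    }
}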
JDK bug report: https://bugs.openjdk.org/browse/JDK-8323659
Note: the underlying change that has caused this problem for Elasticsearch has also been ported to JDK 21.0.2. I'm trying to determine if we can squeeze a fix into 21.0.2, but I'm not yet sure, given it's due to release soon (Jan 16th) and I've not yet got a review for the fix in 22. We may need to start considering how to work around this in ES code :-( (I'll give it some thought)
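If we do need an ES-side workaround, one possible shape (just a sketch under the same assumption as above, with a hypothetical class name, and not necessarily what an actual fix would look like) is to stop put()/add() from being diverted into the declining offer() override, since super calls bind statically to LinkedTransferQueue's always-succeeding offer():

import java.util.concurrent.LinkedTransferQueue;

// Hypothetical sketch of a scaling-executor queue that keeps the declining
// offer() behaviour but makes put()/add() enqueue unconditionally again.
class SafeScalingQueue<E> extends LinkedTransferQueue<E> {

    @Override
    public boolean offer(E e) {
        return tryTransfer(e); // may decline, to force the pool to add workers
    }

    // super.offer() binds statically to LinkedTransferQueue.offer(), which
    // always enqueues on an unbounded queue, so these methods can no longer
    // be short-circuited by the override above.
    @Override
    public void put(E e) {
        super.offer(e);
    }

    @Override
    public boolean add(E e) {
        return super.offer(e);
    }
}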
Great work tracking this down!
A change has been merged to Elasticsearch (https://github.com/elastic/elasticsearch/pull/104347) that works around this JDK bug. The change has been backported to various older ES release branches so that they will not run into issues when upgrading to JDK 21.0.2.
Additionally, the JDK bug has been fixed in JDK 22-EA and will be in the next JDK build promotion. The fix will be backported to JDK 21.0.3 (too late for 21.0.2 :-( ).
Note: the underlying change that has caused this problem for Elasticsearch is also ported to JDK 21.0.2.
And sure enough, when our CI machines updated to this version we started seeing the same failures: https://gradle-enterprise.elastic.co/s/j6caqj7gvr63q
Those builds did not include the changes in https://github.com/elastic/elasticsearch/pull/104347, so we should be able to ignore them.
So the openjdk22 jobs are looking a lot better; still seeing the weird keystore CLI issue with reading from stdin, though. @rjernst are you still investigating this?
Also seeing some mockito issues which I presume just require an upgrade: https://gradle-enterprise.elastic.co/s/773dki5avib2a/tests/task/:modules:reindex:test/details/org.elasticsearch.reindex.TransportRethrottleActionTests/testRethrottleCatastrophicFailures?top-execution=1
I've opened a PR for the mockito upgrade: https://github.com/elastic/elasticsearch/pull/104494
AFAICT, we've been passing tests since Feb 16 (this is just part 1, but I've checked most of the openjdk22 jobs since Feb 16): https://buildkite.com/elastic/elasticsearch-periodic/builds/1756#018db04a-58cc-427c-9af3-15c2e8b5f80b
I'm not sure what happened around then, but I think this issue can be closed. @mark-vieira wdyt?
I've re-enabled Java 22 testing on the 8.12 and 7.17 branches; I had disabled it due to the noise. There may be fixes we've made that need to be backported there.
Looking at the recent 8.13 jobs [1], there are some completely successful ones, but several have individual failures. Of the failed jobs, I either found existing issues or filed new ones. While there are some issues, the number seems relatively small, and they do not appear to be JDK 22 related.
[1] https://buildkite.com/elastic/elasticsearch-periodic/builds?branch=8.13
While there are some issues, the number seems relatively small, and they do not appear to be JDK 22 related.
This is my interpretation as well.
Since we have addressed the major issues causing failures with Java 22, I'm going to close this issue. We can open new, specific issues for any future JDK 22 failures.
CI Link
https://gradle-enterprise.elastic.co/s/iamuhpqjsu2mc
Repro line
N/A
Does it reproduce?
Didn't try
Applicable branches
main, 8.12 etc.
Failure history
No response
Failure excerpt
Since yesterday, there seem to be a lot of build scan failures around openjdk22. I am opening a single ticket in the hope there's some single cause for all of them.
Here are some examples of the broad types I've observed.
elasticsearch / periodic / main / openjdk22 / checkPart1 / java-matrix
elasticsearch / periodic / main / openjdk22 / checkRestCompat / java-matrix
elasticsearch / periodic / main / openjdk22 / bwcTestSnapshots / java-matrix
elasticsearch / periodic / 8.12 / openjdk22 / checkRestCompat / java-matrix
elasticsearch / periodic / 8.11 / openjdk22 / checkPart3 / java-matrix
And more like the above: