Closed albertzaharovits closed 3 weeks ago
Pinging @elastic/es-core-infra (Team:Core/Infra)
ping @williamrandolph since you edited the test yesterday.
I don't know that I'll have time to get to this today, but I'll get a look as soon as I can. In the meantime I'm going to go ahead and mute. https://github.com/elastic/elasticsearch/pull/107630
judging by the thread dump it is blocked for 19minutes on:
client().prepareSearch(USER_INDEX)
.setQuery(QueryBuilders.matchAllQuery())
// Request times out if max concurrent shard requests is set to 1
.setMaxConcurrentShardRequests(usually() ? SearchRequest.DEFAULT_MAX_CONCURRENT_SHARD_REQUESTS : randomIntBetween(2, 10))
.get()
"TEST-KibanaThreadPoolIT.testUserThreadPoolsAreBlocked-seed#[AB39E5A32B6FC95C]" ID=34 WAITING on org.elasticsearch.action.support.PlainActionFuture$Sync@7c36647d
at java.base@17.0.2/jdk.internal.misc.Unsafe.park(Native Method)
- waiting on org.elasticsearch.action.support.PlainActionFuture$Sync@7c36647d
at java.base@17.0.2/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
at java.base@17.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:715)
at java.base@17.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1047)
at app//org.elasticsearch.action.support.PlainActionFuture$Sync.get(PlainActionFuture.java:273)
at app//org.elasticsearch.action.support.PlainActionFuture.get(PlainActionFuture.java:93)
at app//org.elasticsearch.client.internal.support.AbstractClient$RefCountedFuture.get(AbstractClient.java:1535)
at app//org.elasticsearch.client.internal.support.AbstractClient$RefCountedFuture.get(AbstractClient.java:1515)
at app//org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:45)
at app//org.elasticsearch.action.support.PlainActionFuture.actionGet(PlainActionFuture.java:152)
at app//org.elasticsearch.action.ActionRequestBuilder.get(ActionRequestBuilder.java:43)
at app//org.elasticsearch.indices.SystemIndexThreadPoolTestCase.lambda$assertThreadPoolsBlocked$3(SystemIndexThreadPoolTestCase.java:96)
at app//org.elasticsearch.indices.SystemIndexThreadPoolTestCase$$Lambda$3815/0x000000080161f060.run(Unknown Source)
at app//org.apache.lucene.tests.util.LuceneTestCase._expectThrows(LuceneTestCase.java:3040)
at app//org.apache.lucene.tests.util.LuceneTestCase.expectThrows(LuceneTestCase.java:2877)
at app//org.apache.lucene.tests.util.LuceneTestCase.expectThrows(LuceneTestCase.java:2868)
at app//org.elasticsearch.indices.SystemIndexThreadPoolTestCase.assertThreadPoolsBlocked(SystemIndexThreadPoolTestCase.java:90)
at app//org.elasticsearch.indices.SystemIndexThreadPoolTestCase$$Lambda$3783/0x000000080160ef40.run(Unknown Source)
at app//org.elasticsearch.indices.SystemIndexThreadPoolTestCase.runWithBlockedThreadPools(SystemIndexThreadPoolTestCase.java:66)
at app//org.elasticsearch.indices.SystemIndexThreadPoolTestCase.testUserThreadPoolsAreBlocked(SystemIndexThreadPoolTestCase.java:75)
at java.base@17.0.2/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base@17.0.2/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base@17.0.2/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
there is also this in logs:
1> [2024-04-24T15:39:25,570][WARN ][o.e.i.s.IndexShard ] [node_s0] [user_index][3] performing expensive diagnostic checks during shard startup [index.shard.check_on_startup=true]; these checks should only be enabled temporarily, you must remove this index setting as soon as possible
1> [2024-04-24T15:39:25,570][WARN ][o.e.i.s.IndexShard ] [node_s0] [user_index][0] performing expensive diagnostic checks during shard startup [index.shard.check_on_startup=true]; these checks should only be enabled temporarily, you must remove this index setting as soon as possible
1> [2024-04-24T15:39:25,570][WARN ][o.e.i.s.IndexShard ] [node_s0] [user_index][2] performing expensive diagnostic checks during shard startup [index.shard.check_on_startup=true]; these checks should only be enabled temporarily, you must remove this index setting as soon as possible
1> [2024-04-24T15:39:25,570][WARN ][o.e.i.s.IndexShard ] [node_s0] [user_index][1] performing expensive diagnostic checks during shard startup [index.shard.check_on_startup=true]; these checks should only be enabled temporarily, you must remove this index setting as soon as possible
1> [2024-04-24T15:39:25,671][WARN ][o.e.i.s.IndexShard ] [node_s0] [user_index][4] performing expensive diagnostic checks during shard startup [index.shard.check_on_startup=true]; these checks should only be enabled temporarily, you must remove this index setting as soon as possible
1> [2024-04-24T15:39:25,673][WARN ][o.e.i.s.IndexShard ] [node_s0] [user_index][5] performing expensive diagnostic checks during shard startup [index.shard.check_on_startup=true]; these checks should only be enabled temporarily, you must remove this index setting as soon as possible
1> [2024-04-24T15:39:25,675][WARN ][o.e.i.s.IndexShard ] [node_s0] [user_index][6] performing expensive diagnostic checks during shard startup [index.shard.check_on_startup=true]; these checks should only be enabled temporarily, you must remove this index setting as soon as possible
1> [2024-04-24T15:39:25,677][WARN ][o.e.i.s.IndexShard ] [node_s0] [user_index][7] performing expensive diagnostic checks during shard startup [index.shard.check_on_startup=true]; these checks should only be enabled temporarily, you must remove this index setting as soon as possible
1> [2024-04-24T15:39:25,726][WARN ][o.e.i.s.IndexShard ] [node_s0] [user_index][8] performing expensive diagnostic checks during shard startup [index.shard.check_on_startup=true]; these checks should only be enabled temporarily, you must remove this index setting as soon as possible
1> [2024-04-24T15:39:25,727][WARN ][o.e.i.s.IndexShard ] [node_s0] [user_index][9] performing expensive diagnostic checks during shard startup [index.shard.check_on_startup=true]; these checks should only be enabled temporarily, you must remove this index setting as soon as possible
I wonder if it is relevant
I don't think there were changes around this functionality recently. I would lower the status of this medium-risk
I am not sure if I got into the wrong direction here. but one way to reproduce this is to allow for the task to end up on a queue for search threadpool. so we would have to modify the fillThreadPoolQueues and run with smaller timeout ./gradlew ':modules:kibana:internalClusterTest' --tests "org.elasticsearch.kibana.KibanaThreadPoolIT" -Dtests.timeoutSuite=30000!
long queueSize = info.getQueueSize().singles();
queueSize = threadPoolName.equals("search") ?queueSize - 1 : queueSize;
for (int i = 0; i < queueSize; i++) {
however I cannot confirm how this would be even possible without this change. The search on user's index should fail - be rejected by the executor, but for some reason it got stuck.
Perhaps we could add some trace logging around submitting a task to check the queue size ? SearchService#runAsync
there was still a failure after the refactor. https://gradle-enterprise.elastic.co/s/y7yaqmdj7d22q/tests/task/:modules:kibana:internalClusterTest/details/org.elasticsearch.kibana.KibanaThreadPoolIT/testBlockedThreadPoolsRejectUserRequests?top-execution=1 I added more debugging
with the additional logging I cannot see anytask to be removed. the queues were also correctly filled with tasks from the test
Thread pool details get EsThreadPoolExecutor[name = node_s0/get, queue capacity = 1, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@50982f90[Running, pool size = 1, active threads = 1, queued tasks = 1, completed tasks = 0]]
[org.elasticsearch.kibana.KibanaThreadPoolIT$$Lambda/0x00007faf3fa5a760@4cd609d4]
Thread pool details write TaskExecutionTimeTrackingEsThreadPoolExecutor[name = node_s0/write, queue capacity = 1, task execution EWMA = 0s, total task execution time = 0s, org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@30e1ebe9[Running, pool size = 1, active threads = 1, queued tasks = 1, completed tasks = 0]]
[org.elasticsearch.kibana.KibanaThreadPoolIT$$Lambda/0x00007faf3fa5a760@4cd609d4]
Thread pool details search TaskExecutionTimeTrackingEsThreadPoolExecutor[name = node_s0/search, queue capacity = 1, task execution EWMA = 0s, total task execution time = 0s, org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@1a44ef92[Running, pool size = 1, active threads = 1, queued tasks = 1, completed tasks = 0]]
[org.elasticsearch.kibana.KibanaThreadPoolIT$$Lambda/0x00007faf3fa5a760@4cd609d4]
Thread pool details get EsThreadPoolExecutor[name = node_s1/get, queue capacity = 1, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@3088726f[Running, pool size = 1, active threads = 1, queued tasks = 1, completed tasks = 0]]
[org.elasticsearch.kibana.KibanaThreadPoolIT$$Lambda/0x00007faf3fa5a760@4cd609d4]
Thread pool details write TaskExecutionTimeTrackingEsThreadPoolExecutor[name = node_s1/write, queue capacity = 1, task execution EWMA = 0s, total task execution time = 0s, org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@6b99aea9[Running, pool size = 1, active threads = 1, queued tasks = 1, completed tasks = 0]]
[org.elasticsearch.kibana.KibanaThreadPoolIT$$Lambda/0x00007faf3fa5a760@4cd609d4]
Thread pool details search TaskExecutionTimeTrackingEsThreadPoolExecutor[name = node_s1/search, queue capacity = 1, task execution EWMA = 0s, total task execution time = 0s, org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@35444ad[Running, pool size = 1, active threads = 1, queued tasks = 1, completed tasks = 0]]
[org.elasticsearch.kibana.KibanaThreadPoolIT$$Lambda/0x00007faf3fa5a760@4cd609d4]
Thread pool details get EsThreadPoolExecutor[name = node_s2/get, queue capacity = 1, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@35b9007e[Running, pool size = 1, active threads = 1, queued tasks = 1, completed tasks = 0]]
[org.elasticsearch.kibana.KibanaThreadPoolIT$$Lambda/0x00007faf3fa5a760@4cd609d4]
Thread pool details write TaskExecutionTimeTrackingEsThreadPoolExecutor[name = node_s2/write, queue capacity = 1, task execution EWMA = 0s, total task execution time = 0s, org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@cabe7d3[Running, pool size = 1, active threads = 1, queued tasks = 1, completed tasks = 0]]
[org.elasticsearch.kibana.KibanaThreadPoolIT$$Lambda/0x00007faf3fa5a760@4cd609d4]
Thread pool details search TaskExecutionTimeTrackingEsThreadPoolExecutor[name = node_s2/search, queue capacity = 1, task execution EWMA = 0s, total task execution time = 0s, org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@7a02f741[Running, pool size = 1, active threads = 1, queued tasks = 1, completed tasks = 0]]
[org.elasticsearch.kibana.KibanaThreadPoolIT$$Lambda/0x00007faf3fa5a760@4cd609d4]
Thread pool details get EsThreadPoolExecutor[name = node_s3/get, queue capacity = 1, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@372b9f1[Running, pool size = 1, active threads = 1, queued tasks = 1, completed tasks = 0]]
[org.elasticsearch.kibana.KibanaThreadPoolIT$$Lambda/0x00007faf3fa5a760@4cd609d4]
Thread pool details write TaskExecutionTimeTrackingEsThreadPoolExecutor[name = node_s3/write, queue capacity = 1, task execution EWMA = 0s, total task execution time = 0s, org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@38a2baf2[Running, pool size = 1, active threads = 1, queued tasks = 1, completed tasks = 0]]
[org.elasticsearch.kibana.KibanaThreadPoolIT$$Lambda/0x00007faf3fa5a760@4cd609d4]
Thread pool details search TaskExecutionTimeTrackingEsThreadPoolExecutor[name = node_s3/search, queue capacity = 1, task execution EWMA = 0s, total task execution time = 0s, org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@4ac2fd93[Running, pool size = 1, active threads = 1, queued tasks = 1, completed tasks = 0]]
[org.elasticsearch.kibana.KibanaThreadPoolIT$$Lambda/0x00007faf3fa5a760@4cd609d4]
I have added additional debuging and confirmed that search tasks will 'force put' themselves on the threadpool queue https://gradle-enterprise.elastic.co/s/iscjlj2u2clcs
1> [2024-05-24T02:05:25,046][INFO ][o.e.a.s.TransportSearchAction] [node_s0] FORK FORK
1> [2024-05-24T02:05:25,047][INFO ][o.e.c.u.c.EsAbortPolicy ] [node_s0] FORCE PUT TimedRunnable{original=org.elasticsearch.action.search.AbstractSearchAsyncAction$2@6c666e45, creationTimeNanos=71614759915791, startTimeNanos=0, finishTimeNanos=-1, failedOrRejected=false} TaskExecutionTimeTrackingEsThreadPoolExecutor[name = node_s0/search, queue capacity = 1, task execution EWMA = 0s, total task execution time = 0s, org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@df9c20a[Running, pool size = 1, active threads = 1, queued tasks = 1, completed tasks = 0]]
1> [2024-05-24T02:05:45,015][DEBUG][o.e.k.KibanaThreadPoolIT ] [[Thread-5]] Thread pool details get EsThreadPoolExecutor[name = node_s0/get, queue capacity = 1, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@77778b3c[Running, pool size = 1, active threads = 1, queued tasks = 1, completed tasks = 0]]
1> [2024-05-24T02:05:45,016][DEBUG][o.e.k.KibanaThreadPoolIT ] [[Thread-5]] [org.elasticsearch.kibana.KibanaThreadPoolIT$$Lambda/0x00007e0001b06780@a3d81fc]
1> [2024-05-24T02:05:45,016][DEBUG][o.e.k.KibanaThreadPoolIT ] [[Thread-5]] Thread pool details write TaskExecutionTimeTrackingEsThreadPoolExecutor[name = node_s0/write, queue capacity = 1, task execution EWMA = 0s, total task execution time = 0s, org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@36701621[Running, pool size = 1, active threads = 1, queued tasks = 1, completed tasks = 0]]
1> [2024-05-24T02:05:45,016][DEBUG][o.e.k.KibanaThreadPoolIT ] [[Thread-5]] [org.elasticsearch.kibana.KibanaThreadPoolIT$$Lambda/0x00007e0001b06780@a3d81fc]
1> [2024-05-24T02:05:45,016][DEBUG][o.e.k.KibanaThreadPoolIT ] [[Thread-5]] Thread pool details search TaskExecutionTimeTrackingEsThreadPoolExecutor[name = node_s0/search, queue capacity = 1, task execution EWMA = 0s, total task execution time = 0s, org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor@df9c20a[Running, pool size = 1, active threads = 1, queued tasks = 2, completed tasks = 0]]
1> [2024-05-24T02:05:45,016][DEBUG][o.e.k.KibanaThreadPoolIT ] [[Thread-5]] [org.elasticsearch.kibana.KibanaThreadPoolIT$$Lambda/0x00007e0001b06780@a3d81fc, org.elasticsearch.action.search.AbstractSearchAsyncAction$2@6c666e45]
In a log snipet above you can see 2 tasks enqueued on a search threadpool when the queue was configured to only have size 1. https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/action/search/AbstractSearchAsyncAction.java#L398
a little more debugging and it turns out that when the test blocks it has to fall into the fork
branch https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/action/search/AbstractSearchAsyncAction.java#L831
this only happens when the originalThread = current thread = elasticsearch[node_s0][transport_worker][T#1]
when the handling of that failure is on a test thread
(why woudl that be?? something with the test setup?) it would pass
originalThread = elasticsearch[node_s0][transport_worker][T#1]
currentThread = TEST-KibanaThreadPoolIT.testBlockedThreadPoolsRejectUserRequests-seed#[3C76559446F8C5C]
stacktrace that is triggering that force execution
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNext(AbstractSearchAsyncAction.java:845) ~[elasticsearch-8.15.0-SNAPSHOT.jar:8.15.0-SNAPSHOT]
1> at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNext(AbstractSearchAsyncAction.java:839) ~[elasticsearch-8.15.0-SNAPSHOT.jar:8.15.0-SNAPSHOT]
1> at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:341) ~[elasticsearch-8.15.0-SNAPSHOT.jar:8.15.0-SNAPSHOT]
1> at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62) ~[elasticsearch-8.15.0-SNAPSHOT.jar:8.15.0-SNAPSHOT]
1> at org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73) ~[elasticsearch-8.15.0-SNAPSHOT.jar:8.15.0-SNAPSHOT]
1> at org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:31) ~[elasticsearch-8.15.0-SNAPSHOT.jar:8.15.0-SNAPSHOT]
1> at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:53) ~[elasticsearch-8.15.0-SNAPSHOT.jar:8.15.0-SNAPSHOT]
1> at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:636) ~[elasticsearch-8.15.0-SNAPSHOT.jar:8.15.0-SNAPSHOT]
1> at org.elasticsearch.transport.TransportService$UnregisterChildTransportResponseHandler.handleException(TransportService.java:1752) ~[elasticsearch-8.15.0-SNAPSHOT.jar:8.15.0-SNAPSHOT]
1> at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1476) ~[elasticsearch-8.15.0-SNAPSHOT.jar:8.15.0-SNAPSHOT]
I wonder if we are not observing a side effect of https://github.com/elastic/elasticsearch/issues/97916 here sometimes an execution is forked but sometimes it is executed in Test's thread https://github.com/elastic/elasticsearch/issues/97916 https://github.com/elastic/elasticsearch/pull/106915#discussion_r1544626612
A few failures since yesterday https://es-delivery-stats.elastic.dev/app/r/s/gJfOd , all with:
Build scan: https://gradle-enterprise.elastic.co/s/gjpjhnts54xoa/tests/:modules:kibana:internalClusterTest/org.elasticsearch.kibana.KibanaThreadPoolIT/testUserThreadPoolsAreBlocked
Reproduction line:
Applicable branches: main
Reproduces locally?: Didn't try
Failure history: Failure dashboard for
org.elasticsearch.kibana.KibanaThreadPoolIT#testUserThreadPoolsAreBlocked
&_a=(controlGroupInput:(chainingSystem:HIERARCHICAL,controlStyle:twoLine,ignoreParentSettings:(ignoreFilters:!f,ignoreQuery:!f,ignoreTimerange:!f,ignoreValidations:!t),panels:('0c0c9cb8-ccd2-45c6-9b13-96bac4abc542':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:task.keyword,grow:!t,id:'0c0c9cb8-ccd2-45c6-9b13-96bac4abc542',searchTechnique:wildcard,selectedOptions:!(),singleSelect:!t,title:'Gradle%20Task',width:medium),grow:!t,order:0,type:optionsListControl,width:small),'144933da-5c1b-4257-a969-7f43455a7901':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:name.keyword,grow:!t,id:'144933da-5c1b-4257-a969-7f43455a7901',searchTechnique:wildcard,selectedOptions:!('testUserThreadPoolsAreBlocked'),title:Test,width:medium),grow:!t,order:2,type:optionsListControl,width:medium),'4e6ad9d6-6fdc-4fcc-bf1a-aa6ca79e0850':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:className.keyword,grow:!t,id:'4e6ad9d6-6fdc-4fcc-bf1a-aa6ca79e0850',searchTechnique:wildcard,selectedOptions:!('org.elasticsearch.kibana.KibanaThreadPoolIT'),title:Suite,width:medium),grow:!t,order:1,type:optionsListControl,width:medium)))))Failure excerpt: