elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
1.49k stars 24.88k forks source link

[CI] SimpleThreadPoolIT testThreadPoolMetrics failing #108320

Open jfreden opened 6 months ago

jfreden commented 6 months ago

Build scan: https://gradle-enterprise.elastic.co/s/yebknro47qsu4/tests/:server:internalClusterTest/org.elasticsearch.threadpool.SimpleThreadPoolIT/testThreadPoolMetrics

Reproduction line:

./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.threadpool.SimpleThreadPoolIT.testThreadPoolMetrics" -Dtests.seed=8FA8BE6EA7422389 -Dtests.locale=ja-JP-u-ca-japanese-x-lvariant-JP -Dtests.timezone=Europe/Zaporozhye -Druntime.java=21

Applicable branches: main

Reproduces locally?: Didn't try

Failure history: Failure dashboard for org.elasticsearch.threadpool.SimpleThreadPoolIT#testThreadPoolMetrics&_a=(controlGroupInput:(chainingSystem:HIERARCHICAL,controlStyle:twoLine,ignoreParentSettings:(ignoreFilters:!f,ignoreQuery:!f,ignoreTimerange:!f,ignoreValidations:!t),panels:('0c0c9cb8-ccd2-45c6-9b13-96bac4abc542':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:task.keyword,grow:!t,id:'0c0c9cb8-ccd2-45c6-9b13-96bac4abc542',searchTechnique:wildcard,selectedOptions:!(),singleSelect:!t,title:'Gradle%20Task',width:medium),grow:!t,order:0,type:optionsListControl,width:small),'144933da-5c1b-4257-a969-7f43455a7901':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:name.keyword,grow:!t,id:'144933da-5c1b-4257-a969-7f43455a7901',searchTechnique:wildcard,selectedOptions:!('testThreadPoolMetrics'),title:Test,width:medium),grow:!t,order:2,type:optionsListControl,width:medium),'4e6ad9d6-6fdc-4fcc-bf1a-aa6ca79e0850':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:className.keyword,grow:!t,id:'4e6ad9d6-6fdc-4fcc-bf1a-aa6ca79e0850',searchTechnique:wildcard,selectedOptions:!('org.elasticsearch.threadpool.SimpleThreadPoolIT'),title:Suite,width:medium),grow:!t,order:1,type:optionsListControl,width:medium)))))

Failure excerpt:

java.lang.AssertionError: 
Expected: map containing ["search.threads.queue.size"->iterable containing [a value equal to or greater than <1L>]]
     but: map was [<search.threads.active.current=[0]>, <search.threads.completed.total=[476]>, <search.threads.count.current=[7]>, <search.threads.largest.current=[7]>, <search.threads.queue.size=[0]>]

  at __randomizedtesting.SeedInfo.seed([8FA8BE6EA7422389:F9DF9EC5C9353D17]:0)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
  at org.elasticsearch.test.ESTestCase.assertThat(ESTestCase.java:2150)
  at org.elasticsearch.threadpool.SimpleThreadPoolIT.lambda$testThreadPoolMetrics$4(SimpleThreadPoolIT.java:188)
  at java.util.TreeMap.forEach(TreeMap.java:1317)
  at java.util.Collections$UnmodifiableMap.forEach(Collections.java:1707)
  at org.elasticsearch.threadpool.SimpleThreadPoolIT.lambda$testThreadPoolMetrics$5(SimpleThreadPoolIT.java:187)
  at java.lang.Iterable.forEach(Iterable.java:75)
  at org.elasticsearch.threadpool.SimpleThreadPoolIT.testThreadPoolMetrics(SimpleThreadPoolIT.java:161)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)
elasticsearchmachine commented 6 months ago

Pinging @elastic/es-core-infra (Team:Core/Infra)

pgomulka commented 6 months ago

this test is failing because of search.threads.queue.size being different in threadpool stats and metric.


[0006-05-03T15:16:51,033][INFO ][o.e.t.SimpleThreadPoolIT ] [testThreadPoolMetrics] Stats of `search`: {search.threads.active.current=0, search.threads.completed.total=475, search.threads.count.current=7, search.threads.largest.current=7, search.threads.queue.size=1} |  
-- | --
  | [0006-05-03T15:16:51,033][INFO ][o.e.t.SimpleThreadPoolIT ] [testThreadPoolMetrics] Measurements of `search`: {search.threads.active.current=[0], search.threads.completed.total=[476], search.threads.count.current=[7], search.threads.largest.current=[7], search.threads.queue.size=[0]}

we are waiting for the threadpool stats to report that there is no active thread. A line later we collect metric measurments. I suspect that in that moment there might be a new thread submitted hence the threadpool stat is reporting 0, when the metric mesurment is reporting 1.

@mosche wdyt? you worked on hardening this test before

pgomulka commented 6 months ago

I wonder if there is a way to reliably and gently shutdown a threadpool or EsIntegTest so that we have a 'frozen' es node that we can assert about

mosche commented 6 months ago

I see a few options:

pgomulka commented 6 months ago

test is failing due to threadpool stats vs apm metric discrepancy. most likely a timing issue assesing the risk to low

ywangd commented 5 months ago

Failed again https://gradle-enterprise.elastic.co/s/prkn746ltk5zw/tests/task/:server:internalClusterTest/details/org.elasticsearch.threadpool.SimpleThreadPoolIT/testThreadPoolMetrics?top-execution=1

elasticsearchmachine commented 1 month ago

This has been muted on branch main

Mute Reasons:

Build Scans:

elasticsearchmachine commented 1 month ago

This has been muted on branch 8.x

Mute Reasons:

Build Scans:

elasticsearchmachine commented 1 month ago

This has been muted on branch 8.x

Mute Reasons:

Build Scans:

elasticsearchmachine commented 3 weeks ago

This issue has been closed because it has been open for too long with no activity.

Any muted tests that were associated with this issue have been unmuted.

If the tests begin failing again, a new issue will be opened, and they may be muted again.

elasticsearchmachine commented 3 weeks ago

This has been muted on branch main

Mute Reasons:

Build Scans:

elasticsearchmachine commented 2 weeks ago

This has been muted on branch 8.x

Mute Reasons:

Build Scans:

elasticsearchmachine commented 3 days ago

This has been muted on branch 8.16

Mute Reasons:

Build Scans: