elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[CI] Incorrect watch count in watcher stats api in tests #52453

Closed martijnvg closed 4 years ago

martijnvg commented 4 years ago

SmokeTestWatcherTestSuiteIT Failure:

java.lang.AssertionError: Watch count (from _watcher/stats)
Expected: is <0>
     but: was <1>
    at __randomizedtesting.SeedInfo.seed([F316E92770C6654C:CE704783A70AC6]:0)
    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
    at org.junit.Assert.assertThat(Assert.java:956)
    at org.elasticsearch.smoketest.SmokeTestWatcherTestSuiteIT.assertWatchCount(SmokeTestWatcherTestSuiteIT.java:270)
    at org.elasticsearch.smoketest.SmokeTestWatcherTestSuiteIT.lambda$testMonitorClusterHealth$3(SmokeTestWatcherTestSuiteIT.java:170)
    at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:880)
    at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:853)
    at org.elasticsearch.smoketest.SmokeTestWatcherTestSuiteIT.testMonitorClusterHealth(SmokeTestWatcherTestSuiteIT.java:170)

This failure matches the recent failures reported in #32299. The fix in #51466 did not stop this test from failing.

The test has failed a few times now (CI failure search: class:SmokeTestWatcherWithSecurityIT OR class:SmokeTestWatcherTestSuiteIT) and needs to be re-investigated.

Build failures:

WatchAckTests.testAckAllActions failure:

Build log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob+fast+part2/3568/console

Build scan: https://gradle-enterprise.elastic.co/s/ua3yon2njbyja

Failure:

java.lang.AssertionError: 
12:01:55     Expected: is <1L>
12:01:55          but: was <0L>
12:01:55         at __randomizedtesting.SeedInfo.seed([6DEA8487F6CD3068:BAC8F0266D2192C2]:0)
12:01:55         at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
12:01:55         at org.junit.Assert.assertThat(Assert.java:956)
12:01:55         at org.junit.Assert.assertThat(Assert.java:923)
12:01:55         at org.elasticsearch.xpack.watcher.test.integration.WatchAckTests.testAckAllActions(WatchAckTests.java:136)
12:01:55 REPRODUCE WITH: ./gradlew ':x-pack:plugin:watcher:test' --tests "org.elasticsearch.xpack.watcher.test.integration.WatchAckTests.testAckAllActions" -Dtests.seed=6DEA8487F6CD3068 -Dtests.security.manager=true -Dtests.locale=th-TH-u-nu-thai-x-lvariant-TH -Dtests.timezone=Africa/Bamako -Dcompiler.java=13
12:01:55 
12:01:55 Suite: Test class org.elasticsearch.xpack.watcher.test.integration.WatchAckTests

Reproduce with:

./gradlew ':x-pack:plugin:watcher:test' --tests "org.elasticsearch.xpack.watcher.test.integration.WatchAckTests.testAckAllActions" -Dtests.seed=6DEA8487F6CD3068 -Dtests.security.manager=true -Dtests.locale=th-TH-u-nu-thai-x-lvariant-TH -Dtests.timezone=Africa/Bamako -Dcompiler.java=13

Can't reproduce locally.

elasticmachine commented 4 years ago

Pinging @elastic/es-core-features (:Core/Features/Watcher)

martijnvg commented 4 years ago

Reported by @dakrone in #33326:

09:55:26 
09:55:26 org.elasticsearch.smoketest.SmokeTestWatcherWithSecurityClientYamlTestSuiteIT > test {yaml=watcher/usage/10_basic/Test watcher usage stats output} FAILED
09:55:26     java.lang.AssertionError: Failure at [watcher/usage/10_basic:48]: field [watcher.count.active] is not greater than [$watch_count_active]
09:55:26     Expected: a value greater than <1>
09:55:26          but: <1> was equal to <1>
09:55:26         at __randomizedtesting.SeedInfo.seed([3AA5E1A0040CF825:B2F1DE7AAAF095DD]:0)
09:55:26         at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:405)
09:55:26         at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:382)
09:55:26         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
09:55:26         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

I was unable to reproduce this on the 7.x branch.

Failure: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+matrix-java-periodic/ES_RUNTIME_JAVA=zulu11,nodes=general-purpose/538/consoleFull https://gradle-enterprise.elastic.co/s/xxhruxec3jeji

martijnvg commented 4 years ago

This (^) is another test that failed because incorrect stats counts are reported.

I suspect the main cause of these failures is that Watcher is not fully started on all shard instances that it serves watches from. More specifically, the WatcherIndexingListener may be inactive for a specific shard. We could change the tests to ensure that Watcher is fully started, but alternatively we could change the put watch API to check whether the WatcherIndexingListener is active prior to indexing. If it is not ready, should it wait, similar to the timeout on an index request (which waits for enough shard copies to be ready prior to indexing)?
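To make the "wait until ready, with a timeout" idea concrete, here is a minimal standalone sketch. It is not Watcher code: the class `AwaitListenerActive`, the method `awaitActive`, and the `BooleanSupplier` standing in for "is the WatcherIndexingListener active for this shard" are all hypothetical names chosen for illustration. It only shows the general shape of polling a readiness check with a deadline before going ahead with the indexing step.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical illustration only; none of these names exist in Elasticsearch.
class AwaitListenerActive {

    // Polls isActive until it returns true or timeoutMillis has elapsed.
    // Returns true if the check passed in time, false on timeout or interruption.
    static boolean awaitActive(BooleanSupplier isActive, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        long sleepMillis = 1;
        while (true) {
            if (isActive.getAsBoolean()) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            long remainingMillis = Math.max(1, remainingNanos / 1_000_000L);
            try {
                // Back off between checks, but never sleep past the deadline.
                Thread.sleep(Math.min(sleepMillis, remainingMillis));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
            sleepMillis = Math.min(sleepMillis * 2, 100);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a listener that becomes active shortly after startup.
        AtomicBoolean listenerActive = new AtomicBoolean(false);
        Thread starter = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            listenerActive.set(true);
        });
        starter.setDaemon(true);
        starter.start();

        if (awaitActive(listenerActive::get, 5_000)) {
            System.out.println("listener active, safe to index the watch");
        } else {
            System.out.println("timed out waiting for the listener");
        }
    }
}
```

Whether such a wait belongs in the tests or in the put watch API itself is exactly the open question above; the sketch takes no position on that.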

mark-vieira commented 4 years ago

SmokeTestWatcherTestSuiteIT.testMonitorClusterHealth has failed twice more today. I assume that, given the nature of the underlying cause, muting isn't practical.

martijnvg commented 4 years ago

I want to see how these tests respond to #52627.

Otherwise I think we should investigate changing the watcher put and delete APIs to wait for the watch to be added to the trigger service before returning a response. Tests assume that this always happens, but that is not the case. In the meantime specific tests can be muted.
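As a rough sketch of what "wait for the watch to be added to the trigger service before returning a response" could look like, the example below models registration as an asynchronous step and has the caller respond only after that step completes. Everything here is an assumption for illustration: `AckAfterRegistration`, `putWatch`, `deleteWatch`, and `count` are hypothetical names, and the in-memory set merely stands in for the trigger service.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration only; this is not how Watcher is implemented.
class AckAfterRegistration {

    // Stand-in for the set of watches known to the trigger service.
    private final Set<String> registered = ConcurrentHashMap.newKeySet();

    // Registration happens on a separate thread, as it might after an indexing operation.
    private final ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "trigger-registration");
        t.setDaemon(true);
        return t;
    });

    // The returned future completes only after the watch has been registered.
    CompletableFuture<Void> putWatch(String id) {
        return CompletableFuture.runAsync(() -> registered.add(id), worker);
    }

    // The returned future completes only after the watch has been unregistered.
    CompletableFuture<Void> deleteWatch(String id) {
        return CompletableFuture.runAsync(() -> registered.remove(id), worker);
    }

    int count() {
        return registered.size();
    }

    void close() {
        worker.shutdown();
    }

    public static void main(String[] args) {
        AckAfterRegistration api = new AckAfterRegistration();

        // Respond to the caller only after registration has finished,
        // so a stats call made right after the response sees the watch.
        api.putWatch("cluster_health_watch").join();
        System.out.println("watch count after put: " + api.count());

        api.deleteWatch("cluster_health_watch").join();
        System.out.println("watch count after delete: " + api.count());

        api.close();
    }
}
```

If the put or delete call returned before the future completed, a stats request issued immediately afterwards could observe a stale count, which is the kind of mismatch the failing assertions in this issue report.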

imotov commented 4 years ago

@martijnvg it looks like it just failed https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=centos-7&&immutable/611/console

martijnvg commented 4 years ago

I will mute SmokeTestWatcherTestSuiteIT#testMonitorClusterHealth for now in master and 7.x.

Update: I've used the wrong issue id in the commit message 🤦‍♂ master, 7.x

jbaiera commented 4 years ago

There was a failure today related to https://github.com/elastic/elasticsearch/issues/33326, which this issue appears to replace.

https://gradle-enterprise.elastic.co/s/dhqipxj2iuyfc

jakelandis commented 4 years ago

There have been no instances of these failures since May 11. (There are a couple of SSL failures in a FIPS container ... but that is not what this issue is about.) This corresponds with #56556, which was introduced to help address issues like this.
