elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
69.94k stars 24.74k forks source link

CoreFullClusterRestartIT » testRollover {cluster=OLD} #97809

Closed kingherc closed 9 months ago

kingherc commented 1 year ago

CI Link

https://gradle-enterprise.elastic.co/s/3cqzv26af7vyq

Repro line

./gradlew ':x-pack:qa:full-cluster-restart:v7.17.12#bwcTest' -Dtests.class="org.elasticsearch.xpack.restart.CoreFullClusterRestartIT" -Dtests.method="testRollover {cluster=OLD}" -Dtests.seed=4B35DA6793FFAAD3 -Dtests.bwc=true -Dtests.locale=ca-ES -Dtests.timezone=America/Rankin_Inlet -Druntime.java=20

Does it reproduce?

Didn't try

Applicable branches

8.8

Failure history

No response

Failure excerpt

org.elasticsearch.xpack.restart.CoreFullClusterRestartIT > testRollover {cluster=OLD} FAILED    
    org.elasticsearch.client.ResponseException: method [GET], host [http://[::1]:40047], URI [/_cat/indices?v], status line [HTTP/1.1 404 Not Found]    
    {"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index [.geoip_databases]","index_uuid":"_na_","resource.type":"index_or_alias","resource.id":".geoip_databases","index":".geoip_databases"}],"type":"index_not_found_exception","reason":"no such index [.geoip_databases]","index_uuid":"_na_","resource.type":"index_or_alias","resource.id":".geoip_databases","index":".geoip_databases"},"status":404}   
        at __randomizedtesting.SeedInfo.seed([4B35DA6793FFAAD3:C028D71F7D53C62C]:0) 
        at app//org.elasticsearch.client.RestClient.convertResponse(RestClient.java:347)    
        at app//org.elasticsearch.client.RestClient.performRequest(RestClient.java:313) 
        at app//org.elasticsearch.client.RestClient.performRequest(RestClient.java:288) 
        at app//org.elasticsearch.upgrades.FullClusterRestartIT.testRollover(FullClusterRestartIT.java:637) 
        at java.base@20.0.2/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) 
        at java.base@20.0.2/java.lang.reflect.Method.invoke(Method.java:578)    
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)  
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)   
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)   
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)  
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)    
        at app//org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)  
        at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)    
        at app//org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)    
        at app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)  
        at app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)    
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)    
        at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)    
        at app//com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843) 
        at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490) 
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)    
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)   
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)   
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)   
        at app//org.elasticsearch.test.cluster.local.LocalElasticsearchCluster$1.evaluate(LocalElasticsearchCluster.java:39)    
        at app//org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)   
        at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)    
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)    
        at app//org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)  
        at app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)    
        at app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)    
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)    
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)    
        at app//org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)  
        at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)    
        at app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)    
        at app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)  
        at app//org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)  
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)    
        at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)    
        at app//com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)    
        at java.base@20.0.2/java.lang.Thread.run(Thread.java:1623)
elasticsearchmachine commented 1 year ago

Pinging @elastic/es-data-management (Team:Data Management)

joegallo commented 9 months ago

This is an interesting one.

https://github.com/elastic/elasticsearch/blob/fdc1dcddf359fc9e25c31d8f78820629374f8659/qa/full-cluster-restart/src/javaRestTest/java/org/elasticsearch/upgrades/FullClusterRestartIT.java#L631-L644

^ here's the code in question. We're calling GET _cat/indices?v and then the idea is to parse the output to see if the index we're expecting to exist does actually exist (i.e. the index called testrollover-000002).

But we get an error response!

org.elasticsearch.xpack.restart.CoreFullClusterRestartIT > testRollover {cluster=OLD} FAILED    
    org.elasticsearch.client.ResponseException: method [GET], host [http://[::1]:40047], URI [/_cat/indices?v], status line [HTTP/1.1 404 Not Found]    
    {"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index [.geoip_databases]","index_uuid":"_na_","resource.type":"index_or_alias","resource.id":".geoip_databases","index":".geoip_databases"}],"type":"index_not_found_exception","reason":"no such index [.geoip_databases]","index_uuid":"_na_","resource.type":"index_or_alias","resource.id":".geoip_databases","index":".geoip_databases"},"status":404}   
        at __randomizedtesting.SeedInfo.seed([4B35DA6793FFAAD3:C028D71F7D53C62C]:0) 
[...]

And here's a snippet of logs:

[2023-07-19T06:07:57,831][INFO ][o.e.i.g.GeoIpDownloader  ] [test-cluster-1] updating geoip databases
[2023-07-19T06:07:57,831][INFO ][o.e.i.g.GeoIpDownloader  ] [test-cluster-1] fetching geoip databases overview from [https://geoip.elastic.co/v1/database?elastic_geoip_service_tos=agree]
[2023-07-19T06:07:58,228][INFO ][o.e.c.m.MetadataCreateIndexService] [test-cluster-0] [testrollover-000002] creating index, cause [rollover_index], templates [], shards [1]/[1]
[2023-07-19T06:07:58,388][INFO ][o.e.c.m.MetadataCreateIndexService] [test-cluster-0] [.geoip_databases] creating index, cause [auto(bulk api)], templates [], shards [1]/[0]
[2023-07-19T06:07:58,389][INFO ][o.e.c.r.a.AllocationService] [test-cluster-0] updating number_of_replicas to [1] for indices [.geoip_databases]
[2023-07-19T06:07:58,477][INFO ][o.e.x.r.CoreFullClusterRestartIT] [testRollover] [cluster=OLD] after test

At this stage of things, I don't think that the .geoip_databases index is notably special -- sure it's a system index and it's behind an alias (IIRC), but the part that worries me here is that we're asking Elasticsearch for GET _cat/indices output and we're getting an error because of some kind of internal race condition around index creation.

joegallo commented 9 months ago

Okay, I reproduced this on a 3 node 7.17.2 cluster. Here's the stack trace:

{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index [2cac048b854d86e9e7706fa7]","index_uuid":"_na_","resource.type":"index_or_alias","resource.id":"2cac048b854d86e9e7706fa7","index":"2cac048b854d86e9e770[235/6125]
ck_trace":"[2cac048b854d86e9e7706fa7] IndexNotFoundException[no such index [2cac048b854d86e9e7706fa7]]
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver$WildcardExpressionResolver.indexNotFoundException(IndexNameExpressionResolver.java:1250)
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver$WildcardExpressionResolver.innerResolve(IndexNameExpressionResolver.java:1188)
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver$WildcardExpressionResolver.resolve(IndexNameExpressionResolver.java:1144)
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndices(IndexNameExpressionResolver.java:292)
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndexNames(IndexNameExpressionResolver.java:270)
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndexNames(IndexNameExpressionResolver.java:92)
    at org.elasticsearch.action.admin.indices.settings.get.TransportGetSettingsAction.checkBlock(TransportGetSettingsAction.java:64)
    at org.elasticsearch.action.admin.indices.settings.get.TransportGetSettingsAction.checkBlock(TransportGetSettingsAction.java:31)
    at org.elasticsearch.action.support.master.TransportMasterNodeAction.checkBlockIfStateRecovered(TransportMasterNodeAction.java:138)
    at org.elasticsearch.action.support.master.TransportMasterNodeAction.access$000(TransportMasterNodeAction.java:52)
    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.doStart(TransportMasterNodeAction.java:185)
    at org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:158)
    at org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:52)
    at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:179)
    at org.elasticsearch.action.support.ActionFilter$Simple.apply(ActionFilter.java:53)
    at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:177)
    at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$applyInternal$3(SecurityActionFilter.java:190)
    at org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:219)
    at org.elasticsearch.xpack.security.authz.AuthorizationService$1.onResponse(AuthorizationService.java:577)
    at org.elasticsearch.xpack.security.authz.AuthorizationService$1.onResponse(AuthorizationService.java:571)
    at org.elasticsearch.xpack.security.authz.interceptor.FieldAndDocumentLevelSecurityRequestInterceptor.intercept(FieldAndDocumentLevelSecurityRequestInterceptor.java:85)
    at org.elasticsearch.xpack.security.authz.interceptor.SearchRequestInterceptor.intercept(SearchRequestInterceptor.java:26)
    at org.elasticsearch.xpack.security.authz.AuthorizationService$1.onResponse(AuthorizationService.java:575)
    at org.elasticsearch.xpack.security.authz.AuthorizationService$1.onResponse(AuthorizationService.java:571)

I caught it by spinning up a cluster then running these two commands in different terminals:

while true; do index=$(openssl rand -hex 12); curl $ES_CREDS -XPUT $ES_URL/$index; echo; sleep 1; curl $ES_CREDS -XDELETE $ES_URL/$index; echo; done
while true; do curl $ES_CREDS "$ES_URL/_cat/indices?v&error_trace"; echo; done
joegallo commented 9 months ago

Here's a 8.11.3 stack trace -- I was hoping maybe https://github.com/elastic/elasticsearch/pull/93437 would have invalidated the circumstances where this bug happened (🤷) but no such luck.

{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index [77af693ebd14b44e71d1ddaf]","index_uuid":"_na_","resource.type":"index_or_alias","resource.id":"77af693ebd14b44e71d1ddaf","index":"77af693ebd14b44e71d1ddaf","stack_trace":"[77af693ebd14b44e71d1ddaf] org.elasticsearch.index.IndexNotFoundException: no such index [77af693ebd14b44e71d1ddaf]
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.notFoundException(IndexNameExpressionResolver.java:474)
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver$ExplicitResourceNameFilter.ensureAliasOrIndexExists(IndexNameExpressionResolver.java:1604)
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver$ExplicitResourceNameFilter.filterUnavailable(IndexNameExpressionResolver.java:1584)
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.resolveExpressions(IndexNameExpressionResolver.java:266)
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndices(IndexNameExpressionResolver.java:341)
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndexNames(IndexNameExpressionResolver.java:332)
    at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndexNames(IndexNameExpressionResolver.java:91)
    at org.elasticsearch.action.admin.indices.settings.get.TransportGetSettingsAction.checkBlock(TransportGetSettingsAction.java:70)
    at org.elasticsearch.action.admin.indices.settings.get.TransportGetSettingsAction.checkBlock(TransportGetSettingsAction.java:37)
    at org.elasticsearch.action.support.master.TransportMasterNodeAction.checkBlockIfStateRecovered(TransportMasterNodeAction.java:136)
    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.doStart(TransportMasterNodeAction.java:201)
    at org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:173)
    at org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:55)
    at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:87)
    at org.elasticsearch.action.support.ActionFilter$Simple.apply(ActionFilter.java:53)
    at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:85)
joegallo commented 9 months ago

At least on 8.11.3 it affects _cat/shards, too (I assume the same applies to previous versions).

{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index [47489aeb9ba1353816a4d4ab]","resource.type":"index_or_alias","resource.id":"47489aeb9ba1353816a4d4ab","index_uuid":"_na_","index":"47489aeb9ba1353816a4d4ab","stack_trace":"[47489aeb9ba1353816a4d4ab] org.elasticsearch.index.IndexNotFoundException: no such index [47489aeb9ba1353816a4d4ab]
    at org.elasticsearch.server@8.11.3/org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.notFoundException(IndexNameExpressionResolver.java:474)
    at org.elasticsearch.server@8.11.3/org.elasticsearch.cluster.metadata.IndexNameExpressionResolver$ExplicitResourceNameFilter.ensureAliasOrIndexExists(IndexNameExpressionResolver.java:1604)
    at org.elasticsearch.server@8.11.3/org.elasticsearch.cluster.metadata.IndexNameExpressionResolver$ExplicitResourceNameFilter.filterUnavailable(IndexNameExpressionResolver.java:1584)
    at org.elasticsearch.server@8.11.3/org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.resolveExpressions(IndexNameExpressionResolver.java:266)
    at org.elasticsearch.server@8.11.3/org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndices(IndexNameExpressionResolver.java:341)
    at org.elasticsearch.server@8.11.3/org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndexNames(IndexNameExpressionResolver.java:332)
    at org.elasticsearch.server@8.11.3/org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndexNames(IndexNameExpressionResolver.java:91)
    at org.elasticsearch.server@8.11.3/org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction.resolveConcreteIndexNames(TransportBroadcastByNodeAction.java:225)
    at org.elasticsearch.server@8.11.3/org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction.doExecuteForked(TransportBroadcastByNodeAction.java:244)
    at org.elasticsearch.server@8.11.3/org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction.lambda$doExecute$1(TransportBroadcastByNodeAction.java:232)
    at org.elasticsearch.server@8.11.3/org.elasticsearch.action.ActionListener.run(ActionListener.java:368)
    at org.elasticsearch.server@8.11.3/org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:96)
    at org.elasticsearch.server@8.11.3/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983)
    at org.elasticsearch.server@8.11.3/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)
joegallo commented 9 months ago

I opened https://github.com/elastic/elasticsearch/issues/104371 with the relevant details from this investigation, so now I don't feel bad working around that bug in this one test. Fix for this ticket coming right up.