elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

rolling-upgrade:v8.12.0#oneThirdUpgradedTest IllegalStateException: failed to obtain node locks #101231

Open stu-elastic opened 8 months ago

stu-elastic commented 8 months ago

CI Link

https://gradle-enterprise.elastic.co/s/augsybdqwff3i

Repro line

:x-pack:plugin:shutdown:qa:rolling-upgrade:v8.12.0-1

Does it reproduce?

Didn't try

Applicable branches

main

Failure history

No response

Failure excerpt

» [2023-10-23T18:50:26,585][ERROR][o.e.b.Elasticsearch      ] [v8.12.0-1] fatal exception while booting Elasticsearch java.lang.IllegalStateException: failed to obtain node locks, tried [/dev/shm/elastic+elasticsearch+main+intake+multijob+bwc-snapshots/x-pack/plugin/shutdown/qa/rolling-upgrade/build/testclusters/v8.12.0-1/data]; maybe these locations are not writable or multiple nodes were started on the same data path?
»   at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:297)
»   at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.node.NodeConstruction.construct(NodeConstruction.java:484)
»   at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.node.NodeConstruction.prepareConstruction(NodeConstruction.java:244)
»   at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.node.Node.<init>(Node.java:181)
»   at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch$2.<init>(Elasticsearch.java:236)
»   at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch.initPhase3(Elasticsearch.java:236)
»   at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:73)
»  Caused by: org.apache.lucene.store.LockObtainFailedException: Lock held by another program: /dev/shm/elastic+elasticsearch+main+intake+multijob+bwc-snapshots/x-pack/plugin/shutdown/qa/rolling-upgrade/build/testclusters/v8.12.0-1/data/node.lock
»   at org.apache.lucene.core@9.8.0/org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:117)
»   at org.apache.lucene.core@9.8.0/org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:43)
»   at org.apache.lucene.core@9.8.0/org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:44)
»   at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.env.NodeEnvironment$NodeLock.<init>(NodeEnvironment.java:235)
»   at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.env.NodeEnvironment$NodeLock.<init>(NodeEnvironment.java:209)
»   at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:289)
»   ... 6 more
»  
»  ERROR: Elasticsearch did not exit normally - check the logs at /dev/shm/elastic+elasticsearch+main+intake+multijob+bwc-snapshots/x-pack/plugin/shutdown/qa/rolling-upgrade/build/testclusters/v8.12.0-1/logs/v8.12.0.log
»  
»  ERROR: Elasticsearch exited unexpectedly, with exit code 1

elasticsearchmachine commented 8 months ago

Pinging @elastic/es-delivery (Team:Delivery)

mark-vieira commented 8 months ago

@breskeby this sounds like it might be related to https://github.com/elastic/elasticsearch/pull/101069. From the cluster logs, it looks like we're attempting to start an already-started cluster, which would explain the error above. Perhaps the updated logic is losing track of clusters that are used across multiple tasks, as is the case for many BWC tests. My guess is that some state is getting confused when we upgrade nodes in a cluster.
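To illustrate the failure mode described above (a second start against a data path whose `node.lock` is still held), here is a minimal sketch. It uses only `java.nio` and is not the Elasticsearch or Lucene implementation; the class `NodeLockSketch`, its methods, and the in-JVM bookkeeping set are hypothetical names chosen for illustration. Only the `node.lock` file name and the wording of the messages are taken from the failure excerpt.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only: not Elasticsearch's NodeEnvironment or Lucene's lock factory.
class NodeLockSketch {

    // Lock files this JVM already holds, checked before touching the file system.
    private static final Set<Path> HELD_IN_THIS_JVM = ConcurrentHashMap.newKeySet();

    /** Handle for a held lock; closing it releases the lock. */
    static final class NodeLock implements AutoCloseable {
        private final Path lockFile;
        private final FileChannel channel;
        private final FileLock lock;

        private NodeLock(Path lockFile, FileChannel channel, FileLock lock) {
            this.lockFile = lockFile;
            this.channel = channel;
            this.lock = lock;
        }

        @Override
        public void close() throws IOException {
            try (FileChannel c = channel) { // the channel is closed after the release
                lock.release();
            } finally {
                HELD_IN_THIS_JVM.remove(lockFile);
            }
        }
    }

    /** Takes an exclusive lock on dataPath/node.lock or throws IllegalStateException. */
    static NodeLock obtain(Path dataPath) throws IOException {
        Files.createDirectories(dataPath);
        Path lockFile = dataPath.toAbsolutePath().normalize().resolve("node.lock");
        if (!HELD_IN_THIS_JVM.add(lockFile)) {
            throw new IllegalStateException("failed to obtain node lock, held by this JVM: " + lockFile);
        }
        FileChannel channel = null;
        try {
            channel = FileChannel.open(lockFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
            FileLock lock = channel.tryLock(); // null means another process holds the lock
            if (lock == null) {
                throw new IllegalStateException("failed to obtain node lock, held by another program: " + lockFile);
            }
            return new NodeLock(lockFile, channel, lock);
        } catch (IOException | RuntimeException e) {
            if (channel != null) {
                channel.close();
            }
            HELD_IN_THIS_JVM.remove(lockFile);
            throw e;
        }
    }

    /** Simulates starting a "node" twice on the same data path; true if the second start is rejected. */
    static boolean secondStartFails(Path dataPath) throws IOException {
        try (NodeLock first = obtain(dataPath)) {
            try (NodeLock second = obtain(dataPath)) {
                return false;
            } catch (IllegalStateException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dataPath = Files.createTempDirectory("node-lock-sketch");
        System.out.println("second start rejected: " + secondStartFails(dataPath));
    }
}
```

If the test fixture starts a node while the previous process for that data path is still running (or starts the same cluster twice), the second `obtain` call is the point where a failure like the one in the excerpt would surface.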

breskeby commented 8 months ago

@mark-vieira From a brief look at the logic we changed and the project in question, I couldn't see how that change affected this, and I wasn't able to reproduce it. I'll take a fresh look tomorrow, as it does seem related: we started seeing this failure after the change we made in #101069.

mark-vieira commented 6 months ago

@breskeby looks like this is still happening occasionally: https://github.com/elastic/elasticsearch/issues/103839

slobodanadamovic commented 5 months ago

Another failure today: https://gradle-enterprise.elastic.co/s/izhi63q6ustnw

joegallo commented 5 months ago

And another: https://gradle-enterprise.elastic.co/s/6ey6xm4uylriy

Note that this one was a failure of x-pack:plugin:eql:qa:ccs-rolling-upgrade:v8.13.0#oneThirdUpgraded, not the specific test indicated in the issue description. The "failed to obtain node locks" error and stack trace are present, though, so I thought it was fair to attach it to this issue.
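For triage across different tasks, the matching described above can be sketched as a simple check for the two markers from the failure excerpt. This is a hypothetical helper, not an existing tool in the repository; the class name `LockFailureSignature` and its method are assumptions, and only the two matched strings come from the excerpt.

```java
import java.util.List;

// Hypothetical triage helper: decides whether log lines carry this issue's signature.
class LockFailureSignature {

    /** True if the log contains both the node-lock message and the Lucene lock exception. */
    static boolean matches(List<String> logLines) {
        boolean nodeLockMessage = false;
        boolean luceneException = false;
        for (String line : logLines) {
            if (line.contains("failed to obtain node locks")) {
                nodeLockMessage = true;
            }
            if (line.contains("LockObtainFailedException")) {
                luceneException = true;
            }
        }
        return nodeLockMessage && luceneException;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "java.lang.IllegalStateException: failed to obtain node locks, tried [...]",
            "Caused by: org.apache.lucene.store.LockObtainFailedException: Lock held by another program"
        );
        System.out.println("same signature: " + matches(sample));
    }
}
```

Requiring both markers keeps unrelated `IllegalStateException` failures from being attached to this issue by mistake.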

martijnvg commented 3 months ago

I ran into this failure in a PR: https://gradle-enterprise.elastic.co/s/lwkluhs5zpwf6/console-log?page=3#L2846

I also noticed that it happened today on the main branch: https://gradle-enterprise.elastic.co/s/e4ca4ihilzigw/console-log?page=2#L1183

iverase commented 3 months ago

Another one today: https://gradle-enterprise.elastic.co/s/coocr6hsiw7ny

williamrandolph commented 3 months ago

We had one in the intake build on 17 March: https://gradle-enterprise.elastic.co/s/a4567iuaplgju/

benwtrent commented 3 months ago

Here is another intake build failure due to this: https://gradle-enterprise.elastic.co/s/h4gi5trbgx5rk

All three nodes crashed due to failing to obtain locks on their data paths.

mark-vieira commented 3 months ago

I think we want to move this pull request forward. The downside is it'll probably make the test execution a bit slower but I think the improvement in stability is probably worth it. I'll pick this back up.

DaveCTurner commented 2 months ago

https://gradle-enterprise.elastic.co/s/l34azxsevqole looks like another instance of this

nik9000 commented 2 months ago

https://gradle-enterprise.elastic.co/s/plqah3hab6t7e/console-log/raw

bpintea commented 2 months ago

https://gradle-enterprise.elastic.co/s/6qixyhadhi67a

kkrik-es commented 2 months ago

Another one: https://gradle-enterprise.elastic.co/s/q3tppdm2xbp3m

davidkyle commented 2 months ago

Same error for :qa:ccs-rolling-upgrade-remote-cluster:v8.15.0#twoThirdUpgraded

https://gradle-enterprise.elastic.co/s/nprfknz2niwso