fabric8io / openshift-elasticsearch-plugin

Apache License 2.0

Clusters that take a while to recover or can't recover past red may not initialize #39

Closed ewolinetz closed 7 years ago

ewolinetz commented 8 years ago

https://github.com/fabric8io/openshift-elasticsearch-plugin/blob/master/src/main/java/io/fabric8/elasticsearch/plugin/acl/DynamicACLFilter.java#L357

This line may prevent us from ever initializing the Searchguard config. We should at least ensure the cluster is up and active, but we may not need or want to wait until it is yellow before we start seeding.

We should, however, ensure that the index we create is yellow before we continue with seeding.
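
A minimal sketch of the gating described above, assuming a hypothetical `SeedGate` helper that is not part of the plugin, Searchguard, or Elasticsearch. The two `Supplier` probes stand in for whatever health lookups the plugin would really perform, and a probe returning `null` is an assumed convention for "cluster not reachable yet". The idea is that any cluster health, even red, is enough to proceed, while the config index itself must reach at least yellow before seeding.

```java
import java.util.function.Supplier;

/** Hypothetical helper for illustration only; all names here are assumptions. */
final class SeedGate {

    /** Health levels ordered from worst to best, mirroring the red/yellow/green colors. */
    enum Health { RED, YELLOW, GREEN }

    private SeedGate() {}

    /**
     * Polls the probe until it reports at least the wanted health,
     * giving up after maxAttempts polls. A null result means "no answer yet".
     */
    static boolean waitForAtLeast(Supplier<Health> probe, Health wanted,
                                  int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Health current = probe.get();
            if (current != null && current.compareTo(wanted) >= 0) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    // Preserve the interrupt flag and stop waiting.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    /**
     * The cluster only has to answer (RED is acceptable), but the
     * config index must be at least YELLOW before seeding starts.
     */
    static boolean readyToSeed(Supplier<Health> clusterProbe,
                               Supplier<Health> configIndexProbe,
                               int maxAttempts, long sleepMillis) {
        return waitForAtLeast(clusterProbe, Health.RED, maxAttempts, sleepMillis)
            && waitForAtLeast(configIndexProbe, Health.YELLOW, maxAttempts, sleepMillis);
    }

    public static void main(String[] args) {
        // Cluster stuck at RED because of an unrelated broken index,
        // while the config index itself has reached YELLOW.
        boolean ready = readyToSeed(() -> Health.RED, () -> Health.YELLOW, 3, 10L);
        System.out.println("ready to seed: " + ready);
    }
}
```

With this split, a cluster that never gets past red because of an unrelated unrecoverable index would no longer block seeding, as long as the config index is healthy enough. The bounded number of attempts is a design choice for the sketch; a real implementation might retry indefinitely or use a listener instead of polling.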

ewolinetz commented 8 years ago

This can be reproduced by deleting crucial index files while ES is down, which prevents that index from recovering fully.

[2016-09-08 15:16:17,588][WARN ][indices.cluster          ] [Merlin] [[rails.d7dec85c-75d1-11e6-9a7c-0efd4172c6af.2016.09.08][0]] marking and sending shard failed due to [failed recovery]
[rails.d7dec85c-75d1-11e6-9a7c-0efd4172c6af.2016.09.08][[rails.d7dec85c-75d1-11e6-9a7c-0efd4172c6af.2016.09.08][0]] IndexShardRecoveryException[failed to fetch index version after copying it over]; nested: IndexShardRecoveryException[shard allocated for local recovery (post api), should exist, but doesn't, current files: [write.lock, segments_3]]; nested: NoSuchFileException[/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/indices/rails.d7dec85c-75d1-11e6-9a7c-0efd4172c6af.2016.09.08/0/index/_0.si];
    at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:224)
    at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
    at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: [rails.d7dec85c-75d1-11e6-9a7c-0efd4172c6af.2016.09.08][[rails.d7dec85c-75d1-11e6-9a7c-0efd4172c6af.2016.09.08][0]] IndexShardRecoveryException[shard allocated for local recovery (post api), should exist, but doesn't, current files: [write.lock, segments_3]]; nested: NoSuchFileException[/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/indices/rails.d7dec85c-75d1-11e6-9a7c-0efd4172c6af.2016.09.08/0/index/_0.si];
    at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:208)
    ... 5 more
Caused by: java.nio.file.NoSuchFileException: /elasticsearch/persistent/logging-es/data/logging-es/nodes/0/indices/rails.d7dec85c-75d1-11e6-9a7c-0efd4172c6af.2016.09.08/0/index/_0.si
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
    at java.nio.channels.FileChannel.open(FileChannel.java:287)
    at java.nio.channels.FileChannel.open(FileChannel.java:335)
    at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:81)
    at org.apache.lucene.store.FileSwitchDirectory.openInput(FileSwitchDirectory.java:186)
    at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:89)
    at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:89)
    at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:109)
    at org.apache.lucene.codecs.lucene50.Lucene50SegmentInfoFormat.read(Lucene50SegmentInfoFormat.java:82)
    at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:362)
    at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:493)
    at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:490)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:731)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683)
    at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:490)
    at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:95)
    at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:163)
    at org.elasticsearch.index.store.Store.readLastCommittedSegmentsInfo(Store.java:148)
    at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:199)
    ... 5 more
[2016-09-08 15:16:17,613][WARN ][cluster.action.shard     ] [Merlin] [rails.d7dec85c-75d1-11e6-9a7c-0efd4172c6af.2016.09.08][0] received shard failed for target shard [[rails.d7dec85c-75d1-11e6-9a7c-0efd4172c6af.2016.09.08][0], node[1qlu7yi2SkKAIVom4HpMTw], [P], v[7], s[INITIALIZING], a[id=C8w7VeEMT_aeQgcGAwTzFg], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-09-08T15:16:16.834Z]]], indexUUID [YEH4LxehSgC5Zf0u-Lsf6A], message [failed recovery], failure [IndexShardRecoveryException[failed to fetch index version after copying it over]; nested: IndexShardRecoveryException[shard allocated for local recovery (post api), should exist, but doesn't, current files: [write.lock, segments_3]]; nested: NoSuchFileException[/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/indices/rails.d7dec85c-75d1-11e6-9a7c-0efd4172c6af.2016.09.08/0/index/_0.si]; ]
[rails.d7dec85c-75d1-11e6-9a7c-0efd4172c6af.2016.09.08][[rails.d7dec85c-75d1-11e6-9a7c-0efd4172c6af.2016.09.08][0]] IndexShardRecoveryException[failed to fetch index version after copying it over]; nested: IndexShardRecoveryException[shard allocated for local recovery (post api), should exist, but doesn't, current files: [write.lock, segments_3]]; nested: NoSuchFileException[/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/indices/rails.d7dec85c-75d1-11e6-9a7c-0efd4172c6af.2016.09.08/0/index/_0.si];
    at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:224)
    at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
    at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: [rails.d7dec85c-75d1-11e6-9a7c-0efd4172c6af.2016.09.08][[rails.d7dec85c-75d1-11e6-9a7c-0efd4172c6af.2016.09.08][0]] IndexShardRecoveryException[shard allocated for local recovery (post api), should exist, but doesn't, current files: [write.lock, segments_3]]; nested: NoSuchFileException[/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/indices/rails.d7dec85c-75d1-11e6-9a7c-0efd4172c6af.2016.09.08/0/index/_0.si];
    at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:208)
    ... 5 more
Caused by: java.nio.file.NoSuchFileException: /elasticsearch/persistent/logging-es/data/logging-es/nodes/0/indices/rails.d7dec85c-75d1-11e6-9a7c-0efd4172c6af.2016.09.08/0/index/_0.si
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
    at java.nio.channels.FileChannel.open(FileChannel.java:287)
    at java.nio.channels.FileChannel.open(FileChannel.java:335)
    at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:81)
    at org.apache.lucene.store.FileSwitchDirectory.openInput(FileSwitchDirectory.java:186)
    at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:89)
    at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:89)
    at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:109)
    at org.apache.lucene.codecs.lucene50.Lucene50SegmentInfoFormat.read(Lucene50SegmentInfoFormat.java:82)
    at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:362)
    at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:493)
    at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:490)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:731)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683)
    at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:490)
    at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:95)
    at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:163)
    at org.elasticsearch.index.store.Store.readLastCommittedSegmentsInfo(Store.java:148)
    at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:199)
    ... 5 more

[2016-09-08 15:16:49,622][ERROR][com.floragunn.searchguard.auth.BackendRegistry] Not yet initialized
[2016-09-08 15:16:50,759][ERROR][com.floragunn.searchguard.auth.BackendRegistry] Not yet initialized
[2016-09-08 15:16:51,879][ERROR][com.floragunn.searchguard.auth.BackendRegistry] Not yet initialized
[2016-09-08 15:16:52,990][ERROR][com.floragunn.searchguard.auth.BackendRegistry] Not yet initialized
[2016-09-08 15:16:54,104][ERROR][com.floragunn.searchguard.auth.BackendRegistry] Not yet initialized
[2016-09-08 15:16:55,217][ERROR][com.floragunn.searchguard.auth.BackendRegistry] Not yet initialized
[2016-09-08 15:16:56,327][ERROR][com.floragunn.searchguard.auth.BackendRegistry] Not yet initialized

ewolinetz commented 7 years ago

IIRC this has been resolved by #40