elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Closing an index with unassigned shards leaves it in an inconsistent state #101031

Open DaveCTurner opened 1 year ago

DaveCTurner commented 1 year ago

Copied from this comment: if an index has unassigned shards then we cannot complete the coordinated flush and block that happens before closing the index, leaving these shards with translog to replay if/when they recover:

java.lang.AssertionError: max seq. no. [419] does not match [325]
    at __randomizedtesting.SeedInfo.seed([F11392CA75807FB3]:0)
    at org.elasticsearch.index.engine.ReadOnlyEngine.assertMaxSeqNoEqualsToGlobalCheckpoint(ReadOnlyEngine.java:194)
    at org.elasticsearch.index.engine.ReadOnlyEngine.ensureMaxSeqNoEqualsToGlobalCheckpoint(ReadOnlyEngine.java:180)
    at org.elasticsearch.index.engine.ReadOnlyEngine.<init>(ReadOnlyEngine.java:122)
    at org.elasticsearch.index.engine.NoOpEngine.<init>(NoOpEngine.java:63)
    at org.elasticsearch.index.engine.NoOpEngine.<init>(NoOpEngine.java:49)
    at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:2005)
    at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1969)
    at org.elasticsearch.index.shard.StoreRecovery.lambda$internalRecoverFromStore$9(StoreRecovery.java:479)
    at org.elasticsearch.action.ActionListenerImplementations$ResponseWrappingActionListener.onResponse(ActionListenerImplementations.java:236)
    at org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.onResponse(ActionListenerImplementations.java:182)
    at org.elasticsearch.index.CompositeIndexEventListener.callListeners(CompositeIndexEventListener.java:278)
    at org.elasticsearch.index.CompositeIndexEventListener.iterateBeforeIndexShardRecovery(CompositeIndexEventListener.java:287)
    at org.elasticsearch.index.CompositeIndexEventListener.beforeIndexShardRecovery(CompositeIndexEventListener.java:314)
    at org.elasticsearch.index.shard.IndexShard.preRecovery(IndexShard.java:1681)
    at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:410)
    at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:92)
    at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:2338)
    at org.elasticsearch.action.ActionRunnable$3.doRun(ActionRunnable.java:73)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1623)

This reproduces fairly readily as follows:

public void testCloseWhileIndexingDocumentsAndRestartingNode() throws Exception {
    final var dataNode1 = internalCluster().startDataOnlyNode();
    final var dataNode2 = internalCluster().startDataOnlyNode();

    final String indexName = randomAlphaOfLength(10).toLowerCase(Locale.ROOT);
    createIndex(
        indexName,
        indexSettings(1, 1).putList(IndexMetadata.INDEX_ROUTING_INCLUDE_GROUP_PREFIX + "._name", dataNode1, dataNode2)
            .put(SETTING_WAIT_FOR_ACTIVE_SHARDS.getKey(), "0")
            .build()
    );
    ensureGreen(indexName);

    try (BackgroundIndexer indexer = new BackgroundIndexer(indexName, client(), MAX_DOCS)) {
        indexer.setFailureAssertion(t -> {
            if ((t instanceof NodeClosedException
                || t instanceof EsRejectedExecutionException
                || t instanceof UnavailableShardsException) == false) {
                assertException(t, indexName);
            }
        });
        waitForDocs(randomIntBetween(10, 50), indexer);

        internalCluster().restartNode(dataNode1, new InternalTestCluster.RestartCallback() {
            @Override
            public Settings onNodeStopped(String nodeName) {
                try {
                    internalCluster().restartNode(dataNode2, new InternalTestCluster.RestartCallback() {
                        @Override
                        public Settings onNodeStopped(String nodeName) {
                            try {
                                indexer.stopAndAwaitStopped();
                                assertTrue(indicesAdmin().prepareClose(indexName).setWaitForActiveShards(0).get().isAcknowledged());
                            } catch (Exception e) {
                                throw new AssertionError(e);
                            }
                            return Settings.EMPTY;
                        }
                    });
                } catch (Exception e) {
                    throw new AssertionError(e);
                }
                return Settings.EMPTY;
            }
        });
    }

    assertIndexIsClosed(indexName);
    ensureGreen(indexName);
    assertAcked(indicesAdmin().prepareOpen(indexName));
    ensureGreen(indexName);
}

This is a bit of a tricky one: we might be closing an index with unassigned shards precisely because we want to recover it from a snapshot, so we can't simply reject close requests until the index is healthy.

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-data-management (Team:Data Management)

DaveCTurner commented 1 year ago

This is a bit of a tricky one: we might be closing an index with unassigned shards precisely because we want to recover it from a snapshot, so we can't simply reject close requests until the index is healthy.

Henning and I discussed this further. We identified three specific uses for the close index API today, and think we see better alternatives in all cases:

Non-dynamic IndexMetadata updates

Users will close an index in order to adjust the index metadata in a way that we do not (or cannot) support on an open index. Notably this includes changing non-dynamic index settings, but possibly there are other reasons for an index metadata update to require us to restart the whole IndexService and all the shards.

In this case we think there's no particularly good reason to close the index as a separate step before adjusting the metadata. We believe it should work to apply the metadata update and mark all the shards as UNASSIGNED in the routing table in a single cluster state update. We think we could add a ?reopen=true option to the update-index-settings API (and any other relevant APIs) to let users indicate that they want the metadata update to also restart the index, much as if they had closed and reopened it, but without the extra API calls or tricky intermediate states.
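For concreteness, a sketch of the two flows (the index name and the static index.codec setting are illustrative, and ?reopen=true is the proposed option described above, not an existing one):

```
# Today: three calls, with a closed index as a tricky intermediate state
POST /my-index/_close

PUT /my-index/_settings
{ "index": { "codec": "best_compression" } }

POST /my-index/_open

# Proposed: one call that applies the non-dynamic setting
# and restarts the shards in a single cluster state update
PUT /my-index/_settings?reopen=true
{ "index": { "codec": "best_compression" } }
```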

Note that the close-adjust-reopen pattern today allows a sequence of multiple adjustments to happen between the close and reopen steps, and that would not be possible with a one-shot call to the update-index-settings API with ?reopen=true. For index settings this should be OK, but if there are other metadata adjustments to support then we would need an API that allows combining them all into a single transaction.

Recovery of existing indices from snapshots

Users will also close an index in order to overwrite it with a copy from a snapshot. The restore process is triggered by a metadata update which, in a single cluster state update, marks the closed index as open and simultaneously marks all its shards as UNASSIGNED in the routing table, adjusting the recovery source of all the primaries to refer to the snapshot.

In this case we see a little UX value in closing the index first: it prevents users from accidentally overwriting existing indices with an overly-inclusive wildcard in the restore pattern. However, there's no fundamental reason to require the index to be closed first, and we think it would work to prevent accidental overwrites by requiring an explicit list of indices/patterns to overwrite in the restore request instead of specifying them via earlier close-index requests.
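As a sketch of the difference (repository, snapshot, and index names are illustrative, and the "overwrite" field in the second request is hypothetical, just to show where an explicit overwrite list could live):

```
# Today: close first, then restore over the closed index
POST /my-index/_close

POST /_snapshot/my-repo/my-snapshot/_restore
{ "indices": "my-index" }

# Proposed: no prior close; the restore request itself names exactly
# which existing indices it is allowed to overwrite
# (field name is hypothetical)
POST /_snapshot/my-repo/my-snapshot/_restore
{ "indices": "my-index", "overwrite": "my-index" }
```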

Unfollowing a CCR leader[^1]

We must reinitialize everything from cold when converting a CCR follower index into a regular read-write index via the /_ccr/unfollow API. Today we achieve this by requiring the index to be closed when calling this API, but effectively this is a special case of the non-dynamic IndexMetadata update mentioned above and again we think we could add a ?reopen=true option to the API to do the update and reinitialization in a single step.
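Sketched in the same style (the follower index name is illustrative and ?reopen=true is the proposed option, not an existing one):

```
# Today: the follower must be closed before unfollowing, then reopened
POST /follower-index/_close
POST /follower-index/_ccr/unfollow
POST /follower-index/_open

# Proposed: unfollow and reinitialize the shards in one step
POST /follower-index/_ccr/unfollow?reopen=true
```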

Deprecation of close-index API

Once we have implemented alternatives to all the known use-cases for the close index API, we would be able to deprecate it and remove it in some future major release.

[^1]: Edited to add this case 2024-01-09.