Putting index settings in twice causes ES to crash from assertion error

dakrone commented 1 year ago

Elasticsearch Version

8.8.0-SNAPSHOT

Installed Plugins

No response

Java Version

bundled

OS Version

n/a

Problem Description

It's possible to cause an assert failure by putting the same index settings into an index twice.

Steps to Reproduce

When running on the main branch, with ./gradlew run (so that asserts are enabled), it is possible to easily crash Elasticsearch with the following:

PUT /foo

PUT /foo/_settings
{
  "index.auto_expand_replicas": "0-all",
  "index.number_of_replicas": 1
}

PUT /foo/_settings
{
  "index.auto_expand_replicas": "0-all",
  "index.number_of_replicas": 1
}

Putting the same settings in twice causes this assertion to trip:

[2023-04-18T14:35:46,835][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [runTask-0] fatal error in thread [elasticsearch[runTask-0][clusterApplierService#updateTask][T#1]], exiting java.lang.AssertionError: Index updates are expected as index settings version has changed
    at org.elasticsearch.server@8.8.0-SNAPSHOT/org.elasticsearch.index.IndexService.updateMetadata(IndexService.java:806)
    at org.elasticsearch.server@8.8.0-SNAPSHOT/org.elasticsearch.indices.cluster.IndicesClusterStateService.updateIndices(IndicesClusterStateService.java:539)
    at org.elasticsearch.server@8.8.0-SNAPSHOT/org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:248)
    at org.elasticsearch.server@8.8.0-SNAPSHOT/org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:536)
    at org.elasticsearch.server@8.8.0-SNAPSHOT/org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:522)
    at org.elasticsearch.server@8.8.0-SNAPSHOT/org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:495)
    at org.elasticsearch.server@8.8.0-SNAPSHOT/org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:426)
    at org.elasticsearch.server@8.8.0-SNAPSHOT/org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:154)
    at org.elasticsearch.server@8.8.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:916)
    at org.elasticsearch.server@8.8.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:217)
    at org.elasticsearch.server@8.8.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:183)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)

Logs (if relevant)

No response

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-core-infra (Team:Core/Infra)

thecoop commented 1 year ago

The reason this assertion is tripping is because ClusterChangedEvent.indexMetadataChanged uses reference equality test - the identity has changed - but the next bit of code to run in IndexService.updateMetadata uses a value equality test - no settings have actually changed. So the index service gets confused as to why it has actually been called, the version has updated, but no settings have changed.

So the best thing to do here might be to just remove these assertions - its ok for the settings version to increment without anything changing. But this is more distributed's area, so pinging them for a second opinion

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner commented 1 year ago

It has been suggested that we reject attempts to set the number of replicas while auto_expand_replicas is in play: https://github.com/elastic/elasticsearch/issues/27835. I think that's the only way you can hit this. I even opened a PR to show how we could deprecate that behaviour in https://github.com/elastic/elasticsearch/pull/73230. I'd prefer that we at least ignore updates to number_of_replicas in this situation rather than accepting them and then reverting them, and leave the assertion alone, but I would be ok with weakening the assertion in this situation too. I'd rather not remove it tho.

That said, this is definitely a :Core/Infra/Settings question IMO

rjernst commented 1 year ago

@DaveCTurner This doesn't seem to have to do with the specific settings set in this issue, but in general with how the settings differences are calculated, as @thecoop noted. Since the assertions in question live in IndexService, it seems to me to be more of a question for the Distributed team, though this is probably a gray area. Are you comfortable removing the assertions?

andreidan commented 1 year ago

@rjernst @DaveCTurner Would it help if I add another reproduction case here?

Namely, trying to modify a data stream to replace an existing backing index with a downsample index will run into this problem as well (https://github.com/elastic/elasticsearch/pull/91141/files#diff-3525330b076f1366498e871381f6f4a3fe7e0127b55e863227a3b98cf29d7e6cR128 )

The workaround is not great https://github.com/elastic/elasticsearch/pull/91141/files#diff-3525330b076f1366498e871381f6f4a3fe7e0127b55e863227a3b98cf29d7e6cR110

dakrone commented 1 year ago

Is there any update on this? It still reproduces on main.

elastic / elasticsearch