Creating a System Index in a mixed cluster with different mapping versions fail

davidkyle commented 1 month ago

Elasticsearch Version

8.15

Installed Plugins

No response

Java Version

bundled

OS Version

any

Problem Description

Auto creating an internal managed system index fails in a mixed cluster where the nodes have different views of the index mapping version. In this example the .inference index mappings have been updated in the new version. Auto creating the index in a mixed cluster by writing to it sometimes fails with the message:

[auto-create index] failed - system index [.inference] requires all data and master nodes to have mappings versions at least of version [MappingsVersion[version=2, hash=1431251844]]

The message indicates the action is intentional but the failure does not occur every time, sometimes the index is created and other times it is not. If it is intentional to block index creation in a mixed cluster it should do so every time.

SystemIndexMappingUpdateService will update the mappings for the index once all nodes in the cluster have been upgraded. Given that eventually the mappings will be correct is there a reason for blocking the creation in a mixed cluster.

Steps to Reproduce

Any mixed cluster test which create a system index through write to the index could be used. The easiest example I have uses the mixed cluster test suite in the Inference Plugin.

Increment INDEX_MAPPING_VERSION in for the inference secrets index.

https://github.com/elastic/elasticsearch/blob/48ea474953fb365a65dd7f17741ed5d66b6fc269/x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/InferenceSecretsIndex.java#L29

Run the mixed cluster tests with the previous version of Elasticsearch (8.16.0 in this case)
```
./gradlew :x-pack:plugin:inference:qa:mixed-cluster:v8.16.0#javaRestTest
```

The test does not fail every time but over multiple runs it does fail reproducibly. When it fails it will fail with a message like:

[auto-create index] failed - system index [.secrets-inference] requires all data and master nodes to have mappings versions at least of version [MappingsVersion[version=2, hash=-1434574148]]

Logs (if relevant)

No response

elasticsearchmachine commented 1 month ago

Pinging @elastic/es-core-infra (Team:Core/Infra)

arpadkiraly commented 1 month ago

We discussed this with @dan-rubinstein over Slack. I understand this bug is blocking https://github.com/elastic/elasticsearch/issues/111856 which is required to allow customers to configure a chunking strategy which would determine how large blocks of text are broken up before being passed to ML models - this is a feature expected to GA in 8.16.

The Core/Infra team has some reservations against the solution of https://github.com/elastic/elasticsearch/pull/112074 as it exposes system index internals that we would like to avoid. We agreed that @elastic/es-core-infra will pick up the bugfix and deliver it before end of September. If we see any risks with delivering the bugfix by end of September, we need to revisit the solution with the ML team.

ldematte commented 1 month ago

Thanks for the detailed info @davidkyle and @dan-rubinstein If I understood it correctly, the bug is that we do not fail creation consistently when in a mixed cluster, and the fix would be to do that (fail every time) - so that the index is never auto-created with a mixed cluster. That way SystemIndexMappingUpdateService will update the mappings for the index only when all nodes in the cluster have been upgraded. Is that correct?

ldematte commented 1 month ago

Another question: do you have a link to some build scans where you saw this failing?

davidkyle commented 1 month ago

If I understood it correctly, the bug is that we do not fail creation consistently when in a mixed cluster, and the fix would be to do that (fail every time) - so that the index is never auto-created with a mixed cluster.

Correct @ldematte, the expectation is that if auto create fails in a mixed cluster it should fail all the time not some of the time.

@arpadkiraly fixing this issue will not unblock https://github.com/elastic/elasticsearch/pull/112074. The problem there is that we cannot create new inference endpoints in a mixed cluster where the index mappings are not up to date. The inference endpoint configurations are stored in the system index which is why we have to write to it when creating a new inference.

We have 2 failure scenarios:

The index does not already exist 1.1. -> Auto create will fail because the nodes have different opinions of the correct mapping version 1.2 -> Effectively the Inference API cannot be used in this scenario as no configurations can be created
The index exists but the mappings have not been updated 2.1. -> The inference code cannot use the new features that require the latest index mappings 2.2. -> But the old configurations can be used. This is better than scenario 1 as at least some features of the Inference API can be used.

These are the problems we have to solve. In https://github.com/elastic/elasticsearch/pull/112074 the code detects the mixed cluster state and either explicitly creates the index or updates the mappings which addresses both scenarios

davidkyle commented 1 month ago

@arpadkiraly The bug we really need to fix for GA is the NPE thrown after upgrading from a version where the system index did not exist https://github.com/elastic/elasticsearch/issues/112694

ldematte commented 1 month ago

After investigating a bit more, I'm not 100% sure this is a bug. Auto create is a TransportMasterNodeAction, so it will always happen on the master node. Suppose we have 2 nodes, like in some of the failing tests:

test-cluster-0: version=8.16.0 (index mapping v1)
test-cluster-1: version=9.0.0 (index mapping v2)

Now, things change depending on the node that is elected master. In both cases, clusterstate will hold correctly the minimum system index mapping as v1.

If test-cluster-0 gets elected as master, auto-create will try to create the system index with the version it knows (v1); we check against the min, we see it's OK, and create it.

If test-cluster-1 gets elected as master, auto-create will try to create the system index with the version it knows (v2); we check against the min, we see it's not OK, and bail out.

ldematte commented 1 month ago

Or better, I'm not sure the bug is where we thought it is; seems like auto create should create the index with a different mapping version in a mixed scenario, but I'm not sure which are the implications. I will keep digging.

rjernst commented 1 month ago

auto create should create the index with a different mapping version in a mixed scenario

Yes, it should. I had thought we handled that by keeping around the old mapping, but I'm not finding that. In order to create the mappings in a mixed cluster, the master node needs to be aware of not only the mappings from its code, but also the mappings associated with the minimum mappings version in the cluster.

ldematte commented 1 month ago

I had thought we handled that by keeping around the old mapping

I found code that seems to imply this, but it seems to be in the wrong place. I'll keep digging.

ldematte commented 1 month ago

So, turns out we are handling this by using the old mapping, but this mapping has to be specified in the system descriptor for us to be able to find it.

For the example given above, this means that we need to declare the system descriptor in this way in InferencePlugin (or something similar):

    @Override
    public Collection<SystemIndexDescriptor> getSystemIndexDescriptors(Settings settings) {

        var inferenceSecretsDescriptorV1 = SystemIndexDescriptor.builder()
            // all the other .setXxx
            // various stuff that might be identical or change
            .setMappings(InferenceSecretsIndex.mappingsV1()) // this definitely change
            // ....
            .build();

        return List.of(
            ...
            SystemIndexDescriptor.builder()
                .setType(SystemIndexDescriptor.Type.INTERNAL_MANAGED)
                .setIndexPattern(InferenceSecretsIndex.INDEX_PATTERN)
                .setPrimaryIndex(InferenceSecretsIndex.INDEX_NAME)
                .setDescription("Contains inference service secrets")
                .setMappings(InferenceSecretsIndex.mappings()) // The new (v2) mappings
                .setSettings(InferenceSecretsIndex.settings())
                .setVersionMetaKey("version")
                .setOrigin(ClientHelper.INFERENCE_ORIGIN)
                .setPriorSystemIndexDescriptors(List.of(inferenceSecretsDescriptorV1)) // <-- HERE
                .setNetNew()
                .build()
        );
    }

where mappingsV1 is something like

public static XContentBuilder mappingsV1() {
        try {
            return jsonBuilder().startObject()
                .startObject(SINGLE_MAPPING_NAME)
                .startObject("_meta")
                .field(SystemIndexDescriptor.VERSION_META_KEY, 1) // notice the fixed "1" here

This way the 9.0.0 cluster will know about both v1 and v2, and will know how to create v1 when in a mixed cluster (choosing it based on the min cluster version). The logic to do that is already there.

I think we can close this as a non-bug; maybe we could improve the error message to point people in the right direction?

ldematte commented 1 month ago

There is one small bug to fix: handle the case for null alias names in prior descriptors. I will have a PR for that. @rjernst let me know what you think about improving the error message. I was thinking about going from this:

[auto-create index] failed - system index [.secrets-inference] requires all data and master nodes to have mappings versions at least of version [MappingsVersion[version=2, hash=-1434574148]]

to this

[auto-create index] failed - requested creation of system index [.secrets-inference] with version [MappingsVersion[version=2, hash=-1434574148]], which is greater than the cluster minimum index mapping version. Ensure that the system index descriptor for [.secrets-inference] includes a prior definition for version [MappingsVersion[version=1, hash=-1434574148]]

Wdyt?

rjernst commented 1 month ago

+1 to the improved error message

davidkyle commented 1 month ago

.setPriorSystemIndexDescriptors(List.of(inferenceSecretsDescriptorV1)) // <-- HERE

Thanks @ldematte, sorry for the confusion I'm really happen you found the problem.

Once the above is implemented we can safely assume that in a mixed cluster the mappings will be the previous version and auto create won't fail. This is good because we can remove the checks we added to prevent the auto create failure.

ldematte commented 1 month ago

Once the above is implemented we can safely assume that in a mixed cluster the mappings will be the previous version and auto create won't fail. This is good because we can remove the checks we added to prevent the auto create failure.

Yes exactly 👍

davidkyle commented 1 month ago

I opened https://github.com/elastic/elasticsearch/pull/113274 adding some detail to the Javadoc on using prior index descriptors

ldematte commented 1 month ago

@davidkyle I opened https://github.com/elastic/elasticsearch/pull/113278 to improve messaging and cover the case in which an alias is null in a prior descriptor (which was/is the case for inference service secrets "v1")

elastic / elasticsearch