elastic / curator

Curator: Tending your Elasticsearch indices

Edge case in metadata population at IndexList instantiation #1682

untergeek closed this issue 1 year ago

untergeek commented 1 year ago

Expected Behavior

Curator should not freak out and terminate if an index is deleted by another process.

Actual Behavior

2023-06-13 17:03:12,813 ERROR     Failed to complete action: cold2frozen.  <class 'elasticsearch8.NotFoundError'>: NotFoundError(404, 'index_not_found_exception', 'no such index [WAS_THERE_AT_FIRST]', WAS_THERE_AT_FIRST, index_or_alias)

Steps to Reproduce the Problem

An index that is present during IndexList initialization but deleted before the metadata crawl/population phase will yield this error.

Specifications

Detailed Description

No need for more description.

Ideas for remediation

Investigate using any include filters from the action to only pull/populate indices that are likely to be acted on. This will speed up initialization of the IndexList object and reduce the likelihood of this error being triggered.

Catch any suddenly missing index in this or any phase. If an index-not-found error is raised, log what happened and remove that index from the actionable list and/or the current "batch" of a batched operation. Retry the failed action or step with the reduced "batch", or continue to the next index for actions that don't batch.
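As a rough illustration of that second idea, here is a minimal sketch of catching a vanished index and retrying with a reduced batch. The `prune_missing` and `act_on_batch` helpers are hypothetical, not part of Curator; only the `NotFoundError` class and `indices.exists()` call come from the elasticsearch8 client itself.

```python
import logging

from elasticsearch8 import Elasticsearch, NotFoundError


def prune_missing(client: Elasticsearch, batch: list) -> list:
    """Return only the members of ``batch`` that still exist in the cluster."""
    surviving = []
    for name in batch:
        if client.indices.exists(index=name):
            surviving.append(name)
        else:
            logging.warning('Index %s disappeared mid-run; dropping it from the batch', name)
    return surviving


def act_on_batch(client: Elasticsearch, batch: list, action) -> None:
    """Run ``action`` over a batch of index names, retrying once with any
    vanished indices removed if Elasticsearch reports one as not found."""
    try:
        action(batch)
    except NotFoundError as err:
        logging.warning('An index vanished during the action: %s', err)
        remaining = prune_missing(client, batch)
        if remaining:
            action(remaining)  # retry with the reduced batch
```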

earlpotter0 commented 1 year ago

From the current curator run:

2023-06-14 17:03:39,854 DEBUG          curator.indexlist          _get_metadata:179  Getting index metadata
2023-06-14 17:03:39,854 DEBUG          curator.indexlist       empty_list_check:205  Checking for empty list
2023-06-14 17:07:16,076 ERROR                curator.cli      exception_handler:107  Failed to complete action: delete_indices.  <class 'KeyError'>: 'restored-.ds-logs-kubernetes.container_logs-legacy_prod-2023.06.14-000560'
untergeek commented 1 year ago

I think I finally found out what was happening.

Execution flow

__init__()

At instance startup, we call self.__get_indices()

IndexList.__get_indices()

As this method starts, we execute helpers.getters.get_indices()

The index at this time is .ds-logs-REDACTED-2023.07.17-001831:

2023-07-17 16:49:50,132 DEBUG     curator.helpers.getters            get_indices:150  All indices: [..., '.ds-logs-REDACTED-2023.07.17-001831', ...]

After this, self.__build_index_info(index) is called for each index, building out the empty dict structure in advance so we can later merge cluster data without key mismatches. Note that the index is still named .ds-logs-REDACTED-2023.07.17-001831:

2023-07-17 16:49:50,217 DEBUG          curator.indexlist     __build_index_info:83   Building preliminary index metadata for .ds-logs-REDACTED-2023.07.17-001831
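Conceptually, that pre-population step looks something like the following sketch. The field names are placeholders, not Curator's actual index_info schema; the point is only that every known index gets a key up front.

```python
# Illustrative sketch only -- not Curator's actual __build_index_info().
index_info = {}

def build_index_info(index):
    """Seed an empty metadata entry so that later merges of cluster-state
    data can update index_info[index] without raising a KeyError."""
    index_info[index] = {
        'state': None,          # placeholder fields; the real schema is Curator's own
        'age': {},
        'size_in_bytes': None,
    }

for index in ['.ds-logs-REDACTED-2023.07.17-001831']:
    build_index_info(index)
```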

At this point, we call self.get_metadata()

2023-07-17 16:49:50,316 DEBUG          curator.indexlist           get_metadata:203  Getting index metadata
2023-07-17 16:49:50,316 DEBUG          curator.indexlist       empty_list_check:233  Checking for empty list

The code in get_metadata() begins collecting the information for each index that had been found in the cluster (I know, this is why this needs to be changed quickly). It is at this point that the error is encountered:

2023-07-17 16:53:33,337 ERROR                curator.cli      exception_handler:132  Failed to complete action: cold2frozen.  <class 'KeyError'>: 'restored-.ds-logs-REDACTED-2023.07.17-001831'

The Smoking Gun

Note the time between log statements:

2023-07-17 16:49:50,316 DEBUG          curator.indexlist           get_metadata:203  Getting index metadata
...
2023-07-17 16:53:33,337 ERROR                curator.cli      exception_handler:132  Failed to complete action: cold2frozen.  <class 'KeyError'>: 'restored-.ds-logs-REDACTED-2023.07.17-001831'

It took nearly 4 minutes to collect the metadata, and that may not even have been complete when it encountered the KeyError.

The "Why"

The Setup

As Elasticsearch promotes indices from hot to cold, it creates a new index with the prefix restored-. As soon as this process is complete, the old index is deleted, and the old index name becomes an alias that points to the new restored- prefixed index mounted as a searchable snapshot on the cold tier.

Somehow, in the 3 minutes and 43 seconds between metadata harvest start and this failure, an index was promoted to the cold tier.

The Problem

During execution of get_metadata(), it calls self._get_cluster_state(), which in turn calls the elasticsearch-py method cluster.state(index=...). The original index name is passed to this call, but it is now an alias. Elasticsearch returns a subset of the cluster state metadata here:

cluster.state(index=to_csv(data), metric='metadata')['metadata']['indices']

It still returns index metadata here, but the index comes back under a different name from the alias that generated the return.

So we submitted the now-alias .ds-logs-REDACTED-2023.07.17-001831 to cluster.state(), and got the index name restored-.ds-logs-REDACTED-2023.07.17-001831 back.

This is where we broke. Because of this, there is no key restored-.ds-logs-REDACTED-2023.07.17-001831 in self.index_info, which was populated during self.__build_index_info() using the original names. When get_metadata() iterates through the results of the cluster.state() call, it tries to merge the metadata of index restored-.ds-logs-REDACTED-2023.07.17-001831 into self.index_info['restored-.ds-logs-REDACTED-2023.07.17-001831'], which does not exist.

This is why there is a KeyError.
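A simplified illustration of the mismatch, using the names from the logs above (the dictionary shapes are stripped down, not the full cluster-state response):

```python
# Keys seeded by __build_index_info() from the names seen at startup:
index_info = {
    '.ds-logs-REDACTED-2023.07.17-001831': {},
}

# What cluster.state(index=..., metric='metadata')['metadata']['indices']
# effectively hands back after the promotion: metadata keyed by the backing
# index, not by the alias that was submitted.
cluster_state_indices = {
    'restored-.ds-logs-REDACTED-2023.07.17-001831': {'state': 'open'},
}

for name, metadata in cluster_state_indices.items():
    index_info[name].update(metadata)  # KeyError: 'restored-.ds-logs-REDACTED-...'
```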

Proposed Remedy

Alias/Index correction method

Create a method that iterates through self.indices and ensures they are indices and not aliases. If an alias is discovered:

Steps (a minimal sketch follows this list):

  1. Log what we're doing
  2. Get the index info for the provided index name. This will contain any aliases.
  3. Iterate through any aliases found
  4. If it is in self.indices, remove the alias from self.indices (it's not an index)
  5. If it is in self.index_info, remove the ALIAS from self.index_info (as an alias, it doesn't belong there)
  6. Add the actual index to self.indices
  7. Call self.__build_index_info() for the actual index
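Here is one way such a correction method might look. It is only a sketch: the method name `_fix_aliased_entry` is hypothetical, and using `indices.get_alias()` to find the aliases attached to the index is an assumption, not necessarily what the eventual fix does.

```python
import logging


class IndexList:
    # Illustrative fragment only; assumes the existing attributes self.client,
    # self.indices, self.index_info and the self.__build_index_info() method
    # described above.

    def _fix_aliased_entry(self, index):
        """Given a concrete index name, stop tracking any of its aliases as if
        they were indices, and track the index itself instead."""
        logging.info('Correcting alias/index names for %s', index)            # step 1
        # Step 2: fetch the aliases attached to this index.
        aliases = self.client.indices.get_alias(index=index)[index]['aliases']
        for alias in aliases:                                                 # step 3
            if alias in self.indices:
                self.indices.remove(alias)     # step 4: it's not an index
            self.index_info.pop(alias, None)   # step 5: aliases don't belong here
        if index not in self.indices:
            self.indices.append(index)         # step 6: track the actual index
        self.__build_index_info(index)         # step 7: seed its metadata entry
```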

Try/Except in get_metadata()

try:

Merge the cluster state data into self.index_info.

except KeyError as err:

  1. Log that we hit a KeyError
  2. Call the alias/index correction method with the index name from this iteration
  3. Re-attempt: sii = self.index_info[index]
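And a sketch of that guard inside the metadata merge loop, again as a fragment: the loop shape is simplified, and it reuses the hypothetical _fix_aliased_entry() from above rather than Curator's final code.

```python
import logging


class IndexList:
    # Illustrative fragment; self.index_info and the _fix_aliased_entry()
    # sketch above are assumed.

    def _merge_cluster_state(self, cluster_state_indices):
        """Merge cluster-state metadata, guarding against alias/index renames."""
        for index, metadata in cluster_state_indices.items():
            try:
                sii = self.index_info[index]
            except KeyError as err:
                logging.warning('KeyError while merging metadata for %s: %s',
                                index, err)
                self._fix_aliased_entry(index)  # alias/index correction method
                sii = self.index_info[index]    # re-attempt the lookup
            sii.update(metadata)                # merge the cluster state data
```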

Conclusion

This should address the issue raised here.

untergeek commented 1 year ago

Closed by #1685