Curator reindex KeyError randomly

ksemaev commented 4 years ago


When running a reindex, in about 5-7% of cases I get: Failed to complete action: reindex. <type 'exceptions.KeyError'>: 'indices'

Expected Behavior

The task I have is quite simple: I get the names of all indices that are read-only, and then, one by one, for each index:

I create a new index with the needed settings;
I start the reindex task;
I delete the source index.

I do this with AWS Elasticsearch. It works as it should, but sometimes fails.

Actual Behavior

The task usually runs fine, but sometimes (the only correlation I could find is with a large number of running reindex tasks) I get:

2020-01-15 11:11:27,316 INFO      Trying Action ID: 18, "reindex": Reindex ark-r2-node-2020.01.14
2020-01-15 11:11:29,081 ERROR     Failed to complete action: reindex.  <type 'exceptions.KeyError'>: 'indices'

As I have a lot of reindex tasks, I get this error 5-7 times out of 100 (and this is once an hour). There's no pattern; it could be any index at any time. The index could be 30 MB or 40 GB. It always reindexes from a 3-primary, 1-replica index to a 1-primary, 1-replica index.

Specifications

AWS ES 7.1, Curator 5.8.1

Context (Environment)

Can we somehow catch this error and retry? Maybe it happens because the target index for the reindex is still being created, and we could add a delay option? Or maybe there's a way to capture verbose output to determine the cause?

untergeek commented 4 years ago

It's a deep rabbit hole, but in order to troubleshoot that, you'd need to disable log blacklisting in your client config yaml file by setting it to an empty array, e.g.:

logging:
  # Whatever other entries, followed by:
  blacklist: []

After that, you'd collect the logs and watch what happens. My suspicion here is that you'll find the cluster state is not updating rapidly enough, or some similar problem. The "KeyError" exception indicates that the response Curator got back from Elasticsearch did not include a list of indices. What this might imply is that your action steps (which you did not share, so this is a guess) are "completed," but the cluster state hasn't updated to show that the index has been created. Most of the time it has, but in the 5%-7% of cases where the action fails, it hasn't.
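
To make the KeyError concrete, here is a minimal Python sketch. The response dictionaries and both helper functions are hypothetical and are not taken from Curator's internals; they only show how a response without an `indices` key would produce the error seen in the log, and how a defensive lookup would avoid it.

```python
# Hypothetical response shapes for illustration only; these are assumptions,
# not copies of what Curator or Elasticsearch actually return.
complete_response = {"indices": {"ark-r2-node-2020.01.14": {}}}
incomplete_response = {"_shards": {"total": 0, "successful": 0, "failed": 0}}


def index_names(response):
    """Return sorted index names; raises KeyError if 'indices' is absent."""
    return sorted(response["indices"])


def index_names_or_none(response):
    """Defensive variant: return None instead of raising when the key is missing."""
    indices = response.get("indices")
    return None if indices is None else sorted(indices)


print(index_names(complete_response))
print(index_names_or_none(incomplete_response))
try:
    index_names(incomplete_response)
except KeyError as err:
    # Same exception type and key as in the reported log line.
    print("KeyError:", err)
```

If the real cause is a lagging cluster state, the debug logs should show a response of the second kind right before the failure.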

Again, this is just a guess, since I'd need to see the log files to be completely certain. But it is what I suspect, based on what you've shared.

ksemaev commented 4 years ago

Thanks for the response, @untergeek! I will set up the debug process in the next few days, so please do not close the issue. Overall, though, I think it is indeed AWS ES not updating its state, which is why I ask whether it's possible to add the same delay option that we have for forcemerge https://www.elastic.co/guide/en/elasticsearch/client/curator/current/option_delay.html to reindex and all other actions. Or maybe at least catch the error of the target index not existing.
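
As a stopgap outside of Curator itself, the retry-with-delay idea could be sketched roughly like this in Python. `run_with_retry` and its parameters are hypothetical names, not an existing Curator option, and `flaky_reindex` is a stand-in for whatever actually launches the reindex.

```python
import time


def run_with_retry(action, retries=3, delay=5.0, sleep=time.sleep):
    """Call action(); on KeyError, wait `delay` seconds and try again.

    Gives up and re-raises the last KeyError after `retries` attempts.
    `sleep` is injectable so the wait can be stubbed out in tests.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return action()
        except KeyError as err:
            last_error = err
            if attempt < retries:
                sleep(delay)  # give the cluster state time to catch up
    raise last_error


if __name__ == "__main__":
    calls = []

    def flaky_reindex():
        # Hypothetical stand-in: fails once with the reported error, then succeeds.
        calls.append(1)
        if len(calls) == 1:
            raise KeyError("indices")
        return "reindex complete"

    print(run_with_retry(flaky_reindex, retries=3, delay=0.1))
```

Only KeyError is retried here, so any other failure still surfaces immediately.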
