Closed: gbbirkisson closed this issue 3 years ago.
This may be due to new timings in Felix 7.0 because initialization code is a copy/paste of audit log initialization.
No, actually other nodes would eventually see that the repository is initialized and start working. Since they fail, it means the repository really was half-initialized by previous failed initializations and cannot progress any further.
The cluster got into serious trouble earlier and could not complete repository initialization. The reason for the trouble: simultaneous node reconfiguration in the cluster during an update, so only half of the required writes were done.
We could probably make the initialization more robust, but in general that would not help, because updates on an unhealthy cluster are unpredictable.
I'll leave this issue open to investigate; maybe we can eliminate the problematic repository entry service and cache. But under normal conditions this issue should not appear.
Workaround for such cases: restore from a snapshot.
Just had this problem on one of our customer clusters. I was very careful when restarting the cluster, so I am dumbfounded as to how this could have happened. The issue was fixed by deleting the system.scheduler index and then restarting the master nodes again.
This process needs to be more robust, i.e. check whether the index is present before creating it.
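The suggested check-before-create can be sketched as a create-if-missing wrapper. This is a minimal in-memory sketch, not the actual XP or Elasticsearch API; `FakeCluster`, `index_exists`, and `create_index` are invented names for illustration.

```python
class FakeCluster:
    """In-memory stand-in for the cluster's index metadata (hypothetical)."""

    def __init__(self):
        self.indices = set()

    def index_exists(self, name: str) -> bool:
        return name in self.indices

    def create_index(self, name: str) -> None:
        # A real cluster rejects duplicate creation, which is the failure
        # mode described in this thread.
        if name in self.indices:
            raise ValueError(f"index {name} already exists")
        self.indices.add(name)


def ensure_index(cluster: FakeCluster, name: str) -> bool:
    """Create the index only if it is missing; return True if created."""
    if cluster.index_exists(name):
        return False
    cluster.create_index(name)
    return True


cluster = FakeCluster()
print(ensure_index(cluster, "system.scheduler"))  # True: index was created
print(ensure_index(cluster, "system.scheduler"))  # False: already present, no error
```

Calling `ensure_index` twice is safe, whereas calling `create_index` twice raises, which mirrors the non-idempotent initialization being reported.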
Fun part: there is a check whether the index already exists. Is it 7.7.1 or 7.7.0? Is it Ansible? Were all nodes started at the exact same moment? Was it in any way a "rolling update"?
XP version 7.7.1. Not Ansible, but the old setup. Not a rolling update, since HZ did not allow that for this update (7.6.1 to 7.7.1). All nodes were stopped, XP was updated, master nodes were started first, then data nodes, then frontend nodes.
We get this problem with a customer using Ansible as well. New cluster environment: loading old XP 7.4.1 data into version 7.6.1 works fine, but when upgrading to 7.7.1 we get the same "Could not initialize" error as mentioned above, and the cluster gives us status 503 Service Unavailable. The funny thing is, we can actually log in to admin on individual data nodes on port 8080, but then there seems to be no indexed data (e.g. no apps after logging in), despite the data being present before the upgrade to 7.7.1.
@sigdestad and I tried to delete the scheduler repo index using an Elasticsearch command. The command completes, but after restarting the cluster we get the same problem once again.
The question is how the cluster gets into a half-initialized state. It should not be possible under normal conditions. For now the idea is that the index gets created automatically when the first data is inserted (but what tries to insert it?!).
If I change the code to tolerate index existence, various other problems may arise, because it may be an index with incorrect mappings (if it was created automatically). So it is better safe than sorry.
Bottom line: I need logs from the servers to see what created that half-initialized repo in the first place.
One way to get into a half-initialized state is to do a snapshot restore on 7.7 from a snapshot taken on version 7.6 or earlier.
The system repository on 7.6 has no record of the scheduler index, but the index is certainly there (because a snapshot restore does not delete existing indices).
The main fix is, right after the ES snapshot restore, to check for orphan indices (those missing from the snapshot) and delete them.
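The orphan check described above is plain set logic: anything present in the cluster but absent from the restored snapshot is a leftover from before the restore. A minimal sketch; the index names used below are invented for illustration and are not the actual XP index names.

```python
def find_orphans(cluster_indices: set, snapshot_indices: set) -> set:
    """Indices that exist in the cluster but were not part of the snapshot.

    These are leftovers from before the restore (a restore does not delete
    existing indices) and are candidates for deletion.
    """
    return cluster_indices - snapshot_indices


# Hypothetical example: a 7.6-era snapshot restored onto a 7.7 cluster
# that had already created a scheduler index.
in_cluster = {"search-system", "storage-system", "storage-system.scheduler"}
in_snapshot = {"search-system", "storage-system"}  # 7.6 snapshot: no scheduler

print(sorted(find_orphans(in_cluster, in_snapshot)))
# ['storage-system.scheduler']
```

Deleting exactly this set right after the restore would remove the half-initialized scheduler index and let the next startup initialize it cleanly.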
Other tunings:
- Master node fails creating the repo:
- Other nodes are left in limbo: