Closed: thorn126 closed this issue 2 years ago.
The operator relies on the process group list in the FoundationDBClusterStatus to match old IPs to new IPs. What do you see in that status? Is it finding the new pods and associating the new IPs with the process groups?
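For reference, something like the following should dump that part of the status (a sketch; it assumes the FoundationDBCluster resource is named mdm-foundationdb-ibm and lives in the testoperator1 namespace, as in the commands later in this thread):

# Full status of the FoundationDBCluster resource
kubectl get foundationdbcluster mdm-foundationdb-ibm -n testoperator1 -o yaml

# Just the process group IDs and the addresses the operator has recorded for them
kubectl get foundationdbcluster mdm-foundationdb-ibm -n testoperator1 \
  -o jsonpath='{range .status.processGroups[*]}{.processGroupID}{": "}{.addresses}{"\n"}{end}'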
[1000710000@ibm-fdb-controller-manager-589977d9c6-dl2cx /]$ kubectl fdb fix-coordinator-ips -c mdm-foundationdb-ibm -n testoperator1
2022/08/11 03:43:40 Could not find process for coordinator IP 10.254.15.127:4500:tls
2022/08/11 03:43:40 Could not find process for coordinator IP 10.254.23.21:4500:tls
2022/08/11 03:43:40 Could not find process for coordinator IP 10.254.18.78:4500:tls
2022/08/11 03:43:40 New connection string: mdm_foundationdb_ibm:VSbRIVjesEmm4ytD5WDNkiKKLBLPWRdh@10.254.15.127:4500:tls,10.254.23.21:4500:tls,10.254.18.78:4500:tls
status:
  configured: true
  connectionString: mdm_foundationdb_ibm:VSbRIVjesEmm4ytD5WDNkiKKLBLPWRdh@10.254.15.127:4500:tls,10.254.23.21:4500:tls,10.254.18.78:4500:tls
  databaseConfiguration:
    log_routers: -1
    redundancy_mode: double
    remote_logs: -1
    storage_engine: ssd-2
    usable_regions: 1
  generations:
    hasUnhealthyProcess: 9
    missingDatabaseStatus: 9
    needsConfigurationChange: 9
    needsCoordinatorChange: 9
  health: {}
  locks: {}
  needsNewCoordinators: true
  processCounts: {}
  processGroups:
  - addresses:
    - 10.254.15.125
    - 10.254.15.133
    - 10.254.15.139
    processClass: log
    processGroupConditions:
    - timestamp: 1660188781
      type: MissingProcesses
    processGroupID: log-1
  - addresses:
    - 10.254.23.22
    processClass: log
    processGroupConditions:
    - timestamp: 1660188773
      type: MissingProcesses
    processGroupID: log-2
  - addresses:
    - 10.254.18.80
    - 10.254.18.82
    - 10.254.18.84
    processClass: log
    processGroupConditions:
    - timestamp: 1660188781
      type: MissingProcesses
    processGroupID: log-3
  - addresses:
    - 10.254.15.126
    - 10.254.15.134
    - 10.254.15.140
    processClass: log
    processGroupConditions:
    - timestamp: 1660188781
      type: MissingProcesses
    processGroupID: log-4
  - addresses:
    - 10.254.15.123
    - 10.254.15.135
    - 10.254.15.143
    processClass: proxy
    processGroupConditions:
    - timestamp: 1660188781
      type: MissingProcesses
    processGroupID: proxy-1
  - addresses:
    - 10.254.23.20
    processClass: proxy
    processGroupConditions:
    - timestamp: 1660188773
      type: MissingProcesses
    processGroupID: proxy-2
  - addresses:
    - 10.254.15.132
    - 10.254.15.136
    - 10.254.15.142
    processClass: stateless
    processGroupConditions:
    - timestamp: 1660188781
      type: MissingProcesses
    processGroupID: stateless-1
  - addresses:
    - 10.254.15.131
    - 10.254.15.137
    - 10.254.15.141
    processClass: storage
    processGroupConditions:
    - timestamp: 1660188781
      type: MissingProcesses
    processGroupID: storage-1
  - addresses:
    - 10.254.23.23
    - 10.254.23.24
    - 10.254.23.25
    processClass: storage
    processGroupConditions:
    - timestamp: 1660188781
      type: MissingProcesses
    processGroupID: storage-2
  - addresses:
    - 10.254.18.81
    - 10.254.18.83
    - 10.254.18.85
    processClass: storage
    processGroupConditions:
    - timestamp: 1660188781
      type: MissingProcesses
    processGroupID: storage-3
  requiredAddresses:
    tls: true
  runningVersion: 6.2.29
  storageServersPerDisk:
  - 1
oc get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
apple-fdb-controller-manager-66c455f677-n4b25 1/1 Running 0 11m 10.254.15.138 worker2.fdbtest3.cp.fyre.ibm.com <none> <none>
ibm-fdb-controller-manager-589977d9c6-dl2cx 1/1 Running 0 23m 10.254.15.116 worker2.fdbtest3.cp.fyre.ibm.com <none> <none>
mdm-foundationdb-ibm-fdb-backup-agents-77fb5c97cf-b9hhf 1/1 Running 0 16m 10.254.15.128 worker2.fdbtest3.cp.fyre.ibm.com <none> <none>
mdm-foundationdb-ibm-log-1 2/2 Running 0 11m 10.254.15.139 worker2.fdbtest3.cp.fyre.ibm.com <none> <none>
mdm-foundationdb-ibm-log-2 2/2 Running 0 16m 10.254.23.22 worker1.fdbtest3.cp.fyre.ibm.com <none> <none>
mdm-foundationdb-ibm-log-3 2/2 Running 0 11m 10.254.18.84 worker0.fdbtest3.cp.fyre.ibm.com <none> <none>
mdm-foundationdb-ibm-log-4 2/2 Running 0 11m 10.254.15.140 worker2.fdbtest3.cp.fyre.ibm.com <none> <none>
mdm-foundationdb-ibm-proxy-1 2/2 Running 0 11m 10.254.15.143 worker2.fdbtest3.cp.fyre.ibm.com <none> <none>
mdm-foundationdb-ibm-proxy-2 2/2 Running 0 16m 10.254.23.20 worker1.fdbtest3.cp.fyre.ibm.com <none> <none>
mdm-foundationdb-ibm-stateless-1 2/2 Running 0 11m 10.254.15.142 worker2.fdbtest3.cp.fyre.ibm.com <none> <none>
mdm-foundationdb-ibm-storage-1 2/2 Running 0 11m 10.254.15.141 worker2.fdbtest3.cp.fyre.ibm.com <none> <none>
mdm-foundationdb-ibm-storage-2 2/2 Running 0 11m 10.254.23.25 worker1.fdbtest3.cp.fyre.ibm.com <none> <none>
mdm-foundationdb-ibm-storage-3 2/2 Running 0 11m 10.254.18.85 worker0.fdbtest3.cp.fyre.ibm.com <none> <none>
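As a cross-check, here is a sketch (same cluster name and namespace as above) that lists the coordinator IPs embedded in the connection string next to the addresses the operator has recorded for its process groups:

# Coordinator IPs from the connection string stored in the status
kubectl get foundationdbcluster mdm-foundationdb-ibm -n testoperator1 \
  -o jsonpath='{.status.connectionString}' | tr '@,' '\n\n' | tail -n +2

# All addresses the operator has recorded for its process groups
kubectl get foundationdbcluster mdm-foundationdb-ibm -n testoperator1 \
  -o jsonpath='{.status.processGroups[*].addresses}'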
I saw this condition on the process groups:
type: MissingProcesses
@brownleej let me know if you need more debug info. Thank you for looking into this; it is blocking us.
It's interesting that the coordinator IPs are not part of the addresses in the process group status. Do you know what exactly happened to the cluster?
I am not sure @johscheuer .
I saw this: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/docs/manual/customization.md#service-ips
You can choose this option by setting spec.routing.publicIPSource=service. This feature is new, and still experimental, but we plan to make it the default in the future.
It has been a while; I remember seeing it in 0.48. Is it still experimental, or has the doc just not been updated? Could we use it instead of worrying about this fix-coordinator-ips command?
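For completeness, selecting the service-based public IP source looks roughly like this (a sketch based on the linked customization docs; the apiVersion depends on the operator release, and only the relevant part of the spec is shown):

apiVersion: apps.foundationdb.org/v1beta1
kind: FoundationDBCluster
metadata:
  name: mdm-foundationdb-ibm
spec:
  routing:
    # Use a per-pod Service IP as the public address instead of the pod IP,
    # so the advertised address is not tied to a pod's lifetime.
    publicIPSource: service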
I have also reproduced this issue. I have a cluster with double redundancy, and I deleted two of the three storage pods from the GUI. Two new storage pods spawned back, and the PVCs are still there. I then ran the fix-coordinator-ips command manually and got the message that fdb cannot find the process for a coordinator IP; that IP is the old IP of storage-1, whose pod got deleted and now has a new IP. Since that process is gone, it will never be able to find it. The new pod has a new IP and an updated fdbmonitor.conf that uses the new IP. My cluster cannot come back because it is missing this last coordinator, even though 2 out of 3 coordinators are now reachable.
Here is a status json output from my cluster (cat jj.txt):

{
  "client" : {
    "cluster_file" : {
      "path" : "/var/dynamic-conf/fdb.cluster",
      "up_to_date" : true
    },
    "coordinators" : {
      "coordinators" : [
        {
          "address" : "10.254.20.38:4500:tls",
          "reachable" : false
        },
        {
          "address" : "10.254.28.32:4500:tls",
          "reachable" : true
        },
        {
          "address" : "10.254.28.33:4500:tls",
          "reachable" : true
        }
      ],
      "quorum_reachable" : true
    },
    "database_status" : {
      "available" : false,
      "healthy" : false
    },
    "messages" : [
      {
        "description" : "Unable to locate a cluster controller within 2 seconds. Check that there are server processes running.",
        "name" : "no_cluster_controller"
      }
    ],
    "timestamp" : 1662078548
  },
  "cluster" : {
    "layers" : {
      "_valid" : false
    }
  }
}

Basically, fix-coordinator-ips is still trying to figure out where 10.254.20.38:4500:tls should be, but that pod's address changed to something else after the pod got deleted and restarted. So I am not sure why it is still trying to reach it.
And it seems to me that because that pod is down, the cluster controller cannot be reached any more.
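For reference, a sketch of how a status excerpt like the one above can be captured from a pod that is still running. The pod name, container name, and namespace placeholder are assumptions that may need adjusting for this deployment, TLS options may also be required inside the container, and jq is only used to trim the output:

# Dump the machine-readable status from fdbcli inside a surviving pod
kubectl exec -n <namespace> mdm-foundationdb-ibm-storage-2 -c foundationdb -- \
  fdbcli --exec 'status json' > jj.txt

# Show only the coordinator list and whether each coordinator is reachable
jq '.client.coordinators' jj.txt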
I created a follow-up issue for this: https://github.com/FoundationDB/fdb-kubernetes-operator/issues/1351. Once we implement that, we can detect coordinators even if the IP is missing from the ProcessGroup status.
I'm going to close this issue since we have a follow-up and there is currently nothing we can do (except for implementing the referenced feature).
What happened?
When I delete more than half of the pods, it triggers the recovery process and the coordinator IPs change; then I run the fix-coordinator-ips command.
I get a "Could not find process" error; it sounds like the apple operator is trying the old coordinator IPs.
What did you expect to happen?
It should correct the coordinators with the newly created IPs.
How can we reproduce it (as minimally and precisely as possible)?
Originally the coordinator IPs are:
Then delete some pods.
Check the pods now; their IPs have changed.
Now run the fix-coordinator-ips command (see the command after these steps).
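The command for the last step, copied from the session at the top of this issue (the cluster name and namespace are specific to that environment):

kubectl fdb fix-coordinator-ips -c mdm-foundationdb-ibm -n testoperator1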
Anything else we need to know?
No response
FDB Kubernetes operator
Kubernetes version
Cloud provider