FoundationDB / fdb-kubernetes-operator

A kubernetes operator for FoundationDB
Apache License 2.0

Execute fix-coordinator-ips command got “Could not find process for coordinator IP..” error #1314

Closed: thorn126 closed this issue 2 years ago

thorn126 commented 2 years ago

What happened?

When I delete more than half of the pods, it triggers the recovery process and the coordinator IPs change. Then I run:

/usr/local/bin/kubectl fdb fix-coordinator-ips -c mdm-foundationdb-ibm -n testoperator1
2022/08/04 14:34:32 Could not find process for coordinator IP 10.254.14.134:4500:tls
2022/08/04 14:34:32 Could not find process for coordinator IP 10.254.22.126:4500:tls
2022/08/04 14:34:32 Could not find process for coordinator IP 10.254.17.233:4500:tls
2022/08/04 14:34:32 New connection string: mdm_foundationdb_ibm:l17ltLSvQmntzYlkmifsd2X28mLGpj0s@10.254.14.134:4500:tls,10.254.22.126:4500:tls,10.254.17.233:4500:tls

I got the "Could not find process" error; it seems the apple operator is still trying the old coordinator IPs.

What did you expect to happen?

It should correct the coordinators with the newly created IPs.
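
A quick way to verify this (a minimal sketch using the cluster and namespace names from this report; `connectionString` and the pod listing are the same fields shown further down) is to compare the connection string recorded in the cluster status against the current pod IPs:

```console
# Coordinators recorded by the operator are the addresses after the '@' in the connection string
$ kubectl get foundationdbcluster mdm-foundationdb-ibm -n testoperator1 \
    -o jsonpath='{.status.connectionString}{"\n"}'

# Current pod IPs to compare against
$ kubectl get pods -n testoperator1 -o wide
```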

How can we reproduce it (as minimally and precisely as possible)?

Originally, the coordinator IPs are:

  coordinators:
    - 10.254.14.134
    - 10.254.22.126
    - 10.254.17.233
oc get po
NAME                                                      READY   STATUS      RESTARTS   AGE
apple-fdb-controller-manager-66c455f677-9stzd             1/1     Running     0          7h37m
ibm-fdb-controller-manager-589977d9c6-w5tmb               1/1     Running     0          7h41m
mdm-foundationdb-ibm-fdb-backup-agents-6c6cfd48bf-84sdd   1/1     Running     0          7h35m
mdm-foundationdb-ibm-fdb-restore-job-l9m55                0/1     Completed   0          7h36m
mdm-foundationdb-ibm-log-1                                2/2     Running     0          7h37m
mdm-foundationdb-ibm-log-2                                2/2     Running     0          7h37m
mdm-foundationdb-ibm-log-3                                2/2     Running     0          7h37m
mdm-foundationdb-ibm-log-4                                2/2     Running     0          7h37m
mdm-foundationdb-ibm-proxy-1                              2/2     Running     0          7h37m
mdm-foundationdb-ibm-proxy-2                              2/2     Running     0          7h37m
mdm-foundationdb-ibm-stateless-1                          2/2     Running     0          7h37m
mdm-foundationdb-ibm-storage-1                            2/2     Running     0          7h37m
mdm-foundationdb-ibm-storage-2                            2/2     Running     0          7h37m
mdm-foundationdb-ibm-storage-3                            2/2     Running     0          7h37m

Then delete some pods:

oc delete po mdm-foundationdb-ibm-log-1 mdm-foundationdb-ibm-log-2 mdm-foundationdb-ibm-log-3 mdm-foundationdb-ibm-log-4 mdm-foundationdb-ibm-storage-1 mdm-foundationdb-ibm-storage-2 mdm-foundationdb-ibm-storage-3 mdm-foundationdb-ibm-stateless-1 mdm-foundationdb-ibm-proxy-1 mdm-foundationdb-ibm-proxy-2 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "mdm-foundationdb-ibm-log-1" force deleted
pod "mdm-foundationdb-ibm-log-2" force deleted
pod "mdm-foundationdb-ibm-log-3" force deleted
pod "mdm-foundationdb-ibm-log-4" force deleted
pod "mdm-foundationdb-ibm-storage-1" force deleted
pod "mdm-foundationdb-ibm-storage-2" force deleted
pod "mdm-foundationdb-ibm-storage-3" force deleted
pod "mdm-foundationdb-ibm-stateless-1" force deleted
pod "mdm-foundationdb-ibm-proxy-1" force deleted
pod "mdm-foundationdb-ibm-proxy-2" force deleted

Check the pods now; their IPs have changed:

oc get po
NAME                                                      READY   STATUS      RESTARTS   AGE
apple-fdb-controller-manager-66c455f677-9stzd             1/1     Running     0          7h40m
ibm-fdb-controller-manager-589977d9c6-w5tmb               1/1     Running     0          7h44m
mdm-foundationdb-ibm-fdb-backup-agents-6c6cfd48bf-84sdd   1/1     Running     0          7h38m
mdm-foundationdb-ibm-fdb-restore-job-l9m55                0/1     Completed   0          7h39m
mdm-foundationdb-ibm-log-1                                2/2     Running     0          2m
mdm-foundationdb-ibm-log-2                                2/2     Running     0          2m
mdm-foundationdb-ibm-log-3                                2/2     Running     0          2m
mdm-foundationdb-ibm-log-4                                2/2     Running     0          2m
mdm-foundationdb-ibm-proxy-1                              2/2     Running     0          2m
mdm-foundationdb-ibm-proxy-2                              2/2     Running     0          2m
mdm-foundationdb-ibm-stateless-1                          2/2     Running     0          2m
mdm-foundationdb-ibm-storage-1                            2/2     Running     0          2m
mdm-foundationdb-ibm-storage-2                            2/2     Running     0          2m
mdm-foundationdb-ibm-storage-3                            2/2     Running     0          2m

 oc get po -o wide
NAME                                                      READY   STATUS      RESTARTS   AGE     IP              NODE                               NOMINATED NODE   READINESS GATES
apple-fdb-controller-manager-66c455f677-gf2f7             1/1     Running     0          5s      10.254.14.184   worker2.fdbtest3.cp.fyre.ibm.com   <none>           <none>
ibm-fdb-controller-manager-589977d9c6-w5tmb               1/1     Running     0          7h44m   10.254.14.123   worker2.fdbtest3.cp.fyre.ibm.com   <none>           <none>
mdm-foundationdb-ibm-fdb-backup-agents-6c6cfd48bf-84sdd   1/1     Running     0          7h39m   10.254.14.137   worker2.fdbtest3.cp.fyre.ibm.com   <none>           <none>
mdm-foundationdb-ibm-fdb-restore-job-l9m55                0/1     Completed   0          7h39m   10.254.14.136   worker2.fdbtest3.cp.fyre.ibm.com   <none>           <none>
mdm-foundationdb-ibm-log-1                                2/2     Running     0          2m18s   10.254.14.178   worker2.fdbtest3.cp.fyre.ibm.com   <none>           <none>
mdm-foundationdb-ibm-log-2                                2/2     Running     0          2m18s   10.254.22.128   worker1.fdbtest3.cp.fyre.ibm.com   <none>           <none>
mdm-foundationdb-ibm-log-3                                2/2     Running     0          2m18s   10.254.17.234   worker0.fdbtest3.cp.fyre.ibm.com   <none>           <none>
mdm-foundationdb-ibm-log-4                                2/2     Running     0          2m18s   10.254.14.179   worker2.fdbtest3.cp.fyre.ibm.com   <none>           <none>
mdm-foundationdb-ibm-proxy-1                              2/2     Running     0          2m18s   10.254.14.180   worker2.fdbtest3.cp.fyre.ibm.com   <none>           <none>
mdm-foundationdb-ibm-proxy-2                              2/2     Running     0          2m18s   10.254.22.129   worker1.fdbtest3.cp.fyre.ibm.com   <none>           <none>
mdm-foundationdb-ibm-stateless-1                          2/2     Running     0          2m18s   10.254.14.182   worker2.fdbtest3.cp.fyre.ibm.com   <none>           <none>
mdm-foundationdb-ibm-storage-1                            2/2     Running     0          2m18s   10.254.14.181   worker2.fdbtest3.cp.fyre.ibm.com   <none>           <none>
mdm-foundationdb-ibm-storage-2                            2/2     Running     0          2m18s   10.254.22.130   worker1.fdbtest3.cp.fyre.ibm.com   <none>           <none>
mdm-foundationdb-ibm-storage-3                            2/2     Running     0          2m18s   10.254.17.235   worker0.fdbtest3.cp.fyre.ibm.com   <none>           <none>

Now run the fix-coordinator-ips command:

sh-4.4$ /usr/local/bin/kubectl fdb fix-coordinator-ips -c mdm-foundationdb-ibm -n testoperator1
2022/08/04 14:34:32 Could not find process for coordinator IP 10.254.14.134:4500:tls
2022/08/04 14:34:32 Could not find process for coordinator IP 10.254.22.126:4500:tls
2022/08/04 14:34:32 Could not find process for coordinator IP 10.254.17.233:4500:tls
2022/08/04 14:34:32 New connection string: mdm_foundationdb_ibm:l17ltLSvQmntzYlkmifsd2X28mLGpj0s@10.254.14.134:4500:tls,10.254.22.126:4500:tls,10.254.17.233:4500:tls

Anything else we need to know?

No response

FDB Kubernetes operator

```console
$ kubectl fdb version
0.48.0
```

Kubernetes version

```console
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.6", GitCommit:"ad3338546da947756e8a88aa6822e9c11e7eac22", GitTreeState:"clean", BuildDate:"2022-04-14T08:49:13Z", GoVersion:"go1.17.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.5+3afdacb", GitCommit:"3c28e7a79b58e78b4c1dc1ab7e5f6c6c2d3aedd3", GitTreeState:"clean", BuildDate:"2022-05-10T16:30:48Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/amd64"}
```

Cloud provider

OCP 4.10
brownleej commented 2 years ago

The operator relies on the process group list in the FoundationDBClusterStatus to match old IPs to new IPs. What do you see in that status? Is it finding the new pods and associating the new IPs with the process groups?
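
For example, something along these lines should print that mapping (a sketch; the field names match the status output posted below):

```console
$ kubectl get foundationdbcluster mdm-foundationdb-ibm -n testoperator1 \
    -o jsonpath='{range .status.processGroups[*]}{.processGroupID}{": "}{.addresses[*]}{"\n"}{end}'
```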

thorn126 commented 2 years ago
[1000710000@ibm-fdb-controller-manager-589977d9c6-dl2cx /]$ kubectl fdb fix-coordinator-ips -c mdm-foundationdb-ibm -n testoperator1
2022/08/11 03:43:40 Could not find process for coordinator IP 10.254.15.127:4500:tls
2022/08/11 03:43:40 Could not find process for coordinator IP 10.254.23.21:4500:tls
2022/08/11 03:43:40 Could not find process for coordinator IP 10.254.18.78:4500:tls
2022/08/11 03:43:40 New connection string: mdm_foundationdb_ibm:VSbRIVjesEmm4ytD5WDNkiKKLBLPWRdh@10.254.15.127:4500:tls,10.254.23.21:4500:tls,10.254.18.78:4500:tls
status:
  configured: true
  connectionString: mdm_foundationdb_ibm:VSbRIVjesEmm4ytD5WDNkiKKLBLPWRdh@10.254.15.127:4500:tls,10.254.23.21:4500:tls,10.254.18.78:4500:tls
  databaseConfiguration:
    log_routers: -1
    redundancy_mode: double
    remote_logs: -1
    storage_engine: ssd-2
    usable_regions: 1
  generations:
    hasUnhealthyProcess: 9
    missingDatabaseStatus: 9
    needsConfigurationChange: 9
    needsCoordinatorChange: 9
  health: {}
  locks: {}
  needsNewCoordinators: true
  processCounts: {}
  processGroups:
  - addresses:
    - 10.254.15.125
    - 10.254.15.133
    - 10.254.15.139
    processClass: log
    processGroupConditions:
    - timestamp: 1660188781
      type: MissingProcesses
    processGroupID: log-1
  - addresses:
    - 10.254.23.22
    processClass: log
    processGroupConditions:
    - timestamp: 1660188773
      type: MissingProcesses
    processGroupID: log-2
  - addresses:
    - 10.254.18.80
    - 10.254.18.82
    - 10.254.18.84
    processClass: log
    processGroupConditions:
    - timestamp: 1660188781
      type: MissingProcesses
    processGroupID: log-3
  - addresses:
    - 10.254.15.126
    - 10.254.15.134
    - 10.254.15.140
    processClass: log
    processGroupConditions:
    - timestamp: 1660188781
      type: MissingProcesses
    processGroupID: log-4
  - addresses:
    - 10.254.15.123
    - 10.254.15.135
    - 10.254.15.143
    processClass: proxy
    processGroupConditions:
    - timestamp: 1660188781
      type: MissingProcesses
    processGroupID: proxy-1
  - addresses:
    - 10.254.23.20
    processClass: proxy
    processGroupConditions:
    - timestamp: 1660188773
      type: MissingProcesses
    processGroupID: proxy-2
  - addresses:
    - 10.254.15.132
    - 10.254.15.136
    - 10.254.15.142
    processClass: stateless
    processGroupConditions:
    - timestamp: 1660188781
      type: MissingProcesses
    processGroupID: stateless-1
  - addresses:
    - 10.254.15.131
    - 10.254.15.137
    - 10.254.15.141
    processClass: storage
    processGroupConditions:
    - timestamp: 1660188781
      type: MissingProcesses
    processGroupID: storage-1
  - addresses:
    - 10.254.23.23
    - 10.254.23.24
    - 10.254.23.25
    processClass: storage
    processGroupConditions:
    - timestamp: 1660188781
      type: MissingProcesses
    processGroupID: storage-2
  - addresses:
    - 10.254.18.81
    - 10.254.18.83
    - 10.254.18.85
    processClass: storage
    processGroupConditions:
    - timestamp: 1660188781
      type: MissingProcesses
    processGroupID: storage-3
  requiredAddresses:
    tls: true
  runningVersion: 6.2.29
  storageServersPerDisk:
  - 1
 oc get po -o wide
NAME                                                      READY   STATUS    RESTARTS   AGE   IP              NODE                               NOMINATED NODE   READINESS GATES
apple-fdb-controller-manager-66c455f677-n4b25             1/1     Running   0          11m   10.254.15.138   worker2.fdbtest3.cp.fyre.ibm.com   <none>           <none>
ibm-fdb-controller-manager-589977d9c6-dl2cx               1/1     Running   0          23m   10.254.15.116   worker2.fdbtest3.cp.fyre.ibm.com   <none>           <none>
mdm-foundationdb-ibm-fdb-backup-agents-77fb5c97cf-b9hhf   1/1     Running   0          16m   10.254.15.128   worker2.fdbtest3.cp.fyre.ibm.com   <none>           <none>
mdm-foundationdb-ibm-log-1                                2/2     Running   0          11m   10.254.15.139   worker2.fdbtest3.cp.fyre.ibm.com   <none>           <none>
mdm-foundationdb-ibm-log-2                                2/2     Running   0          16m   10.254.23.22    worker1.fdbtest3.cp.fyre.ibm.com   <none>           <none>
mdm-foundationdb-ibm-log-3                                2/2     Running   0          11m   10.254.18.84    worker0.fdbtest3.cp.fyre.ibm.com   <none>           <none>
mdm-foundationdb-ibm-log-4                                2/2     Running   0          11m   10.254.15.140   worker2.fdbtest3.cp.fyre.ibm.com   <none>           <none>
mdm-foundationdb-ibm-proxy-1                              2/2     Running   0          11m   10.254.15.143   worker2.fdbtest3.cp.fyre.ibm.com   <none>           <none>
mdm-foundationdb-ibm-proxy-2                              2/2     Running   0          16m   10.254.23.20    worker1.fdbtest3.cp.fyre.ibm.com   <none>           <none>
mdm-foundationdb-ibm-stateless-1                          2/2     Running   0          11m   10.254.15.142   worker2.fdbtest3.cp.fyre.ibm.com   <none>           <none>
mdm-foundationdb-ibm-storage-1                            2/2     Running   0          11m   10.254.15.141   worker2.fdbtest3.cp.fyre.ibm.com   <none>           <none>
mdm-foundationdb-ibm-storage-2                            2/2     Running   0          11m   10.254.23.25    worker1.fdbtest3.cp.fyre.ibm.com   <none>           <none>
mdm-foundationdb-ibm-storage-3                            2/2     Running   0          11m   10.254.18.85    worker0.fdbtest3.cp.fyre.ibm.com   <none>           <none>

I saw this `type: MissingProcesses` condition on all of the process groups.

thorn126 commented 2 years ago

@brownleej Let me know if you need more debug info. Thank you for looking into this; it is blocking us.

johscheuer commented 2 years ago

It's interesting that the coordinator IPs are not part of the addresses in the process group status. Do you know what exactly happened to the cluster?

thorn126 commented 2 years ago

I am not sure, @johscheuer. I saw this in https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/docs/manual/customization.md#service-ips: "You can choose this option by setting spec.routing.publicIPSource=service. This feature is new, and still experimental, but we plan to make it the default in the future." It has been a while, but I remember seeing it in 0.48. Is it still experimental, or has the doc just not been updated? Could we use that instead of worrying about this fix-coordinator-ips command?
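
For example, I assume it would be enabled with something like the following (a hypothetical sketch based only on the field named in those docs; please confirm whether it is safe to change on an existing cluster):

```console
$ kubectl patch foundationdbcluster mdm-foundationdb-ibm -n testoperator1 --type merge \
    -p '{"spec":{"routing":{"publicIPSource":"service"}}}'
```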

tangerine-tt commented 2 years ago

I have also reproduced this issue. I have a cluster with double redundancy, and I deleted two of the three storage pods from the GUI. Two new storage pods spawned back, and the PVCs are still there. When I then ran fix-coordinators manually, I got the message that fdb cannot find the process for the old storage-1 IP; that pod was deleted and now has a new IP. Since that process is gone, it will never be able to find it. The new pod has a new IP and an updated fdbmonitor.conf that uses the new IP. My cluster cannot come back because it is missing this last coordinator, even though 2 out of 3 coordinators are now reachable.

tangerine-tt commented 2 years ago

Here is a status JSON output from my cluster:

cat jj.txt
{
  "client" : {
    "cluster_file" : {
      "path" : "/var/dynamic-conf/fdb.cluster",
      "up_to_date" : true
    },
    "coordinators" : {
      "coordinators" : [
        {
          "address" : "10.254.20.38:4500:tls",
          "reachable" : false
        },
        {
          "address" : "10.254.28.32:4500:tls",
          "reachable" : true
        },
        {
          "address" : "10.254.28.33:4500:tls",
          "reachable" : true
        }
      ],
      "quorum_reachable" : true
    },
    "database_status" : {
      "available" : false,
      "healthy" : false
    },
    "messages" : [
      {
        "description" : "Unable to locate a cluster controller within 2 seconds. Check that there are server processes running.",
        "name" : "no_cluster_controller"
      }
    ],
    "timestamp" : 1662078548
  },
  "cluster" : {
    "layers" : {
      "_valid" : false
    }
  }
}

Basically, fix-coordinators is still trying to figure out where 10.254.20.38:4500:tls should be, but that pod's address changed to something else after the pod was deleted and restarted. So I am not sure why it is still trying to reach it.
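
A status document like the one above can be pulled from any running fdb pod with something like the following sketch (the pod, container, and namespace names are taken from the earlier listing in this issue and are assumptions for my cluster; the cluster file path matches the one shown in the output, and TLS settings are assumed to come from the container's environment):

```console
$ kubectl exec -n testoperator1 mdm-foundationdb-ibm-storage-2 -c foundationdb -- \
    fdbcli -C /var/dynamic-conf/fdb.cluster --exec 'status json'
```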

tangerine-tt commented 2 years ago

And it seems to me that because that pod is down, the cluster controller cannot be reached any more.

johscheuer commented 2 years ago

I created a follow-up issue for this: https://github.com/FoundationDB/fdb-kubernetes-operator/issues/1351. Once we implement that, we can detect coordinators even if the IP is missing in the ProcessGroup status.

I'm going to close this issue since we have a follow-up and there is currently nothing we can do (except for implementing the referenced feature).