SovereignCloudStack / standards

SCS standards in a machine readable format
https://scs.community/

[Other] Yaook SCS cluster debugging #556

Closed: cah-hbaum closed this issue 6 days ago

cah-hbaum commented 2 months ago

This issue collects information/problems/data about debugging and working with the Yaook SCS cluster. It will be closed when the parent issue is closed.

See #426

anjastrunk commented 2 months ago

I suggest logging/fixing each bug/problem in a separate issue, as done in #557, and listing these issues in #426 under the section "bug fixing". @cah-hbaum What do you think?

cah-hbaum commented 2 months ago

No, I think that would be too much overhead for no gain. I would just log everything here in separate comments and link issues or similar things if they're created externally.

I could also have done this in the separate issues already available for each standard, but most (or rather, all) bugs and problems are cluster-related and not specific to a standard.

cah-hbaum commented 2 months ago

08-04-2024 The virtualized Yaook cluster broke over the weekend. The exact reason isn't really known, but the problem was that multiple OpenStack volumes managed by the OpenStack Cinder CSI driver weren't being detached correctly; they would just hang around indefinitely. Since our OpenStack policy doesn't allow users to reset volume states, I would have needed to involve our operations team. The problem could have stemmed from the fact that one of the worker nodes wasn't in a ready state, so the Ceph instance couldn't run on it, which probably prevented the volumes from being detached.
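For reference, a minimal sketch of how stuck volumes like these can be inspected and, with sufficient privileges, reset; the volume ID is a placeholder, and the state reset is exactly the admin-only operation our policy reserves for the operations team:

```bash
# List volumes stuck in a transient state (e.g. "detaching").
openstack volume list --status detaching

# Inspect where a stuck volume is still attached.
openstack volume show <volume-id> -c attachments

# Admin-only: force the state back to "available" so the volume can
# be detached or deleted; our policy blocks this for regular users.
openstack volume set --state available <volume-id>
```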

I tried to reset the Kubernetes cluster with yaook/k8s; this ejected the worker node, since the process failed because of problems with two of the master nodes and couldn't finish rejoining the previously bad worker. The master nodes had problems connecting to various Debian repositories, probably because of high resource usage on the nodes.

After losing a second master node, I decided to just reset the cluster completely, meaning deleting all resources and setting up a new cluster.

cah-hbaum commented 2 months ago

Had some problems with the new cluster: images seemingly couldn't be uploaded, neither from local files nor from a linked location. This turned out to be a problem with Glance and its secret containing the connection information for Ceph. The secret wasn't copied correctly into the other namespace, resulting in an incorrect key being distributed to Glance, which then couldn't access Ceph.
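A sketch of how such a secret mismatch can be checked, assuming the Ceph key lives in a secret that is copied between namespaces; the namespace, secret, and key names here are assumptions, not the actual Yaook names:

```bash
# Compare the Ceph key in the source and target namespaces; if the
# two outputs differ, the copy went wrong and Glance cannot
# authenticate against Ceph.
kubectl -n rook-ceph get secret glance-ceph-key -o jsonpath='{.data.key}' | base64 -d; echo
kubectl -n yaook get secret glance-ceph-key -o jsonpath='{.data.key}' | base64 -d; echo
```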

cah-hbaum commented 2 months ago

The problems are fixed for now (I already applied the fixes on Friday). They initially seemed to come from incorrectly created roles for the neutron-ovn-operator. After I fixed those manually, the ovnagents turned out to be the problem: they were created without the status key, because it wasn't available in the CRD. I needed to manually update the CRD and fix the ovnagents. After that was done, the cluster was running correctly.
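A sketch of how the missing status key can be confirmed and fixed; the CRD name/group and the manifest path are assumptions about the neutron-ovn-operator's resources:

```bash
# Check whether the ovnagents CRD declares a status subresource;
# empty output means .status is not served for these objects.
kubectl get crd ovnagents.network.yaook.cloud \
  -o jsonpath='{.spec.versions[*].subresources}'

# Apply the updated CRD manifest and verify the objects come up.
kubectl apply -f updated-ovnagents-crd.yaml
kubectl get ovnagents.network.yaook.cloud -A
```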

cah-hbaum commented 1 month ago

Addendum from last week (~15.05.2024):

I tried to set up yaook/k8s in order to test the Kubernetes standards on an independent cluster that isn't used by an overlying setup like yaook/operator.

To do this, I updated my existing yaook/k8s git repository and pulled the latest available version. This version was released after the so-called core-split, which essentially reworked the structure of the repository as well as the cluster build processes.

With this new version, everything went smoothly until the calico-apiserver was supposed to come up. This wasn't possible because the NoSchedule taints were not removed from the worker nodes. I couldn't find a reason why this was the case, so I removed them manually, which allowed the setup process to finish. This setup was then tested via the test script, which went through without problems.
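For the record, a sketch of the manual taint cleanup; the node name and taint key are placeholders for whatever the cluster actually set:

```bash
# Show which taints are present on each node.
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

# Remove a NoSchedule taint by key from a worker node.
kubectl taint nodes <worker-node> <taint-key>:NoSchedule-
```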

cah-hbaum commented 4 weeks ago

Tried to set up the yaook/k8s cluster again last week, since the nodes wouldn't come out of the NotInitialized state. With the help of a colleague, I found out that this is probably a problem with the OpenstackCloudController component; still, it didn't throw any obvious errors. I'm going to investigate further this week.
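Context for the NotInitialized state: with an external cloud provider, new nodes carry the `node.cloudprovider.kubernetes.io/uninitialized` taint until the cloud controller manager initializes them, so its logs are the first place to look. The namespace and label selector below are assumptions about how the controller is deployed here:

```bash
# Check whether the node is still waiting for cloud-provider init.
kubectl describe node <node> | grep -A 3 'Taints:'

# Tail the OpenStack cloud controller manager logs for errors;
# namespace and label selector are assumptions for this deployment.
kubectl -n kube-system logs -l k8s-app=openstack-cloud-controller-manager --tail=100
```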

cah-hbaum commented 6 days ago

I'm closing this issue; the clusters have been stable for quite some time now, and other problems are reported in other issues.