@DavidFair @meoflynn
Do you have any docs for your deployments on STFC Cloud that you could contribute to the project? Would be much appreciated.
We are trying to make a code base that is portable across all the available clouds.
Our deployment works on the Arcus OpenStack system at Cambridge. It sometimes works on the Somerville OpenStack system at Edinburgh, but no one knows why. https://github.com/lsst-uk/somerville-operations/issues/144
Several people have contributed suggestions, including a couple of people from StackHPC, but we haven't managed to identify the cause yet.
We are not experts on OpenStack or Cluster API. We are developers trying to build our services on top of Kubernetes, so we don't have a lot of time to dedicate to debugging this. It would be really useful to have some unit tests and health diagnostics, written by OpenStack and Cluster API experts, that we could run on a platform to help diagnose what is going wrong.
Note - the problem is not a fault with the capi-helm-charts themselves. The assumption is that the underlying cause lies with the configuration of the OpenStack platform. The problem is that the Helm charts don't provide any feedback about what is wrong.
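For illustration, this is the kind of feedback we end up digging for by hand. A minimal sketch, assuming a standard clusterctl-style install of Cluster API and the OpenStack provider (CAPO) on the management cluster; `demo` and `default` are placeholders for the workload cluster name and namespace:

```bash
# Overall provisioning state of the cluster and its machines:
kubectl -n default get clusters,machinedeployments,machines -o wide

# Conditions often record the first OpenStack-side failure:
kubectl -n default describe cluster demo
kubectl -n default describe openstackcluster demo

# Recent events (newest last) frequently name the failing resource:
kubectl -n default get events --sort-by=.lastTimestamp
```

Something like this, packaged as a script with a pass/fail summary, would save a lot of guesswork.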
Within the Azimuth context, we do have these docs, which help debug cloud issues: https://stackhpc.github.io/azimuth-config/debugging/kubernetes/#zenith-service-issues
I am curious whether you think they help capi-helm-chart users or not. The above somewhat assumes the management cluster is deployed using our regular scripts that set up centralized logging, monitoring and alerting, to help pinpoint the issues: https://github.com/stackhpc/ansible-collection-azimuth-ops/blob/main/playbooks/provision_capi_mgmt.yml https://github.com/stackhpc/azimuth-config/tree/main/environments/capi-mgmt-example
We have been talking about separating out the CAPI Helm chart bits in there, in particular to help Magnum users and standalone users, although we don't have that work scheduled (or funded) right now. Obviously suggestions and contributions are very welcome.
When these charts work, they are fine. When they don't work, the user is left in a world of hurt, rummaging around in a complex system trying to find clues as to what has gone wrong.
For a production system intended for others to use, these charts should come with a set of unit tests and debug tools that can be run on a cluster to check all the components are present and correct.
A simple library of `kubectl` commands to look for obvious things would be a useful start. So rather than posting suggestions in a Slack channel, StackHPC can point people at a page of common debug queries they can copy and paste.

These scripts should include `kubectl` queries to resolve the random Pod names so that they 'just work' without requiring the user to resolve the names manually (see the sketch below).

These will be useful for people trying to port the Helm charts to new platforms, and for system admins trying to verify that their OpenStack platform is working correctly.
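As a starting point, here is a rough sketch of what such name-independent queries might look like. The namespaces, Deployment names and label below are the clusterctl defaults and are assumptions; they may differ on a given management cluster (check with `kubectl get pods -A --show-labels`):

```bash
# 'kubectl logs deployment/...' resolves the random pod suffix itself:
kubectl -n capi-system logs deployment/capi-controller-manager --tail=100
kubectl -n capo-system logs deployment/capo-controller-manager --tail=100

# Alternatively, resolve the pod name via a label selector
# (label assumed from a default CAPO install):
POD=$(kubectl -n capo-system get pods \
      -l control-plane=capo-controller-manager \
      -o jsonpath='{.items[0].metadata.name}')
kubectl -n capo-system logs "$POD" --tail=100
```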