BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team (this includes work for: 1. Platform Experience, 2. Developer Experience, 3. Platform Operations/OCP 3)
Apache License 2.0

Test kdump in a lab cluster #4843

StevenBarre closed this issue 1 month ago

StevenBarre commented 4 months ago

Describe the issue
When a kernel panic happens (rarely) and causes a node to reboot, it would be helpful if a dump of the kernel could be captured for analysis by Red Hat.

What is the Value/Impact?
Improved debugging ability

What is the plan? How will this get completed?
Kdump on OCP docs: https://docs.openshift.com/container-platform/4.13/support/troubleshooting/troubleshooting-operating-system-issues.html

General kdump docs: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html/managing_monitoring_and_updating_the_kernel/configuring-kdump-on-the-command-line_managing-monitoring-and-updating-the-kernel
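Per the RHEL documentation linked above, dumping over SSH is configured in /etc/kdump.conf on each node. A minimal sketch, where the `kdump` account name and UTL server hostname are placeholders, not the team's actual values:

```
ssh kdump@utl.example.gov.bc.ca
sshkey /root/.ssh/kdump_id_ed25519
path /data/kdump
core_collector makedumpfile -F -l --message-level 7 -d 31
```

Note that `-F` (flattened format) is required by makedumpfile when the dump is sent over SSH.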

Identify any dependencies
None

Definition of done

tbaker1313 commented 4 months ago

Have tested a POC to enable this in KLAB; working as expected.

vivekratan88 commented 3 months ago

Shadowed Tim for this.

tbaker1313 commented 3 months ago

Need to update the platform-ops repo as well as the DXC GitHub Ansible role that configures the cluster. We will also need to create a new user along with its private/public key pair; the public key then needs to be added to that user's .ssh/authorized_keys file.
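A rough sketch of that key setup. The account name `kdump` and all file locations are assumptions, and the `useradd` step is shown as a comment since it requires root on the UTL server:

```shell
# Hypothetical setup for a dedicated dump-transfer account.
# On the UTL server (as root), an admin would first create the account:
#   useradd -m kdump

workdir=$(mktemp -d)    # stand-in for the node / server filesystems

# Generate a dedicated keypair with no passphrase, since kdump must be
# able to connect unattended at panic time:
ssh-keygen -t ed25519 -N '' -q -f "$workdir/kdump_id_ed25519" -C 'kdump transfer'

# Key-based auth works by appending the PUBLIC key to the remote account's
# ~/.ssh/authorized_keys on the UTL server (known_hosts only pins host keys):
cat "$workdir/kdump_id_ed25519.pub" >> "$workdir/authorized_keys"
```

The private key would then be distributed to the nodes via the machineconfig, with the value kept in ansible-vault rather than committed in the clear.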

vivekratan88 commented 3 months ago

Went on a call with Tim; it seems a new user with its own set of keys needs to be created, with the public key placed in the correct SSH file.

tbaker1313 commented 2 months ago

Progress is being made towards completing this next week.

tbaker1313 commented 2 months ago

Rolled out to KLAB, but have not yet completed an action plan or playbook to move this to production.

wmhutchison commented 2 months ago

Reviewing current state of things.

At present, KLAB has a Proof of Concept implementation made via a manually-created machineconfig policy titled 99-worker-kdump which is using a team member's SSH keypair to connect to the KLAB UTL server.

Proposed movement forward:

  1. Create a new branch/PR for the platform-ops repo to store required changes and apply commits there while working on this in KLAB.
  2. Review internal password credentials for a more generic SSH account to use versus a team member's SSH keypair.
  3. Inject required secure values and similar variables into ansible vault files and other inventory files as needed.
  4. Add, via templates, suitable new machineconfig resources for master and worker nodes, populating values from ansible-vault etc. as needed. Make sure the SSH account used has a suitable directory under /data/ to write to, versus the default home directory.
  5. Test as needed in KLAB by re-running the machineconfig role from the PR/branch.
  6. Promote PR when testing is complete for review/approval.
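The template/vault split in steps 3-4 might look roughly like this in the machineconfig role. Task names, file names, and the destination path are all assumptions; the SSH private key value would live in ansible-vault and be interpolated by the template:

```yaml
# Hypothetical tasks; real names in platform-ops will differ.
- name: Render kdump MachineConfig manifests from templates
  template:
    src: "{{ item }}.yaml.j2"     # fills in vaulted key, UTL host, /data path
    dest: "/tmp/{{ item }}.yaml"
  loop:
    - 99-master-kdump
    - 99-worker-kdump

- name: Apply kdump MachineConfigs to the cluster
  command: oc apply -f "/tmp/{{ item }}.yaml"
  loop:
    - 99-master-kdump
    - 99-worker-kdump
```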

wmhutchison commented 2 months ago

Also thinking an additional requirement is a daily AAP-driven check on all clusters for any new kdumps. This ensures that an actual kdump event on LAB VMs doesn't get missed, since smaller nodes/VMs may well restart cleanly without triggering anything else.

For this we'd have a directory structure for the kdumps so that new dump files go into a "new" subfolder. As we investigate, we move files out of that folder so the check stops flagging them, without removing the dump file entirely.
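The triage flow described above can be sketched like this, using a temp dir to illustrate; the `reviewed` folder name is an assumption (real paths would live under /data/kdump):

```shell
# Sketch of the proposed triage layout in a throwaway directory.
base=$(mktemp -d)
mkdir -p "$base/new" "$base/reviewed"

# kdump writes fresh dumps into new/ (simulated here):
touch "$base/new/127.0.0.1-2024-01-01-vmcore"

# The daily check only alerts on files still sitting in new/:
find "$base/new" -type f

# Once investigated, move the dump out of new/ so the check goes quiet
# without deleting the dump itself:
mv "$base/new/127.0.0.1-2024-01-01-vmcore" "$base/reviewed/"
find "$base/new" -type f | wc -l    # 0 files still pending
```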

wmhutchison commented 2 months ago

Will start work next week on pouring the content found for this into a suitable branch/PR, once the status of SILVER Trident is confirmed; that gets top priority next week in terms of resourcing as required.

wmhutchison commented 2 months ago

About ready to start carving out PR content; just fleshing out what goes where for variables (regular inventory versus protected in ansible-vault) and what will be a regular file versus a template.

tbaker1313 commented 1 month ago

Started a possible PR for this. Still need to fix up the variables, as I'm certain they aren't done correctly, but as placeholders they will work until I can figure it out. New branch is https://github.com/bcgov-c/platform-ops/tree/node-kdump

The files section of the 99-worker and 99-master files will need to be adjusted to use the source files. I'm a little unfamiliar with how these files are declared, so I will research what Ansible wants in the source fields.

tbaker1313 commented 1 month ago

Fixed some commits done with the wrong username due to a configuration mistake.

tbaker1313 commented 1 month ago

One possible way of monitoring the /data/kdump directory is using a cronjob like this:

/usr/bin/find /data/kdump -type f | xargs -r echo | mail -s "Kdump files found in /data/kdump on $(hostname -s)" blah@dxcas.com 2> /dev/null

tbaker1313 commented 1 month ago

Updated klab2 and emerald vault files.

tbaker1313 commented 1 month ago

Added some directions for how to test in the RHCOS README.md file.

tbaker1313 commented 1 month ago

Tested and it was successful; however, a small change to how the dump permissions are managed caused one node to enter a degraded state, and we haven't been able to fix it. Opened RH Case 03898782 to investigate further.

tbaker1313 commented 1 month ago

Another possible way to create a script to mail about kdumps is:

for entry in $(/usr/bin/find /data/kdump -type f); do echo "$entry"; done | mail -s "test from $(hostname -s)" name@dxcas.com

tbaker1313 commented 1 month ago

Completed in https://github.com/bcgov-c/platform-ops/pull/500. Opening a new ticket to deal with permissions and to notify about any kdump files.