BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team (this includes work for: 1. Platform Experience, 2. Developer Experience, 3. Platform Operations/OCP 3)
Apache License 2.0

Test kdump in a lab cluster #4843

StevenBarre closed this issue 1 month ago

StevenBarre commented 4 months ago

Describe the issue
When a kernel panic happens (rarely) and causes a node to reboot, it would be helpful if a dump of the kernel could be captured for analysis by Red Hat.

What is the Value/Impact?
Improved debugging ability

What is the plan? How will this get completed?
Kdump on OCP docs: https://docs.openshift.com/container-platform/4.13/support/troubleshooting/troubleshooting-operating-system-issues.html

General kdump docs: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html/managing_monitoring_and_updating_the_kernel/configuring-kdump-on-the-command-line_managing-monitoring-and-updating-the-kernel
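Per the RHEL documentation linked above, dumping over SSH is configured in /etc/kdump.conf on each node. A minimal sketch, where the `kdump` account name and UTL server hostname are placeholders, not the team's actual values:

```
ssh kdump@utl.example.gov.bc.ca
sshkey /root/.ssh/kdump_id_ed25519
path /data/kdump
core_collector makedumpfile -F -l --message-level 7 -d 31
```

Note that `-F` (flattened format) is required by makedumpfile when the dump is sent over SSH.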

Identify any dependencies
None

Definition of done

tbaker1313 commented 4 months ago

Have tested a POC to enable this in KLAB; working as expected.

vivekratan88 commented 3 months ago

Shadowed Tim for this.

tbaker1313 commented 3 months ago

Need to update the platform-ops repo as well as the DXC GitHub Ansible role that configures the cluster. We will also need to create a new user along with its private/public key pair; the public key then needs to be added to that user's .ssh/authorized_keys file.
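A rough sketch of that key setup. The account name `kdump` and all file locations are assumptions, and the `useradd` step is shown as a comment since it requires root on the UTL server:

```shell
# Hypothetical setup for a dedicated dump-transfer account.
# On the UTL server (as root), an admin would first create the account:
#   useradd -m kdump

workdir=$(mktemp -d)    # stand-in for the node / server filesystems

# Generate a dedicated keypair with no passphrase, since kdump must be
# able to connect unattended at panic time:
ssh-keygen -t ed25519 -N '' -q -f "$workdir/kdump_id_ed25519" -C 'kdump transfer'

# Key-based auth works by appending the PUBLIC key to the remote account's
# ~/.ssh/authorized_keys on the UTL server (known_hosts only pins host keys):
cat "$workdir/kdump_id_ed25519.pub" >> "$workdir/authorized_keys"
```

The private key would then be distributed to the nodes via the machineconfig, with the value kept in ansible-vault rather than committed in the clear.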

vivekratan88 commented 3 months ago

Went on a call with Tim; it seems a new user with its own set of keys needs to be created, with the public key placed in the correct SSH file.

tbaker1313 commented 2 months ago

Progress is being made towards completing this next week.

tbaker1313 commented 2 months ago

Rolled out to KLAB, but have not yet completed an action plan or playbook to move this to production.

wmhutchison commented 2 months ago

Reviewing current state of things.

At present, KLAB has a Proof of Concept implementation made via a manually-created machineconfig policy titled 99-worker-kdump which is using a team member's SSH keypair to connect to the KLAB UTL server.

Proposed movement forward:

  1. Create a new branch/PR for the platform-ops repo to store required changes and apply commits there while working on this in KLAB.
  2. Review internal password credentials for a more generic SSH account to use versus a team member's SSH keypair.
  3. Inject required secure values and similar variables into ansible vault files and other inventory files as needed.
  4. Add, via templates, suitable new machineconfig resources for master and worker nodes, populating values from ansible-vault etc. as needed. Make sure the SSH account used has a suitable directory under /data/ to write to, versus the default home directory.
  5. Test as needed in KLAB by re-running the machineconfig role from the PR/branch.
  6. Promote PR when testing is complete for review/approval.
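The template/vault split in steps 3-4 might look roughly like this in the machineconfig role. Task names, file names, and the destination path are all assumptions; the SSH private key value would live in ansible-vault and be interpolated by the template:

```yaml
# Hypothetical tasks; real names in platform-ops will differ.
- name: Render kdump MachineConfig manifests from templates
  template:
    src: "{{ item }}.yaml.j2"     # fills in vaulted key, UTL host, /data path
    dest: "/tmp/{{ item }}.yaml"
  loop:
    - 99-master-kdump
    - 99-worker-kdump

- name: Apply kdump MachineConfigs to the cluster
  command: oc apply -f "/tmp/{{ item }}.yaml"
  loop:
    - 99-master-kdump
    - 99-worker-kdump
```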

wmhutchison commented 2 months ago

Also thinking an additional requirement is a daily AAP-driven check on all clusters for any new kdumps. This ensures that an actual kdump event on LAB VMs doesn't get missed, since smaller nodes/VMs may well restart cleanly without triggering anything else.

For this we'd have a directory structure for the kdumps so that new dump files go into a "new" subfolder. As we investigate, we move files out of that folder so the check stops flagging them, without removing the dump file entirely.
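The triage flow described above can be sketched like this, using a temp dir to illustrate; the `reviewed` folder name is an assumption (real paths would live under /data/kdump):

```shell
# Sketch of the proposed triage layout in a throwaway directory.
base=$(mktemp -d)
mkdir -p "$base/new" "$base/reviewed"

# kdump writes fresh dumps into new/ (simulated here):
touch "$base/new/127.0.0.1-2024-01-01-vmcore"

# The daily check only alerts on files still sitting in new/:
find "$base/new" -type f

# Once investigated, move the dump out of new/ so the check goes quiet
# without deleting the dump itself:
mv "$base/new/127.0.0.1-2024-01-01-vmcore" "$base/reviewed/"
find "$base/new" -type f | wc -l    # 0 files still pending
```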

wmhutchison commented 2 months ago

Will start work next week on pouring the content found for this into a suitable branch/PR, once the status of SILVER Trident is confirmed; that gets top priority next week in terms of resourcing as required.

wmhutchison commented 2 months ago

About ready to start carving out PR content; just fleshing out what goes where for variables (regular inventory versus protected in ansible-vault) and what will be a regular file versus a template.

tbaker1313 commented 1 month ago

Started a possible PR for this. Still need to fix up the variables, as I'm certain they aren't done correctly, but as placeholders they will work until I can figure it out. New branch is https://github.com/bcgov-c/platform-ops/tree/node-kdump

The files section of the 99-worker and 99-master files will need to be adjusted to use the source files. I'm a little unfamiliar with how these files are declared, so I will research what Ansible wants in the source fields.

tbaker1313 commented 1 month ago

Fixed some commits done with the wrong username due to a configuration mistake.

tbaker1313 commented 1 month ago

One possible way of monitoring the /data/kdump directory is using a cronjob like this:

/usr/bin/find /data/kdump -type f | xargs -r echo | mail -s "Kdump files found in /data/kdump on $(hostname -s)" blah@dxcas.com 2> /dev/null

tbaker1313 commented 1 month ago

Updated klab2 and emerald vault files.

tbaker1313 commented 1 month ago

Added some directions for how to test in the RHCOS README.md file.

tbaker1313 commented 1 month ago

Tested and it was successful; however, a small change to how the dump permissions are managed caused one node to enter a degraded state, and we haven't been able to fix it. Opened RH Case 03898782 to investigate further.

tbaker1313 commented 1 month ago

Another possible way to create a script to mail about kdumps is:

for entry in $(/usr/bin/find /data/kdump -type f); do echo "$entry"; done | mail -s "test from $(hostname -s)" name@dxcas.com

tbaker1313 commented 1 month ago

Completed in https://github.com/bcgov-c/platform-ops/pull/500. Opening a new ticket to deal with permissions and to notify about any kdump files.