StevenBarre closed 1 month ago
Have tested a POC to enable this in KLAB, working as expected.
shadowed Tim for this
Need to update the platform-ops repo as well as the DXC GitHub Ansible role that configures the cluster. We will also need to create a new user along with its private/public keypair; the public key then needs to be added to that user's .ssh/authorized_keys file.
Went on a call with Tim; it seems a new user with its own set of keys needs to be created, with the keys placed in the correct SSH files.
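A minimal sketch of generating that dedicated keypair (the account name `kdump-svc`, key type, and paths below are assumptions, not the final values):

```shell
# Generate a dedicated ed25519 keypair for the new kdump transfer account.
# "kdump-svc" and the key path are placeholders, not the final names.
keydir=$(mktemp -d)
ssh-keygen -t ed25519 -N '' -C 'kdump-svc' -f "$keydir/kdump_id_ed25519" >/dev/null
# Private key: distributed to the cluster nodes (stored in ansible-vault).
# Public key: appended to the transfer account's ~/.ssh/authorized_keys on
# the UTL server, e.g.:
#   cat "$keydir/kdump_id_ed25519.pub" >> ~kdump-svc/.ssh/authorized_keys
```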
Progress is being made towards completing this next week.
Rolled out to KLAB, but an action plan or playbook to move this to production is not yet completed.
Reviewing current state of things.
At present, KLAB has a proof-of-concept implementation done via a manually created MachineConfig named 99-worker-kdump, which uses a team member's SSH keypair to connect to the KLAB UTL server.
Proposed path forward:
An additional requirement worth adding: a daily AAP-driven check on all clusters for new kdumps. This ensures that VMs in LAB don't get missed after an actual kdump event, since smaller nodes/VMs may well restart without triggering anything else.
For this we'd use a directory structure where new dump files land in a "new" subfolder; as we investigate, we move files out of that folder so the check stops flagging them, without deleting the dump file entirely.
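That directory convention could look roughly like this (the paths and the helper name are illustrative, not the final layout):

```shell
KDUMP_DIR="${KDUMP_DIR:-/data/kdump}"

# list_new_dumps prints any not-yet-investigated dump files, one per line.
list_new_dumps() {
    find "$KDUMP_DIR/new" -type f 2>/dev/null || true
}

# The daily AAP-driven check would mail this list when it is non-empty.
new_dumps=$(list_new_dumps)
if [ -n "$new_dumps" ]; then
    printf '%s\n' "$new_dumps"
fi

# Once a dump has been investigated, retire it from the check without
# deleting it:
#   mv "$KDUMP_DIR/new/<dumpfile>" "$KDUMP_DIR/"
```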
Will start work next week on turning the content gathered for this into a suitable branch/PR, once the status of the SILVER Trident work is confirmed; that gets top priority next week in terms of resourcing as required.
About ready to start carving out PR content; just fleshing out where the variables go (regular inventory versus protected in ansible-vault) and which files will be regular files versus templates.
Started a possible PR for this. The variables still need fixing up, as I'm certain they aren't declared correctly, but as placeholders they will work until I can figure it out. New branch: https://github.com/bcgov-c/platform-ops/tree/node-kdump
The files section of the 99-worker and 99-master files will need to be adjusted to use the source files. I'm a little unfamiliar with how these files are declared, so I'll research what Ansible expects in the source fields.
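For reference, MachineConfig file entries carry their contents as a data: URL in `spec.config.storage.files[].contents.source`; one way to produce that URL from a local source file (the sample content below is a stand-in for the real file):

```shell
# Build the data: URL that a MachineConfig "files" entry expects in its
# contents.source field. The sample content stands in for the real source file.
src=$(mktemp)
printf 'path /data/kdump\n' > "$src"
url="data:text/plain;charset=utf-8;base64,$(base64 -w0 "$src")"
echo "$url"
```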
Fixed some commits that were made with the wrong username due to a configuration mistake.
One possible way of monitoring the /data/kdump directory is a cronjob like this:

```shell
files=$(/usr/bin/find /data/kdump -type f)
[ -n "$files" ] && echo "$files" | mail -s "Kdump files found in /data/kdump on $(hostname -s)" blah@dxcas.com
```
Updated klab2 and emerald vault files.
Added some directions for how to test in the RHCOS README.md file.
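For the README, the usual way to exercise kdump is to force a kernel crash from the node. Since that reboots the node, the sketch below guards on an explicit confirmation variable (the variable and function names are made up for illustration):

```shell
# DESTRUCTIVE when confirmed: forces a kernel crash so kdump captures a vmcore.
trigger_test_crash() {
    if [ "${CONFIRM_CRASH:-no}" = "yes" ]; then
        echo 1 > /proc/sys/kernel/sysrq
        echo c > /proc/sysrq-trigger   # the node crashes and reboots here
    else
        echo "refusing: set CONFIRM_CRASH=yes to crash this node"
    fi
}
trigger_test_crash
```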
Tested and was successful; however, a small change to how the dump permissions are managed caused one node to enter a degraded state, which I haven't been able to fix. Opened RH Case 03898782 to investigate further.
Another possible way to create a script to mail about kdumps is:

```shell
found=$(for entry in $(/usr/bin/find /data/kdump -type f); do echo "$entry"; done)
[ -n "$found" ] && echo "$found" | mail -s "test from $(hostname -s)" name@dxcas.com
```
Completed in https://github.com/bcgov-c/platform-ops/pull/500 - opening a new ticket to deal with permissions and notify about any kdump files.
**Describe the issue**
When a kernel panic happens (rarely) and causes a node to reboot, it would be helpful if a dump of the kernel could be captured for analysis by Red Hat.

**What is the Value/Impact?**
Improved debugging ability.

**What is the plan? How will this get completed?**
Kdump on OCP docs: https://docs.openshift.com/container-platform/4.13/support/troubleshooting/troubleshooting-operating-system-issues.html
General kdump docs: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html/managing_monitoring_and_updating_the_kernel/configuring-kdump-on-the-command-line_managing-monitoring-and-updating-the-kernel
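For reference, dumping over SSH is configured in /etc/kdump.conf with directives along these lines (the user, host, and key path below are placeholders, not the final values):

```
ssh kdump-svc@utl.example.com
sshkey /root/.ssh/kdump_id_ed25519
path /data/kdump
core_collector makedumpfile -F -l --message-level 7 -d 31
```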
**Identify any dependencies**
None

**Definition of done**