BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team (this includes work for: 1. Platform Experience, 2. Developer Experience, 3. Platform Operations/OCP 3)
Apache License 2.0

Documentation - Add notes for checking NetApp during Node Rebuilds #4907

Closed wmhutchison closed 2 months ago

wmhutchison commented 3 months ago

Describe the issue A rare edge case which has come up a few times over the years: deleting an OpenShift node does not correctly trigger removal of that node's iSCSI initiator from all initiator groups (igroups) inside NetApp, so the rebuilt node does not allow block storage to be created. This ticket aims to capture the essentials of this use case so that future occurrences are handled correctly.


Blocked By Need to review whether the use case this ticket is based on still applies, since the original case involved an older version of Trident not understanding newer per-node igroups and the orphaned LUN mappings within them. Recent Trident upgrade work may have made this moot. Will finish the evaluation within the week and likely close this ticket off. A bit of a waste of time, but better that than creating docs for a scenario which can no longer occur, since the Trident upgrade work will involve fixing up all inconsistencies outside of a node rebuild event.

How does this benefit the users of our platform? Being able to handle rare edge cases ensures production nodes can be rebuilt in a timely manner when required, without getting stuck.


wmhutchison commented 3 months ago

The core of the recent issue encountered in CLAB: when you delete a node from OpenShift, the corresponding NetApp device should no longer have any initiators for the deleted node, since Trident should have removed them, assuming you did the right thing and drained the node first to move storage off it.
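For reference, a minimal sketch of the drain-then-delete sequence this assumes is below; the node name is illustrative and taken from the event sample later in this ticket.

  # Draining first gives Trident the chance to unpublish volumes and remove
  # this node's initiator from its igroups before the node object is deleted.
  oc adm drain mcs-clab-app-02 --ignore-daemonsets --delete-emptydir-data
  oc delete node mcs-clab-app-02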

This ticket will focus on that scenario, since enough text/logs were saved from it; if it happens again, having it documented here should make troubleshooting efforts much faster.

wmhutchison commented 3 months ago

The following is a sample of what can be expected when block storage fails to mount while testing a rebuilt node where something is amiss. It comes from the Events section of oc describe pod <pod trying to mount block storage>.

  Warning  FailedMount         88s                  kubelet                  Unable to attach or mount volumes: unmounted volumes=[block], unattached volumes=[block kube-api-access-hfhqq postgresql-data]: timed out waiting for the condition
  Warning  FailedAttachVolume  78s (x9 over 3m30s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-<redacted>" : rpc error: code = Internal desc = unable to update node access rules on backend netapp-block-standard; error adding IQN iqn.2020-02.ca.bc.gov.devops.clab:mcs-clab-app-02 to igroup trident-<redacted>: error adding IQN iqn.2020-02.ca.bc.gov.devops.clab:mcs-clab-app-02 to igroup trident-<redacted>: API status: failed, Reason: You cannot add the initiator to an initiator group if the initiator is already present in another initiator group and both initiator groups have the same LUN map, Code: 9009

This scenario can result when, during node deletion, Trident does not remove every instance of that node's initiator from all matching initiator groups (igroups). After deleting a node, SSH into the NetApp instance, review all of the defined initiator groups, and confirm the node's initiator has been removed; if it has not, troubleshoot and resolve this before attempting a new node rebuild.
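For illustration, a quick check from the NetApp CLI for whether the deleted node's IQN is still referenced anywhere might look like the following; the -initiator filter on lun igroup show is assumed to be available on the ONTAP version in use, and the IQN is taken from the event sample above.

  # List any igroups that still reference the deleted node's initiator.
  # An empty result means Trident cleaned up as expected.
  lun igroup show -initiator iqn.2020-02.ca.bc.gov.devops.clab:mcs-clab-app-02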

wmhutchison commented 3 months ago

Adding some notes regarding specific NetApp commands.

To show all igroups on the NetApp instance: lun igroup show

To remove an initiator from an igroup (this will not work if LUNs are still mapped to it): lun igroup remove -igroup <> -initiator <>

To see which LUNs are still mapped to a specific igroup: lun mapping show -igroup <igroup name>

To see which igroups a specific LUN path is mapped to: lun mapping show -path </vol/...>
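Putting those together, a hypothetical cleanup flow for a leftover node initiator could look like the sketch below. The igroup name, LUN path, and IQN are placeholders, lun mapping delete is an assumption about the available ONTAP commands (a -vserver argument may also be required depending on the shell context), and an orphaned mapping should only be deleted once it is confirmed to be stale.

  # 1. Find the igroups still referencing the node and their remaining LUN maps.
  lun igroup show
  lun mapping show -igroup trident-example-igroup
  # 2. Clear any orphaned LUN mapping, since mapped LUNs block initiator removal.
  lun mapping delete -igroup trident-example-igroup -path /vol/example_vol/example_lun
  # 3. Remove the stale initiator from the igroup.
  lun igroup remove -igroup trident-example-igroup -initiator iqn.2020-02.ca.bc.gov.devops.clab:mcs-clab-app-02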

wmhutchison commented 3 months ago

Right now CLAB and KLAB are in a weird state after having upgraded and then downgraded Trident between versions that changed how it creates igroups. A good next step is to rebuild another CLAB node where an igroup with the newer naming syntax exists and has valid LUNs currently mapped to it. We want to see whether node deletion correctly moves the involved LUN maps into the expected igroup while removing them from the unsupported igroup, or whether it creates a duplicate LUN mapping that has to be cleaned up manually.

The duplicate LUN mapping is, I believe, what we encountered in the past when we'd ask Storage about a block volume not mounting. That approach was reactive, since we had to wait for the volume in question to be mounted, and it didn't cover block volumes that weren't mounted because their workloads had been spun down (like some DB pods).

wmhutchison commented 3 months ago

Been working on this offline in Notepad so that the bulk of the content can be drafted before moving to VSC and carving out a suitable PR. One last thing to do before VSC time is a couple more CLAB node rebuilds on nodes not yet touched, in hopes of triggering some of the previous issues. Want to capture as much as possible in this wave, but don't want this to turn into a time sink of its own.

wmhutchison commented 2 months ago

Ugh. Paring back almost all of what was captured during the last round of CLAB node rebuilds, since the use case we can currently replicate reliably is going away in LAB next week as we move forward with Trident upgrades. If all goes to plan, we'll be left with only per-node igroups, which means the likelihood of replicating our current issues all but drops to nil, and if it does happen, the fix is more straightforward: nuke the node-specific igroup/initiator if Trident doesn't do it for us.

Cleaning up the raw text to match this new situation, where we just need valid samples of the various NetApp CLI commands; that simplifies figuring out the troubleshooting steps a lot. Will then do a node rebuild after LAB reaches the desired Trident version to see whether the problems can still be replicated. If they can, grab the new details to feed into this doc. Otherwise, leave it at "use these specific commands to remove an initiator and/or igroup, and deal with orphaned LUN mappings if they still exist, since those block removal".
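For the simplified per-node igroup case described above, the cleanup would amount to something like the sketch below. The names are placeholders, and lun igroup delete is an assumption about the available ONTAP commands; it should only be run once no LUNs remain mapped to the igroup.

  # Per-node igroup cleanup if Trident does not remove it on node deletion.
  # Deal with any orphaned LUN mappings first, since those block removal.
  lun mapping show -igroup <per-node igroup>
  lun igroup remove -igroup <per-node igroup> -initiator <node IQN>
  lun igroup delete -igroup <per-node igroup>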

wmhutchison commented 2 months ago

An unexpected family emergency consumed an entire week's worth of time that would otherwise have gone into completing this. This ticket will therefore extend into the following sprint, where completing it will be prioritized early.

wmhutchison commented 2 months ago

Adding to the mix here are all of the recent NetApp emails involving the LAB clusters and the recent Trident upgrades. Need to ensure any relevant snippets are also included in the documentation, since a quick glance shows a few CLI commands worth documenting that had not yet been considered.

wmhutchison commented 2 months ago

Going to have to re-evaluate before moving forward on this specific ticket. Based on what I've worked on to date, it appears to target a use case which is frankly going away even as we speak: the previous node-rebuild issues were caused by an older Trident not knowing how to deal with the newer per-node igroups, and we're now actively moving to a Trident version which does understand them. Putting this ticket in Blocked for now; I'll work on other tickets and continue to monitor the ongoing Trident work. I do not want to cause confusion by creating troubleshooting documentation for a use case which is going away or already has. To properly document what I set out to do, I would need to replicate and resolve a node-rebuild issue with the upgraded Trident.

wmhutchison commented 2 months ago

Closing this off. The documentation really was tailored to the unique situation on both KLAB and CLAB; even SILVER's igroup listing wasn't the same. Once we finish upgrading Trident across the board to per-node igroups, we'll most likely stay there, making the other portions of the docs moot unless NetApp makes another significant back-end change, which is highly doubtful. The change already made was an understandable one and, as demonstrated, not trivial to implement, so I can't see NetApp making further changes to the new setup anytime soon.

Closing this ticket.