IBM / ibm-block-csi-driver

The IBM block storage CSI driver enables container orchestrators, such as Kubernetes and Openshift, to manage the life-cycle of persistent storage

CMMVC7205E The command failed because it is not supported. #717

Closed · loopway closed this issue 3 weeks ago

loopway commented 1 month ago

Environment:

IBM FlashSystem 5030 (8.5.0.11), connected via FC to three bare-metal OpenShift nodes (4.15.19) with the IBM block storage CSI driver operator installed (1.11.3).

Problem Description:

New LUNs are created on the storage system, but they are not mapped to the hosts. See the errors in the logs below.

Logs:

pod event: AttachVolume.Attach failed for volume "pvc-bdf5c3bf-a4d9-4bad-b7ee-3733b5184c20" : rpc error: code = Internal desc = CLI failure. Return code is 1. Error message is "b'CMMVC7205E The command failed because it is not supported.\n'"

ibm-block-csi-controller-0 log: 2024-07-08 23:42:19,628 ERROR [140216497063680] [SVC:4;60050763808104F70800000000000035] (exception_handler.py:handle_exception:35) - CLI failure. Return code is 1. Error message is "b'CMMVC7205E The command failed because it is not supported.\n'"

host-definer-hostdefiner-59c7c7548c-fm7n7 log:
... 2024-07-08 21:45:51,318 DEBUG [140352619017984] [Thread-9] (utils.py:get_node_id_info:48) - node name : ocp0, nvme_nqn: nqn.2014-08.org.nvmexpress:uuid:56f808db-dbc0-4c0d-8dbd-5d0a01120e69, fc_wwns : 51402ec01482c418:51402ec01482c41a:51402ec0110ecba4:51402ec0110ecba6, iscsi_iqn : iqn.1994-05.com.redhat:714b6fdda0da
... 2024-07-08 21:45:51,744 ERROR [140352610625280] [Thread-10] (array_mediator_svc.py:_lsnvmefabric:988) - Failed to get nvme fabrics. Reason is: CLI failure. Return code is 1. Error message is "b'CMMVC7205E The command failed because it is not supported.\n'"

Configuration

Even though we have set the environment variable CONNECTIVITY_TYPE = fc on the host-definer-hostdefiner, the CSI driver still seems to try to connect to the LUN over NVMe over FC, and that obviously fails, since it is not available in our setup. 🏴‍☠️

kasserater commented 1 month ago

hi @loopway can you please share the hostdefiner custom resource YAML? in that YAML it should be enough to uncomment the connectivityType: field and set it to fc. is that what you did? see https://raw.githubusercontent.com/IBM/ibm-block-csi-operator/v1.11.3/config/samples/csi_v1_hostdefiner_cr.yaml for a sample YAML where that field is still commented out; you can use it for comparison and apply the change mentioned above to resolve this.
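for reference, the edited fragment would look roughly like this (a minimal sketch only; apiVersion, kind and field names follow the linked sample CR, all other fields stay as in that sample):

```yaml
apiVersion: csi.ibm.com/v1
kind: HostDefiner
metadata:
  name: host-definer
spec:
  hostDefiner:
    # uncommented and set explicitly so only Fibre Channel host definitions are created
    connectivityType: fc
```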

kasserater commented 1 month ago

another option is to disable NVMe on the host side. hostdefiner detects it and uses it by default, so disabling NVMe on the host removes that option from the hostdefiner decision logic.
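for illustration only, one possible way to do this on OpenShift is a MachineConfig that blacklists the NVMe fabrics kernel modules on the worker nodes. this is just a sketch of the idea -- the object name and module list below are assumptions, i haven't verified that blacklisting the modules alone is enough for hostdefiner to stop detecting an NQN on the node, and applying a MachineConfig triggers a rolling reboot of the affected nodes:

```yaml
# sketch only: prevent the NVMe-oF transport modules from loading on worker nodes.
# the name and module list are illustrative assumptions, not a tested recommendation.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-blacklist-nvme-fabrics
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
    - module_blacklist=nvme_fc,nvme_tcp,nvme_rdma
```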

kasserater commented 1 month ago

please see https://www.ibm.com/docs/en/stg-block-csi-driver/1.11.3?topic=configuring-host-definer

loopway commented 1 month ago

hi @kasserater thanks for your quick follow up. here's our host-definer spec:

```yaml
...
spec:
  hostDefiner:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/arch
                  operator: In
                  values:
                    - amd64
    allowDelete: true
    connectivityType: fc
    dynamicNodeLabeling: true
    imagePullPolicy: IfNotPresent
    prefix: ocp-01_
    repository: quay.io/ibmcsiblock/ibm-block-csi-host-definer
    tag: 1.11.2
...
```

Is there a way to force FC-only connections with the CSI? If not, we will try to disable the NVMe capabilities of the kernel driver on the hosts, which will probably require a reboot of the hosts...

kasserater commented 1 month ago

hmm, so it seems your host-definer spec is properly configured to force the use of FC and not NVMe. that should suffice. are you still encountering issues with this current spec? if so, can you please provide logs?

loopway commented 1 month ago

yes unfortunately with this spec we get the errors mentioned in the issue description. can you please let me know which logs you would need in addition to the excerpts?

kasserater commented 1 month ago

it would be best if you can supply the HostDefiner pod logs, as well as the IBM CSI controller pod logs

loopway commented 1 month ago

Here are the requested log files. FYI: I replaced our domain with example.com.

host-definer-hostdefiner-786c95d95c-xz5m4-ibm-block-csi-host-definer.log
ibm-block-csi-controller-0-ibm-block-csi-controller.log

kasserater commented 1 month ago

ok, so here is the issue as understood from the logs: the connectivityType is only used when a new host is being defined on the cluster. before that, there is a preliminary step in which hostDefiner checks whether a host matching the node's initiators was already created on the storage. this check detects that the node has an NQN, so it queries the storage for hosts with that NQN. this is done with the lsnvmefabric command, but since that command is not supported on this storage system, it fails and hostDefiner goes into an exception handling branch.

```
2024-07-17 13:24:26,557 DEBUG [140581538267392] [MainThread] (array_connection_pool.py:create:36) - Creating a new connection for endpoint 5030.prod.example.com
2024-07-17 13:24:26,557 DEBUG [140581538267392] [MainThread] (array_mediator_svc.py:init:266) - in init
2024-07-17 13:24:26,557 DEBUG [140581538267392] [MainThread] (array_mediator_svc.py:_connect:270) - Connecting to SVC 5030.prod.example.com
2024-07-17 13:24:27,576 DEBUG [140581538267392] [MainThread] (utils.py:get_node_id_info:37) - getting node info for node id : ocp0;nqn.2014-08.org.nvmexpress:uuid:56f808db-dbc0-4c0d-8dbd-5d0a01120e69;51402ec01482c418:51402ec01482c41a:51402ec0110ecba4:51402ec0110ecba6;iqn.1994-05.com.redhat:714b6fdda0da
2024-07-17 13:24:27,577 DEBUG [140581538267392] [MainThread] (utils.py:get_node_id_info:48) - node name : ocp0, nvme_nqn: nqn.2014-08.org.nvmexpress:uuid:56f808db-dbc0-4c0d-8dbd-5d0a01120e69, fc_wwns : 51402ec01482c418:51402ec01482c41a:51402ec0110ecba4:51402ec0110ecba6, iscsi_iqn : iqn.1994-05.com.redhat:714b6fdda0da
2024-07-17 13:24:27,577 DEBUG [140581538267392] [MainThread] (array_mediator_svc.py:get_host_by_host_identifiers:1043) - Getting host name for initiators : Initiators(nvme_nqns=['nqn.2014-08.org.nvmexpress:uuid:56f808db-dbc0-4c0d-8dbd-5d0a01120e69'], fc_wwns=['51402ec01482c418', '51402ec01482c41a', '51402ec0110ecba4', '51402ec0110ecba6'], iscsi_iqns=['iqn.1994-05.com.redhat:714b6fdda0da'])
2024-07-17 13:24:27,800 ERROR [140581538267392] [MainThread] (array_mediator_svc.py:_lsnvmefabric:988) - Failed to get nvme fabrics. Reason is: CLI failure. Return code is 1. Error message is "b'CMMVC7205E The command failed because it is not supported.\n'"
2024-07-17 13:24:27,801 ERROR [140581538267392] [MainThread] (host_definer_server.py:define_host:49) - CLI failure. Return code is 1. Error message is "b'CMMVC7205E The command failed because it is not supported.\n'"
Traceback (most recent call last):
  File "/driver/controllers/servers/host_definer/storage_manager/host_definer_server.py", line 31, in define_host
    found_host_name = self._get_host_name(initiators_from_host_definition, array_mediator)
  File "/driver/controllers/servers/host_definer/storage_manager/host_definer_server.py", line 75, in _get_host_name
    found_hostname, = array_mediator.get_host_by_host_identifiers(initiators)
  File "/driver/controllers/array_action/array_mediator_svc.py", line 1044, in get_host_by_host_identifiers
    host_names, connectivity_types = self._get_host_names_and_connectivity_types(initiators)
  File "/driver/controllers/array_action/array_mediator_svc.py", line 1026, in _get_host_names_and_connectivity_types
    nvme_host_names = self._get_host_names_by_nqn(initiator)
  File "/driver/controllers/array_action/array_mediator_svc.py", line 997, in _get_host_names_by_nqn
    nvme_fabrics = self._lsnvmefabric(nqn)
  File "/driver/controllers/array_action/array_mediator_svc.py", line 990, in _lsnvmefabric
    raise ex
  File "/driver/controllers/array_action/array_mediator_svc.py", line 986, in _lsnvmefabric
    return self.client.svcinfo.lsnvmefabric(remotenqn=host_nqn).as_list
  File "/opt/app-root/lib64/python3.8/site-packages/pysvc/unified/client.py", line 139, in call
    return self.referent(self.context, kwargs)
  File "/opt/app-root/lib64/python3.8/site-packages/pysvc/unified/clispec.py", line 211, in call
    raise e
  File "/opt/app-root/lib64/python3.8/site-packages/pysvc/unified/clispec.py", line 207, in call
    resp = self.resp_helper(resp, extra)
  File "/opt/app-root/lib64/python3.8/site-packages/pysvc/unified/response.py", line 84, in init
    self.result = self.parse(resp, kwargs)
  File "/opt/app-root/lib64/python3.8/site-packages/pysvc/unified/response.py", line 118, in parse
    raise CLIFailureError(
pysvc.unified.response.CLIFailureError: CLI failure. Return code is 1. Error message is "b'CMMVC7205E The command failed because it is not supported.\n'"
```

so indeed setting the connectivityType doesn't help. we will need to improve this behavior in a future release
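to illustrate the kind of change needed (a rough sketch of the idea only, simplified from the method names in the traceback above, not the actual patch): the _lsnvmefabric helper could treat the "command not supported" error as "no NVMe hosts found" instead of letting the exception abort the whole host lookup.

```python
# rough sketch only -- class and method names follow the traceback above,
# everything else is simplified and not the real driver code
UNSUPPORTED_COMMAND_CODE = "CMMVC7205E"  # storage reports the command is not supported


class CLIFailureError(Exception):
    """stand-in for pysvc.unified.response.CLIFailureError"""


class ArrayMediatorSVC:
    def __init__(self, client):
        self.client = client  # pysvc client connected to the storage system

    def _lsnvmefabric(self, host_nqn):
        try:
            return self.client.svcinfo.lsnvmefabric(remotenqn=host_nqn).as_list
        except CLIFailureError as ex:
            if UNSUPPORTED_COMMAND_CODE in str(ex):
                # the storage system has no NVMe-oF support, so no NVMe host
                # can exist there -- return an empty result and let the caller
                # fall back to the FC/iSCSI lookups
                return []
            raise
```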

for now, removing the NQNs from the host side should mitigate the issue

kasserater commented 3 weeks ago

fixed in 740addaf98c67eba24e6184551882bd362e9fa03 (will be included in the upcoming 1.12.0 release)