Closed loopway closed 3 weeks ago
hi @loopway
can you please share the HostDefiner custom resource YAML? in that YAML, it should be enough to uncomment the `connectivityType` field and set it to `fc`. is that what you did?
see https://raw.githubusercontent.com/IBM/ibm-block-csi-operator/v1.11.3/config/samples/csi_v1_hostdefiner_cr.yaml for a sample YAML that still has the field commented out; you can take that for comparison and make the changes mentioned above to resolve the issue
another option is to disable NVMe on the host side: the HostDefiner detects NVMe and uses it by default, so disabling NVMe on the host removes that option from the HostDefiner's decision tree
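The decision tree mentioned above can be sketched roughly as follows. This is illustrative Python only, not the actual HostDefiner code; the exact priority order (NVMe first, then FC, then iSCSI) is an assumption based on the behavior described in this thread.

```python
# Hypothetical sketch of how a host-definer-style component might pick a
# protocol from the initiators it detects on a node. Not the driver's API.

def pick_connectivity(nvme_nqns, fc_wwns, iscsi_iqns, forced_type=None):
    """Return the connectivity type to use when defining a new host."""
    if forced_type:      # e.g. connectivityType: fc set in the CR
        return forced_type
    if nvme_nqns:        # an NQN present on the node wins by default
        return "nvmeofc"
    if fc_wwns:
        return "fc"
    if iscsi_iqns:
        return "iscsi"
    raise ValueError("node has no initiators")

# Disabling NVMe on the host removes the NQN, so the first branch never fires:
print(pick_connectivity([], ["51402ec01482c418"], ["iqn.1994-05.com.redhat:x"]))
```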
hi @kasserater, thanks for your quick follow-up. here's our host-definer spec:
```yaml
...
spec:
  hostDefiner:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/arch
                  operator: In
                  values:
                    - amd64
    allowDelete: true
    connectivityType: fc
    dynamicNodeLabeling: true
    imagePullPolicy: IfNotPresent
    prefix: ocp-01_
    repository: quay.io/ibmcsiblock/ibm-block-csi-host-definer
    tag: 1.11.2
...
```
Is there a way to force FC-only connections with the CSI? If not, we will try to disable the NVMe capabilities of the kernel driver on the hosts, which will probably require a reboot of the hosts...
hmm, so it seems your host-definer spec is properly configured to force the use of FC and not NVMe. that should suffice. are you still encountering issues with this current spec? if so, can you please provide logs?
yes, unfortunately with this spec we get the errors mentioned in the issue description. can you please let me know which logs you need in addition to the excerpts?
it would be best if you could supply the HostDefiner pod logs as well as the IBM CSI controller pod logs
Here are the requested log files. FYI: I replaced our domain with `example.com`.
host-definer-hostdefiner-786c95d95c-xz5m4-ibm-block-csi-host-definer.log
ibm-block-csi-controller-0-ibm-block-csi-controller.log
ok, so here is the issue as understood from the logs: the connectivityType is only used when a new host is being defined on the cluster. that said, there is a preliminary step by the HostDefiner that checks whether a host matching the node's initiators was already created on the storage. this check detects that there are NQNs on the node, so it queries the storage side for hosts with those NQNs. this is done with the `lsnvmefabric` command, but since that command is not supported on the storage side, it fails and the HostDefiner goes into an exception-handling branch.
```
2024-07-17 13:24:26,557 DEBUG [140581538267392] [MainThread] (array_connection_pool.py:create:36) - Creating a new connection for endpoint 5030.prod.example.com
2024-07-17 13:24:26,557 DEBUG [140581538267392] [MainThread] (array_mediator_svc.py:__init__:266) - in __init__
2024-07-17 13:24:26,557 DEBUG [140581538267392] [MainThread] (array_mediator_svc.py:_connect:270) - Connecting to SVC 5030.prod.example.com
2024-07-17 13:24:27,576 DEBUG [140581538267392] [MainThread] (utils.py:get_node_id_info:37) - getting node info for node id : ocp0;nqn.2014-08.org.nvmexpress:uuid:56f808db-dbc0-4c0d-8dbd-5d0a01120e69;51402ec01482c418:51402ec01482c41a:51402ec0110ecba4:51402ec0110ecba6;iqn.1994-05.com.redhat:714b6fdda0da
2024-07-17 13:24:27,577 DEBUG [140581538267392] [MainThread] (utils.py:get_node_id_info:48) - node name : ocp0, nvme_nqn: nqn.2014-08.org.nvmexpress:uuid:56f808db-dbc0-4c0d-8dbd-5d0a01120e69, fc_wwns : 51402ec01482c418:51402ec01482c41a:51402ec0110ecba4:51402ec0110ecba6, iscsi_iqn : iqn.1994-05.com.redhat:714b6fdda0da
2024-07-17 13:24:27,577 DEBUG [140581538267392] [MainThread] (array_mediator_svc.py:get_host_by_host_identifiers:1043) - Getting host name for initiators : Initiators(nvme_nqns=['nqn.2014-08.org.nvmexpress:uuid:56f808db-dbc0-4c0d-8dbd-5d0a01120e69'], fc_wwns=['51402ec01482c418', '51402ec01482c41a', '51402ec0110ecba4', '51402ec0110ecba6'], iscsi_iqns=['iqn.1994-05.com.redhat:714b6fdda0da'])
2024-07-17 13:24:27,800 ERROR [140581538267392] [MainThread] (array_mediator_svc.py:_lsnvmefabric:988) - Failed to get nvme fabrics. Reason is: CLI failure. Return code is 1. Error message is "b'CMMVC7205E The command failed because it is not supported.\n'"
2024-07-17 13:24:27,801 ERROR [140581538267392] [MainThread] (host_definer_server.py:define_host:49) - CLI failure. Return code is 1. Error message is "b'CMMVC7205E The command failed because it is not supported.\n'"
Traceback (most recent call last):
  File "/driver/controllers/servers/host_definer/storage_manager/host_definer_server.py", line 31, in define_host
    found_host_name = self._get_host_name(initiators_from_host_definition, array_mediator)
  File "/driver/controllers/servers/host_definer/storage_manager/host_definer_server.py", line 75, in _get_host_name
    found_hostname, = array_mediator.get_host_by_host_identifiers(initiators)
  File "/driver/controllers/array_action/array_mediator_svc.py", line 1044, in get_host_by_host_identifiers
    host_names, connectivity_types = self._get_host_names_and_connectivity_types(initiators)
  File "/driver/controllers/array_action/array_mediator_svc.py", line 1026, in _get_host_names_and_connectivity_types
    nvme_host_names = self._get_host_names_by_nqn(initiator)
  File "/driver/controllers/array_action/array_mediator_svc.py", line 997, in _get_host_names_by_nqn
    nvme_fabrics = self._lsnvmefabric(nqn)
  File "/driver/controllers/array_action/array_mediator_svc.py", line 990, in _lsnvmefabric
    raise ex
  File "/driver/controllers/array_action/array_mediator_svc.py", line 986, in _lsnvmefabric
    return self.client.svcinfo.lsnvmefabric(remotenqn=host_nqn).as_list
  File "/opt/app-root/lib64/python3.8/site-packages/pysvc/unified/client.py", line 139, in __call__
    return self.referent(self.context, **kwargs)
  File "/opt/app-root/lib64/python3.8/site-packages/pysvc/unified/clispec.py", line 211, in __call__
    raise e
  File "/opt/app-root/lib64/python3.8/site-packages/pysvc/unified/clispec.py", line 207, in __call__
    resp = self.resp_helper(resp, extra)
  File "/opt/app-root/lib64/python3.8/site-packages/pysvc/unified/response.py", line 84, in __init__
    self.result = self.parse(resp, **kwargs)
  File "/opt/app-root/lib64/python3.8/site-packages/pysvc/unified/response.py", line 118, in parse
    raise CLIFailureError(
pysvc.unified.response.CLIFailureError: CLI failure. Return code is 1. Error message is "b'CMMVC7205E The command failed because it is not supported.\n'"
```
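The failure path in the traceback can be condensed into a small illustrative sketch. This is heavily simplified and not the driver's actual code; the function and exception names are modeled on the traceback above. The key point: the pre-existing-host lookup iterates over all detected initiators, so a node NQN triggers an `lsnvmefabric` query even when `connectivityType: fc` is set, and on storage without NVMe support that query raises before the FC lookup is ever reached.

```python
# Simplified model of the failing lookup path (names modeled on the
# traceback; not the actual ibm-block-csi-driver implementation).

class CLIFailureError(Exception):
    pass

def lsnvmefabric(nqn):
    # Stand-in for the SVC CLI call, which this storage system rejects.
    raise CLIFailureError(
        "CMMVC7205E The command failed because it is not supported.")

def get_host_by_host_identifiers(nvme_nqns, fc_wwns):
    # The NQN branch runs first and its exception is not caught here,
    # so the FC lookup below is never reached.
    for nqn in nvme_nqns:
        lsnvmefabric(nqn)   # raises on storage without NVMe support
    for wwn in fc_wwns:
        pass                # FC host lookup would happen here

try:
    get_host_by_host_identifiers(
        ["nqn.2014-08.org.nvmexpress:uuid:56f808db-dbc0-4c0d-8dbd-5d0a01120e69"],
        ["51402ec01482c418"],
    )
except CLIFailureError as e:
    print(f"define_host failed: {e}")
```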
so indeed, setting the connectivityType doesn't help here. we will need to improve this behavior in a future release
for now, removing the NQNs from the host side should mitigate the issue
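Why removing the NQNs helps can be shown with a minimal sketch (hypothetical function, not the driver's API): the storage lookup only issues the unsupported NVMe query when the node reports an NQN, so with no NQN the FC path proceeds normally.

```python
# Minimal illustration of the mitigation (hypothetical code, for clarity only).

def lookup_host(nvme_nqns, fc_wwns):
    if nvme_nqns:
        # With an NQN present, the unsupported lsnvmefabric call is issued.
        raise RuntimeError("CMMVC7205E The command failed because it is not supported.")
    return f"fc lookup for {len(fc_wwns)} WWPNs"  # FC path succeeds

# With NVMe disabled on the host, no NQN is reported and the lookup works:
print(lookup_host([], ["51402ec01482c418", "51402ec01482c41a"]))
```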
fixed in 740addaf98c67eba24e6184551882bd362e9fa03 (will be included in the upcoming 1.12.0 release)
Environment:
IBM FlashSystem 5030 (8.5.0.11) connected via FC to three bare-metal OpenShift nodes (4.15.19) with the IBM block storage CSI driver operator (1.11.3) installed
Problem Description:
New LUNs get created on the storage system but are not mapped to hosts. See the errors in the logs below.
Logs:
pod event:
```
AttachVolume.Attach failed for volume "pvc-bdf5c3bf-a4d9-4bad-b7ee-3733b5184c20" : rpc error: code = Internal desc = CLI failure. Return code is 1. Error message is "b'CMMVC7205E The command failed because it is not supported.\n'"
```
ibm-block-csi-controller-0 log:
```
2024-07-08 23:42:19,628 ERROR [140216497063680] [SVC:4;60050763808104F70800000000000035] (exception_handler.py:handle_exception:35) - CLI failure. Return code is 1. Error message is "b'CMMVC7205E The command failed because it is not supported.\n'"
```
host-definer-hostdefiner-59c7c7548c-fm7n7 log:
```
...
2024-07-08 21:45:51,318 DEBUG [140352619017984] [Thread-9] (utils.py:get_node_id_info:48) - node name : ocp0, nvme_nqn: nqn.2014-08.org.nvmexpress:uuid:56f808db-dbc0-4c0d-8dbd-5d0a01120e69, fc_wwns : 51402ec01482c418:51402ec01482c41a:51402ec0110ecba4:51402ec0110ecba6, iscsi_iqn : iqn.1994-05.com.redhat:714b6fdda0da
...
2024-07-08 21:45:51,744 ERROR [140352610625280] [Thread-10] (array_mediator_svc.py:_lsnvmefabric:988) - Failed to get nvme fabrics. Reason is: CLI failure. Return code is 1. Error message is "b'CMMVC7205E The command failed because it is not supported.\n'"
```
Configuration:
Even though we have set the environment variable `CONNECTIVITY_TYPE=fc` on the host-definer-hostdefiner, it seems the CSI still tries to connect to the LUN over NVMe over FC, which obviously fails since it's not available in our setup. 🏴‍☠️