cornelisnetworks / opa-psm2

libpsm2 cannot find sysfs entry for hfi1_0 on rdma-core v24.0 #43

Open bsmith94 opened 4 years ago

bsmith94 commented 4 years ago

libpsm2 looks for sysfs entries under the path /sys/class/infiniband/hfi1_x. With rdma-core v24.0, the device is renamed according to its device type, PCI bus and device, a la "predictable interface names". This is described at https://patchwork.kernel.org/cover/10870443/ .

On my host, the sysfs path for hfi1_0 is /sys/class/infiniband/opap129s. Thus, libpsm2 fails to find the hfi1_0 sysfs entry in hfi_sysfs_port_open.

The behavior can be observed by executing fi_info on a Debian sid/bullseye host with libfabric-bin and libpsm2-2 installed. The psm2 providers will not be listed in the output. Debug output indicates that no active psm2 device is found.

$ FI_LOG_LEVEL=debug fi_info
...
libfabric:psm2:core:psmx2_init_lib():236<info> PSM2 header version = (2, 1)
libfabric:psm2:core:psmx2_init_lib():238<info> PSM2 library version = (2, 1)
libfabric:psm2:core:psmx2_init_lib():241<info> PSM2 multi-ep feature enabled.
libfabric:psm2:core:psmx2_update_hfi_info():338<warn> Failed to read number of free contexts from HFI unit 0
libfabric:psm2:core:psmx2_update_hfi_info():379<info> hfi1 units: total 1, active 0; hfi1 contexts: total 0, free 0
libfabric:psm2:core:psmx2_update_hfi_info():390<info> Tx/Rx contexts: 0 in total, 0 available.
libfabric:psm2:core:psmx2_getinfo():436<info> no PSM2 device is active.
libfabric:core:core:fi_getinfo_():751<warn> fi_getinfo: provider psm2 returned -61 (No data available)
...
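
Because the entry name is no longer fixed, one way to locate the device is by the driver it is bound to rather than by name. The following is a minimal sketch, not something libpsm2 does: the function name list_hfi1_devices is hypothetical, and it assumes each entry under /sys/class/infiniband has a device/driver symlink pointing at the bound kernel driver.

```shell
# list_hfi1_devices [CLASS_DIR]
# Hypothetical helper: print the name of every entry under CLASS_DIR
# (default /sys/class/infiniband) whose backing device is bound to the
# hfi1 driver, whatever the entry happens to be called.
list_hfi1_devices() {
    class_dir="${1:-/sys/class/infiniband}"
    for dev in "$class_dir"/*; do
        # device/driver is a symlink to the bound kernel driver.
        # An unmatched glob or a driverless entry is skipped here.
        [ -L "$dev/device/driver" ] || continue
        driver=$(basename "$(readlink "$dev/device/driver")")
        if [ "$driver" = "hfi1" ]; then
            basename "$dev"
        fi
    done
}
```

On a host like the one above this would be expected to print the renamed entry (for example opap129s) instead of hfi1_0.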

I have found two orthogonal workarounds for this problem:

  1. Set HFI_SYSFS_PATH, e.g. HFI_SYSFS_PATH=/sys/class/infiniband/opap129s fi_info. The "129" portion of the value must match the PCI bus of the HFI card.
  2. Or modify /lib/udev/rules.d/60-rdma-persistent-naming.rules to contain: ACTION=="add", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_KERNEL"
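
For the first workaround, the number in the device name can be worked out from the card's PCI address. This sketch assumes the number is the PCI bus written in decimal, while tools such as lspci print the bus in hexadecimal. The helper name pci_bus_decimal and the example address are hypothetical.

```shell
# pci_bus_decimal ADDRESS
# Hypothetical helper: take a PCI address such as 0000:81:00.0 (domain
# optional) and print its bus number in decimal.
pci_bus_decimal() {
    bus=${1%:*}      # drop ":slot.function"  -> 0000:81
    bus=${bus##*:}   # drop the domain, if any -> 81
    # printf accepts a C-style hex constant for %d.
    printf '%d\n' "0x$bus"
}
```

With a hypothetical address of 0000:81:00.0, pci_bus_decimal prints 129, which is the number that appears in opap129s.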

While there is a workaround, libpsm2 should address the new, default RDMA device naming scheme. opa_sysfs.c:sysfs_init() looks like the place to start.

mwheinz commented 4 years ago

This change would cause massive problems for much of OPA software, not just the psm2 library, and I do not believe Intel was consulted on this change. I'm not sure how changing device names to vary by each machine they are installed on makes them more "predictable".

bsmith94 commented 4 years ago

This came up again while testing libpsm2 on debian/bullseye. Agreed that the use of the "predictable device name" instead of hfi1_0 would necessitate a lot of change in the other packages.

I have a udev rule that renames the device as hfi1_0, which resolves this issue. Should that rule be packaged with libpsm2? If so, I will submit a pull request.

mwheinz commented 4 years ago

@bsmith94 I'm consulting with my co-workers. I think we all agree that a new udev rule is the preferred route, but I don't think the fix should be in libpsm2, because it would affect more than just psm users. The persistent naming change is also going to impact all the command line utilities, the fabric manager, etc. I'm also a bit concerned that a 60-* prefix might be an issue, since the existing udev rules for psm are 40-psm.rules.

Finally, there's the issue of what impact a new udev rule would have on systems that don't have persistent renaming. It would be hard for us to add the change if it's going to negatively impact the majority of our users.

Right now I see a couple of approaches we could take:

  1. Issue the pull request against the developers of the rdma-persistent-naming feature. (No work for me, so, my favorite.) In fact, the change you suggest seems to be already in the kernel-boot project.
  2. Figure out how to make the rename change part of the existing 40-psm.rules, probably using some mechanism so that it is only installed when needed. (Not as simple, doesn't cover other OPA users.)

Thoughts?

acgoldma commented 4 years ago

@bsmith94, sorry for the delay in responding. We were discussing internally whether to include a fix in our release or to file an issue with the linux-rdma maintainers.

After much discussion, we decided that, since rdma-core does the rename with a user-space tool called through a udev rule, the default udev rule provided by rdma-core needs to change so that it excludes hfi1 from the rename.

I have filed a patch upstream to the maintainers of linux-rdma to change the default rename behavior. I will update further if and when they accept the patch.

bsmith94 commented 4 years ago

Thanks for the update.

raffenet commented 1 month ago

Is there any further update on this issue? We are likely hitting the same problem on our Ubuntu 22.04 machines with the included OPA packages and libfabric.

libfabric:3251:1715881079:ofi_rxd:psm2:core:psmx2_getinfo():523<info>
libfabric:3251:1715881079:ofi_rxd:psm2:core:psmx2_init_prov_info():268<info> TAG60 instance included
libfabric:3251:1715881079:ofi_rxd:psm2:core:psmx2_init_prov_info():281<info> TAG64 instance included
libfabric:3251:1715881079:ofi_rxd:psm2:core:psmx2_init_lib():257<info> PSM2 header version = (2, 2)
libfabric:3251:1715881079:ofi_rxd:psm2:core:psmx2_init_lib():259<info> PSM2 library version = (2, 2)
libfabric:3251:1715881079:ofi_rxd:psm2:core:psmx2_init_lib():262<info> PSM2 multi-ep feature enabled.
libfabric:3251:1715881079:ofi_rxd:psm2:core:psmx2_update_hfi_info():427<info> hfi1 units: total -1, active 0; hfi1 contexts: total 0, free 0
libfabric:3251:1715881079:ofi_rxd:psm2:core:psmx2_update_hfi_info():439<info> Tx/Rx contexts: 0 in total, 0 available.
libfabric:3251:1715881079:ofi_rxd:psm2:core:psmx2_getinfo():536<info> no PSM2 device is active.

$ opainfo

opap134s0:1                        PortGID:0xfe80000000000000:0011750901841a24
   PortState:     Active
   LinkSpeed      Act: 25Gb         En: 25Gb
   LinkWidth      Act: 4            En: 4
   LinkWidthDnGrd ActTx: 4  Rx: 4   En: 3,4
   LCRC           Act: 14-bit       En: 14-bit,16-bit,48-bit       Mgmt: True
   LID: 0x00000004-0x00000004       SM LID: 0x00000004 SL: 0
         QSFP Copper,       1m  FCI Electronics   P/N 10142057-2010LF   Rev C
   Xmit Data:                  0 MB Pkts:                  761
   Recv Data:                  0 MB Pkts:                  908
   Link Quality: 5 (Excellent)
opap59s0:1                         PortGID:0xfe80000000000000:0011750901846a7c
   PortState:     Active
   LinkSpeed      Act: 25Gb         En: 25Gb
   LinkWidth      Act: 4            En: 4
   LinkWidthDnGrd ActTx: 4  Rx: 4   En: 3,4
   LCRC           Act: 14-bit       En: 14-bit,16-bit,48-bit       Mgmt: True
   LID: 0x00000005-0x00000005       SM LID: 0x00000004 SL: 0
         QSFP Copper,       1m  FCI Electronics   P/N 10142057-2010LF   Rev C
   Xmit Data:                  0 MB Pkts:                  175
   Recv Data:                  0 MB Pkts:                  173
   Link Quality: 5 (Excellent)

raffenet commented 1 month ago

If I set export HFI_SYSFS_PATH=/sys/class/infiniband/opap59s0 in my environment, libfabric gets further but still fails to create an endpoint.

libfabric:3721:1715881865::psm2:core:psmx2_fabric():90<info>
libfabric:3721:1715881865::core:core:fi_fabric_():1504<info> Opened fabric: psm2
libfabric:3721:1715881865::psm2:domain:psmx2_domain_open():356<info>
libfabric:3721:1715881865::psm2:core:fi_param_get_():373<info> variable lock_level=<not set>
libfabric:3721:1715881865::psm2:core:psmx2_init_tag_layout():171<info> use tag60: tag_mask: 0FFFFFFFFFFFFFFF, data_mask: FFFFFFFF
libfabric:3721:1715881865::core:core:ofi_shm_map():173<warn> shm_open failed
libfabric:3721:1715881865::psm2:av:psmx2_av_open():1121<warn> failed to map shared AV: FI_NAMED_AV_-1

libfabric:3721:1715881865::psm2:core:psmx2_trx_ctxt_alloc():298<info> uuid: 00FF00FF-0000-0000-0000-00FF00FF00FF
libfabric:3721:1715881865::psm2:core:psmx2_trx_ctxt_alloc():303<info> ep_open_opts: unit=-1 port=0
pmrs-gpu-240-02.3721PSM2 no hfi units are active (err=23)
libfabric:3721:1715881865::psm2:core:psmx2_trx_ctxt_alloc():316<warn> psm2_ep_open returns 23, errno=2

ddalessa commented 1 month ago

This is an issue with the RDMA user library. There is a long, drawn-out argument on the mailing list about this. I will link it at the bottom, but to save you the time:

Create/edit /etc/udev/rules.d/60-rdma-persistent-naming.rules:

ACTION=="add", SUBSYSTEM=="infiniband", KERNEL!="hfi1*", PROGRAM="rdma_rename %k NAME_FALLBACK"

Also, be aware that running psm2 natively is a much better way to run. What you have now is libfabric linked in as a shim on top of libpsm2. Ideally, you would run with OPX, the native Omni-Path provider in libfabric. Let me know if you want help with either of these.

https://yhbt.net/lore/all/20200205192255.GB414821@unreal/T/
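
The udev override above can be scripted. This is a minimal sketch under stated assumptions: the function name write_hfi1_rename_override is hypothetical, and the file name 60-rdma-persistent-naming.rules is chosen to match the rdma-core rule file named earlier in this thread, so that the copy in /etc/udev/rules.d takes precedence over the one in /lib/udev/rules.d. The rule text itself is the one quoted above.

```shell
# write_hfi1_rename_override [RULES_DIR]
# Hypothetical helper: write the rename rule that skips hfi1 devices into
# RULES_DIR (default /etc/udev/rules.d) and print the path written.
write_hfi1_rename_override() {
    rules_dir="${1:-/etc/udev/rules.d}"
    rules_file="$rules_dir/60-rdma-persistent-naming.rules"
    # Quoted heredoc delimiter: %k and the double quotes are written
    # literally, with no shell expansion.
    cat > "$rules_file" <<'EOF'
ACTION=="add", SUBSYSTEM=="infiniband", KERNEL!="hfi1*", PROGRAM="rdma_rename %k NAME_FALLBACK"
EOF
    printf '%s\n' "$rules_file"
}
```

Writing to the default directory needs root. Since the rule matches ACTION=="add", the new name would presumably only take effect once the device is added again, for example after a reboot or a reload of the hfi1 driver.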

raffenet commented 1 month ago

Is the naming fix required for OPX or can we run without it?

raffenet commented 1 month ago

Is the naming fix required for OPX or can we run without it?

OK I just tested and OPX works on the same node without any modifications. This is good to know.

ddalessa commented 1 month ago

I'm gonna direct that question to @charlesshereda or one of his crew.

raffenet commented 1 month ago

Is the naming fix required for OPX or can we run without it?

OK I just tested and OPX works on the same node without any modifications. This is good to know.

I spoke too soon. I can create an OPX endpoint on my machine, but I can't actually communicate from the looks of it.

libfabric:3780:1715891670::opx:fabric:opx_sysfs_port_open():275<warn> Offending file name: /sys/class/infiniband/hfi1_1/ports/1/state
libfabric:3780:1715891670::opx:fabric:opx_hfi_get_port_active():463<warn> Failed to get logical link state for unit 1:1: No such file or directory
libfabric:3780:1715891670::opx:ep_data:fi_opx_init_hfi_lookup():299<warn> No LID found for HFI unit 1 of 2 units: ret = -2, No such file or directory.

charlesshereda commented 1 month ago

I'm a little behind on everything but I'll either take a look at this or have someone else look next week.

lsavers commented 3 weeks ago

Hi @raffenet, have you tried using the udev rule @ddalessa suggested in his update?

Create/edit /etc/udev/rules.d/60-rdma-persistent-naming.rules: ACTION=="add", SUBSYSTEM=="infiniband", KERNEL!="hfi1*", PROGRAM="rdma_rename %k NAME_FALLBACK"

OPX does not support HFI_SYSFS_PATH and is hardcoded to use /sys/class/infiniband/hfi1_x. Is there a requirement that would prevent the udev rule from working for you long term?
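
A quick way to confirm whether the hardcoded paths exist on a node is to check for the port state file that the OPX warning above complains about. This is a rough sketch: check_opx_units is a hypothetical helper, the path pattern is taken from that log line, and checking only port 1 is an assumption.

```shell
# check_opx_units COUNT [CLASS_DIR]
# Hypothetical helper: for units 0..COUNT-1, report whether
# CLASS_DIR/hfi1_<unit>/ports/1/state is readable.
# Returns non-zero if any unit's state file is missing.
check_opx_units() {
    count="$1"
    class_dir="${2:-/sys/class/infiniband}"
    missing=0
    unit=0
    while [ "$unit" -lt "$count" ]; do
        state_file="$class_dir/hfi1_$unit/ports/1/state"
        if [ -r "$state_file" ]; then
            echo "ok      $state_file"
        else
            echo "missing $state_file"
            missing=1
        fi
        unit=$((unit + 1))
    done
    return "$missing"
}
```

For the two-unit node in this thread, check_opx_units 2 would be expected to report both files as missing until the devices are named hfi1_0 and hfi1_1 again.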