cornelisnetworks / opa-psm2


How to compile the CUDA-enabled version without GPU-enabled hfi1 driver? #57

Open RemiLacroix-IDRIS opened 3 years ago

RemiLacroix-IDRIS commented 3 years ago

Hello,

We are currently managing all installations for our cluster on a node which does not have GPU and consequently does not have a GPU-enabled hfi1 driver.

The following code snippet prevents us from building the CUDA-enabled version of PSM2: https://github.com/intel/opa-psm2/blob/7a33bedc4bb3dff4e57c00293a2d70890db4d983/psm_hal_gen1/psm_hal_inline_i.h#L507-L516

Is there any way to work around that? There is a runtime check to ensure the hfi1 driver is actually GPU-enabled; wouldn't that be enough?

Best regards, Rémi

mwheinz commented 3 years ago

If you look in the IFS package, the CUDA binaries should be there. You should be able to find the CUDA versions of the RPMs and, using commands like opascpall and opacmd, install them on the appropriate nodes.

RemiLacroix-IDRIS commented 3 years ago

We are in a context where we would like to build PSM2 instead of installing it from the RPMs.

ToddRimmer commented 3 years ago

There is a runtime check to ensure the hfi1 driver is actually GPU-enabled, wouldn't that be enough?

I wish it were that simple. Unfortunately, the CUDA and NVIDIA code is not upstream. As such, we must develop two versions of the hfi1 driver and build the PSM user space accordingly. Only the CUDA-enabled version of the hfi1 driver contains the APIs and header files used by the CUDA-enabled PSM. So to build the CUDA-enabled PSM, you need to have the CUDA-enabled hfi1 driver installed so its header files are available. As Mike mentions, both packages are available in IFS.

There is an ongoing upstream effort, referred to as “DMAbuf”, which seeks to solve the issue of peer-to-peer DMA without requiring direct driver-to-driver interactions. This mechanism, once accepted and integrated into other vendors' software, can resolve some of these issues.


RemiLacroix-IDRIS commented 3 years ago

That's unfortunate but thanks for the answer.

So to build the cuda enabled PSM, you need to have the cuda enabled hfi1 driver installed so its header files are available.

Wouldn't it be possible to distribute the required headers with PSM and test at runtime that the actual driver has the proper capabilities?

BrendanCunningham commented 3 years ago

That's unfortunate but thanks for the answer.

So to build the cuda enabled PSM, you need to have the cuda enabled hfi1 driver installed so its header files are available.

Wouldn't it be possible to distribute the required headers with PSM and test at runtime that the actual driver has the proper capabilities?

As we (PSM2) do not maintain hfi1 and we wish for PSM2 to build against the hfi1 headers installed on the system, we are not going to include the hfi1 headers with PSM2.

Runtime check

PSM2 does check at runtime whether the loaded hfi1 has matching GPUDirect capabilities: https://github.com/intel/opa-psm2/blob/7a33bedc4bb3dff4e57c00293a2d70890db4d983/psm_context.c#L537-L550
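As a toy illustration of what such a runtime check does (all names and bit positions below are hypothetical, not PSM2's actual capability flags): user space compares the GPUDirect capability it expects against what the loaded driver reports.

```shell
# Toy illustration of the runtime-check idea; every name and bit value
# here is hypothetical, not taken from PSM2 or hfi1.
GPUDIRECT_BIT=$((1 << 5))        # hypothetical capability bit
expected_caps=$GPUDIRECT_BIT     # what a CUDA-enabled user space would require
driver_caps=0                    # what a non-GPU hfi1 driver would report

# Any expected bit the driver does not advertise is a mismatch:
if [ $(( expected_caps & ~driver_caps )) -ne 0 ]; then
  echo "error: loaded hfi1 driver lacks GPUDirect support"
fi
```

The linked psm_context.c code plays this role at context-open time; the point of the discussion above is that such a check can only detect a mismatch at runtime, it cannot substitute for the headers at build time.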

That is, the following combination does not work or is not advisable:

Building on a host that does not have the hfi1 GPUDirect headers

You can get the hfi1 headers needed to build PSM2 with CUDA support (uapi/rdma/hfi/hfi1_{user,ioctl}.h) from the ifs-kernel-updates-devel .rpm found in an IFS tarball (from Intel RDC).

IFS tarballs for most distros should have both CUDA and non-CUDA ifs-kernel-updates-devel .rpms. Right now, the hfi1 headers found in both the CUDA and non-CUDA ifs-kernel-updates-devel .rpms have the required CUDA/GPUDirect definitions.

You can install the ifs-kernel-updates-devel .rpm on your build node (the headers will go under /usr/include/uapi/rdma/hfi). Alternatively, you can extract the .rpm with rpmdev-extract, place the headers where you want, then edit IFS_HFI_HEADERPATH in psm/buildflags.mak to point at the directory containing uapi/rdma/hfi/hfi1_{user,ioctl}.h. I have tried this and it works.
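That extraction route might look like the following sketch. The RPM filename and destination directory are placeholders to adapt to your IFS tarball; IFS_HFI_HEADERPATH is the build variable mentioned above.

```shell
# Sketch, not a definitive recipe: extract the devel RPM's headers into a
# private directory and point the PSM2 build at them.
RPM="ifs-kernel-updates-devel.rpm"   # placeholder: use the RPM from your IFS tarball
DEST="$PWD/hfi1-headers"
mkdir -p "$DEST"

if [ -f "$RPM" ]; then
  # rpmdev-extract works too; rpm2cpio + cpio is the portable equivalent.
  rpm2cpio "$RPM" | (cd "$DEST" && cpio -idm 2>/dev/null)
fi

# The headers land under $DEST/usr/include/uapi/rdma/hfi/, so the build
# wants IFS_HFI_HEADERPATH set to the directory containing uapi/:
echo "IFS_HFI_HEADERPATH=$DEST/usr/include"
```

Building would then be along the lines of `make PSM_CUDA=1 IFS_HFI_HEADERPATH=$DEST/usr/include`, assuming PSM_CUDA=1 is the CUDA build switch in your PSM2 tree.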

Let me know if this helps or if you have any more questions. Thanks.

Brendan

RemiLacroix-IDRIS commented 3 years ago

Just to be sure I understand correctly, this RPM is not installed by default?

BrendanCunningham commented 3 years ago

Just to be sure I understand correctly, this RPM is not installed by default?

No, the IFS 'INSTALL' script should install ifs-kernel-updates-devel.

I am saying that if you did not install IFS on your build node, you can extract the hfi1 headers required to build PSM2 from the ifs-kernel-updates-devel .rpm found in the IFS tarball.

RemiLacroix-IDRIS commented 3 years ago

OK, then I need to double-check what is happening here, because I couldn't find any /usr/include/uapi directory on our nodes, although I am confident that we have IFS installed on them.