[Issue]: hip-runtime-nvidia and rocm-hip-sdk cannot both be installed - EL9

traylenator commented 1 month ago

Problem Description

Starting on RHEL 9 with the yum repository https://repo.radeon.com/rocm/el9/6.2.2/main

As per the instructions for installing on an NVIDIA node, both hip-devel and hip-runtime-nvidia can be installed. This pulls in hipcc-nvidia, which is probably correct.

If you then try to install rocm-hip-sdk, it fails with:

  file /opt/rocm-6.2.2/bin/hipcc.bin from install of hipcc-1.1.1.60202-116.el9.x86_64 conflicts with file from package hipcc-nvidia-1.1.1.60202-116.el9.x86_64
  file /opt/rocm-6.2.2/bin/hipconfig.bin from install of hipcc-1.1.1.60202-116.el9.x86_64 conflicts with file from package hipcc-nvidia-1.1.1.60202-116.el9.x86_64

In particular, rocm-hip-sdk requires rocm-hip-runtime-devel, which in turn requires hipcc, resulting in the conflict with hipcc-nvidia.
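To make the conflict easier to read when many files are involved, the error lines can be condensed to one line per file. The following is only a sketch: `summarize_conflicts` is a hypothetical helper (not part of ROCm, rpm, or dnf), and it assumes the error lines keep the exact wording shown above.

```shell
#!/bin/sh
# Hypothetical helper, not shipped by ROCm or dnf.
# Reads rpm file-conflict lines of the form
#   file PATH from install of PKG_A conflicts with file from package PKG_B
# on standard input and prints "PKG_A <-> PKG_B: PATH" for each one.
summarize_conflicts() {
    awk '$1 == "file" && $3 == "from" && $4 == "install" {
        # $2 = conflicting path, $6 = package being installed,
        # $NF = package that already owns the file
        printf "%s <-> %s: %s\n", $6, $NF, $2
    }'
}
```

It could be used as `dnf install -y rocm-hip-sdk 2>&1 | summarize_conflicts`. The dependency chain itself can be inspected with `dnf repoquery --requires rocm-hip-sdk` and `dnf repoquery --requires rocm-hip-runtime-devel`.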

Operating System

Red Hat Enterprise Linux 9.4

CPU

Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz

GPU

Tesla T4

ROCm Version

ROCm 6.2.2

ROCm Component

HIPCC

Steps to Reproduce

Configure https://repo.radeon.com/rocm/el9/6.2.2/main

dnf install -y hip-runtime-nvidia hip-devel
dnf install -y rocm-hip-sdk
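For completeness, the "Configure" step could look roughly like the sketch below. The base URL is the one given in this report; the file name `rocm.repo`, the repo id `rocm-6.2.2`, and the helper name `write_rocm_repo` are assumptions chosen for illustration, and the GPG key URL is the one AMD's install guide uses as far as I know, so verify it before relying on it.

```shell
#!/bin/sh
# Sketch only: writes a repo definition for the ROCm 6.2.2 EL9 repository.
# The target directory defaults to /etc/yum.repos.d (needs root); pass another
# directory as $1 to try it out without touching the system.
write_rocm_repo() {
    repo_dir="${1:-/etc/yum.repos.d}"
    # File name and repo id below are assumptions, not mandated by ROCm.
    cat > "$repo_dir/rocm.repo" <<'EOF'
[rocm-6.2.2]
name=ROCm 6.2.2
baseurl=https://repo.radeon.com/rocm/el9/6.2.2/main
enabled=1
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
}
```

Running `write_rocm_repo` as root, followed by the two `dnf install` commands above, should reproduce the conflict.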

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

ppanchad-amd commented 1 month ago

Hi @traylenator. Internal ticket has been created to investigate your issue. Thanks!

tcgu-amd commented 1 month ago

@traylenator rocm-hip-sdk is meant for AMD platforms (see our documentation), which is why installing it is likely to conflict with the NVIDIA HIP packages. A potential workaround I can think of is to install these in separate Docker containers. Out of curiosity, would you mind letting us know your use case that requires both to be installed?

Hope this helps.

Thanks!

traylenator commented 1 month ago

@tcgu-amd thanks for the comments.

We are running mostly NVIDIA cards today but wanted to try out HIP on those machines. We are hoping to avoid vendor lock-in and to (cross-)compile for both AMD and NVIDIA hardware platforms from the same build machine at the same time, so that any future hardware migration would hopefully be easier.

tcgu-amd commented 1 month ago

@traylenator Ah, I see. That makes sense. Unfortunately, I don't think it is possible to achieve this on bare metal at the moment. But, as I mentioned, I think it would be possible by installing ROCm in two separate Docker containers, provided the drivers and hardware are configured properly on the host system. There will be some redundancy for sure, but that is unavoidable at the moment because many of the libraries in our runtime are compiled for either NVIDIA or AMD, not both.

Thanks!

traylenator commented 1 month ago

What's the difference between hipcc and hipcc-nvidia?

In early tests, at least, items compiled with hipcc do run on NVIDIA.

Is there some point at which this stops working?

tcgu-amd commented 1 month ago

@traylenator I believe the key difference is that hipcc depends on rocm-llvm, whereas hipcc-nvidia doesn't. The source code of the two versions of hipcc is virtually the same; however, as hipcc is just a Perl wrapper, there might still be discrepancies due to the different backends.

tcgu-amd commented 1 month ago

@traylenator, to follow up, it might be possible to resolve the hipcc conflict by uninstalling hipcc-nvidia and then installing rocm-hip-sdk, since the hipcc installed by rocm-hip-sdk should work for both NVIDIA and AMD runtimes. That being said, it is still strongly recommended to use a containerized environment to avoid further conflicts. Thanks!
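The suggested sequence can be sketched as follows. This is a cautious outline rather than a verified procedure: `run` and `swap_hipcc` are hypothetical helpers, the script only prints the commands unless `DRY_RUN` is set to `0`, and removing hipcc-nvidia may lead dnf to propose removing packages that depend on it, so the removal is deliberately left interactive (no `-y`).

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of ROCm.
# With DRY_RUN unset or set to 1, commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

swap_hipcc() {
    # Remove the NVIDIA-specific hipcc package; review what dnf wants to
    # remove alongside it before confirming.
    run dnf remove hipcc-nvidia
    # Install the SDK, which pulls in the generic hipcc package.
    run dnf install -y rocm-hip-sdk
}

swap_hipcc
```

To apply the change for real, set `DRY_RUN=0` and run the script as root. Afterwards, the backend is typically selected with the `HIP_PLATFORM` environment variable (`amd` or `nvidia`); whether that alone is sufficient for this setup is an assumption that the thread does not confirm.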

traylenator commented 1 month ago

Yes, for sure: you can install hipcc rather than hipcc-nvidia and all is "good". As I said, the results even run on NVIDIA.

This can probably be closed.

Thanks for all the responses, much appreciated.

tcgu-amd commented 1 month ago

@traylenator That's cool to hear! Thanks again for reaching out!
