ROCm / aws-ofi-rccl

Apache License 2.0

Configure not working with default installation #7

Status: Open. MarcelKoch opened this issue 1 year ago

MarcelKoch commented 1 year ago

The ./configure script does not automatically pick up the default ROCm installation. On this system ROCM_PATH is set to /opt/rocm-5.3.0, but the configure step does not use it, although the configure help text led me to expect that it would. Instead I get the error:

checking hip/hip_runtime.h usability... no
checking hip/hip_runtime.h presence... no
checking for hip/hip_runtime.h... no

If I pass --with-hip=$ROCM_PATH/hip, the HIP header check passes, but RCCL is still not configured correctly when I also pass --with-rccl=$ROCM_PATH/rccl. The configure step succeeds in that case, but make fails with:

In file included from nccl_ofi_net.c:15:
In file included from ../include/stack.h:14:
../include/nccl_ofi.h:21:10: fatal error: 'rccl/rccl.h' file not found
#include <rccl/rccl.h>
         ^~~~~~~~~~~~~
1 error generated.
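
For anyone trying to reproduce this, a quick way to narrow it down is to check which candidate prefix actually contains the headers under the paths the build expects (hip/hip_runtime.h and rccl/rccl.h). The following is only a sketch: find_prefix is a hypothetical helper, the fallback to /opt/rocm is an assumption, and it assumes that --with-hip=DIR and --with-rccl=DIR end up adding DIR/include to the include path, which I have not verified against this project's configure script.

```shell
#!/bin/sh
# Hypothetical helper (not part of aws-ofi-rccl):
# find_prefix HEADER DIR... prints the first DIR for which
# DIR/include/HEADER exists, and returns non-zero if none does.
find_prefix() {
    want=$1
    shift
    for candidate in "$@"; do
        if [ -f "$candidate/include/$want" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Assumption: fall back to /opt/rocm when ROCM_PATH is not set.
ROCM_PATH=${ROCM_PATH:-/opt/rocm}

# Headers named in the configure and make errors above.
for h in hip/hip_runtime.h rccl/rccl.h; do
    if prefix=$(find_prefix "$h" "$ROCM_PATH" "$ROCM_PATH/hip" "$ROCM_PATH/rccl"); then
        echo "$h: found under $prefix/include"
    else
        echo "$h: not found under any candidate prefix"
    fi
done
```

If rccl/rccl.h only shows up under $ROCM_PATH and not under $ROCM_PATH/rccl, that would be consistent with the make error above, and passing the reported prefix to --with-rccl might be worth trying.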

MarcelKoch commented 1 year ago

Update: I contacted the support staff for the system I was using, and they were also unable to build this against the default RCCL. They had to install their own RCCL and build against that, which worked. Still, I think it should be possible to build aws-ofi-rccl with the default RCCL install, so I will not close the issue yet.