ROCm / rccl-rdma-sharp-plugins

BSD 3-Clause "New" or "Revised" License

The plugin hasn't been updated for more than a year; can it work with the latest RCCL and UCX 1.15.0? #5

Closed shanleo2024 closed 1 month ago

shanleo2024 commented 12 months ago

I noticed that this plugin hasn't been updated for more than a year. Can it work with the latest RCCL and UCX 1.15.0? I tested the plugin with the latest RCCL and UCX 1.15.0, but it fails when running the rccl_test; UCX reports the following error: ucp_mm.c:855 Assertion `memh->md_map != 0' failed

How can we use UCX with RCCL? Which UCX version works with the current plugin and RCCL version? Thanks a lot!

ppanchad-amd commented 1 month ago

@shanleo2024 Apologies for the lack of response. Internal ticket has been created to investigate this issue. Thanks!

schung-amd commented 1 month ago

Hi @shanleo2024, are you still experiencing this issue?

shanleo2024 commented 1 month ago

Hi @ppanchad-amd @schung-amd, we have found the root cause of this issue:

#if UCP_API_VERSION >= UCP_VERSION(1, 10)
  mh->mem_type = (type == NCCL_PTR_HOST)? UCS_MEMORY_TYPE_HOST: UCS_MEMORY_TYPE_CUDA;
  mmap_params.field_mask  |= UCP_MEM_MAP_PARAM_FIELD_MEMORY_TYPE;
  mmap_params.memory_type = mh->mem_type;
#endif

For non-host buffers, mh->mem_type is set to UCS_MEMORY_TYPE_CUDA, not UCS_MEMORY_TYPE_ROCM, so UCX triggers a hang error. Changing UCS_MEMORY_TYPE_CUDA to UCS_MEMORY_TYPE_ROCM lets the plugin work with UCX 1.17. Please test it, and if it works, feel free to close this issue. Thanks.
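The change described above can be sketched as a small selection helper. This is only an illustration, not the plugin's actual code: every sketch_/SKETCH_ name below is a hypothetical stand-in so that the snippet compiles without the UCX or RCCL headers, and the built_for_rocm flag is an assumed way of telling the two builds apart. In the plugin, the real values are UCS_MEMORY_TYPE_HOST, UCS_MEMORY_TYPE_CUDA, UCS_MEMORY_TYPE_ROCM and NCCL_PTR_HOST.

```c
/* Stand-in definitions so the sketch compiles without UCX or RCCL headers.
 * All sketch_/SKETCH_ names are hypothetical; the plugin itself uses
 * ucs_memory_type_t (UCS_MEMORY_TYPE_HOST/CUDA/ROCM) and NCCL_PTR_HOST. */
typedef enum {
    SKETCH_MEMORY_TYPE_HOST,
    SKETCH_MEMORY_TYPE_CUDA,
    SKETCH_MEMORY_TYPE_ROCM
} sketch_memory_type_t;

/* Pointer kinds passed by the collective library (stand-ins). */
enum { SKETCH_PTR_HOST = 1, SKETCH_PTR_DEVICE = 2 };

/* Pick the memory type to report when registering a buffer.
 * Host buffers are always reported as host memory; device buffers are
 * reported as ROCm memory on a ROCm build instead of the hard-coded CUDA. */
static sketch_memory_type_t sketch_select_mem_type(int ptr_type, int built_for_rocm)
{
    if (ptr_type == SKETCH_PTR_HOST) {
        return SKETCH_MEMORY_TYPE_HOST;
    }
    return built_for_rocm ? SKETCH_MEMORY_TYPE_ROCM : SKETCH_MEMORY_TYPE_CUDA;
}
```

In the plugin, the equivalent result would be assigned to mh->mem_type and mmap_params.memory_type in place of the hard-coded UCS_MEMORY_TYPE_CUDA.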

schung-amd commented 1 month ago

Thanks for the patch! It may be helpful if other users still wish to use this plugin. I've spoken to the internal team and it looks like we might be deprecating this repo, but if we decide to continue support for it we'll look into your fix.