shanleo2024 closed this issue 1 month ago
@shanleo2024 Apologies for the lack of response. Internal ticket has been created to investigate this issue. Thanks!
Hi @shanleo2024, are you still experiencing this issue?
Hi @ppanchad-amd @schung-amd, we have found the root cause of this issue:
#if UCP_API_VERSION >= UCP_VERSION(1, 10)
mh->mem_type = (type == NCCL_PTR_HOST) ? UCS_MEMORY_TYPE_HOST : UCS_MEMORY_TYPE_CUDA;
mmap_params.field_mask |= UCP_MEM_MAP_PARAM_FIELD_MEMORY_TYPE;
mmap_params.memory_type = mh->mem_type;
#endif
For device pointers, mh->mem_type is set to UCS_MEMORY_TYPE_CUDA rather than UCS_MEMORY_TYPE_ROCM, so UCX triggers a hang. Changing UCS_MEMORY_TYPE_CUDA to UCS_MEMORY_TYPE_ROCM lets the plugin work with UCX 1.17. Please test it, and if it works, feel free to close this issue. Thanks.
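For anyone who wants to see the selection logic in isolation, here is a minimal sketch of the idea behind the change. It is not the plugin's actual code: the enum values and the select_mem_type helper are hypothetical stand-ins so that the snippet compiles without the UCX or RCCL headers, and the is_rocm_build flag is an assumption about how a build could distinguish the two platforms. In the real plugin the constants are UCS_MEMORY_TYPE_HOST, UCS_MEMORY_TYPE_CUDA, UCS_MEMORY_TYPE_ROCM and NCCL_PTR_HOST.

```c
/* Stand-ins for the UCX memory types (hypothetical names, not the UCX enum). */
typedef enum {
    MEM_TYPE_HOST,  /* stands in for UCS_MEMORY_TYPE_HOST */
    MEM_TYPE_CUDA,  /* stands in for UCS_MEMORY_TYPE_CUDA */
    MEM_TYPE_ROCM   /* stands in for UCS_MEMORY_TYPE_ROCM */
} mem_type_t;

/* Stand-ins for the pointer types passed to the plugin (hypothetical names). */
enum {
    PTR_HOST   = 1, /* stands in for NCCL_PTR_HOST */
    PTR_DEVICE = 2  /* any non-host (GPU) pointer */
};

/*
 * Hypothetical helper: choose the memory type to report when registering
 * a buffer. Host pointers always map to host memory; device pointers map
 * to ROCm memory on an AMD build and to CUDA memory otherwise.
 */
static mem_type_t select_mem_type(int ptr_type, int is_rocm_build)
{
    if (ptr_type == PTR_HOST) {
        return MEM_TYPE_HOST;
    }
    return is_rocm_build ? MEM_TYPE_ROCM : MEM_TYPE_CUDA;
}
```

With this shape, a device pointer on a ROCm build is reported as ROCm memory instead of CUDA memory, which is the behavior the one-line change above is meant to produce. A real build would more likely make this decision at compile time than through a runtime flag.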
Thanks for the patch! It may be helpful if other users still wish to use this plugin. I've spoken to the internal team and it looks like we might be deprecating this repo, but if we decide to continue support for it we'll look into your fix.
I noticed that the plugin hasn't been updated for more than a year. Can it work with the latest RCCL and UCX 1.15.0? I tested the plugin with the latest RCCL and UCX 1.15.0, but it fails when running the rccl_test, and UCX reports an error like this: ucp_mm.c:855 Assertion `memh->md_map != 0' failed
How can we use UCX with RCCL? Which UCX version works with the current plugin and RCCL version? Thanks a lot!