Closed: hongbilu closed this 1 year ago
Hi @hongbilu,
GDRCopy is being used in UCX (https://github.com/openucx/ucx). As some MPI implementations (OpenMPI, etc.) use UCX, you can say that GDRCopy is used there too.
GDRCopy excels at small transfers; for large transfers, cudaMemcpy is better. Usually, you map only the buffers that need small accesses. If you need flexibility, you may map all buffers, but doing so consumes GPU BAR1 space. You may want to check whether your GPU's BAR1 space is large enough; it generally is on data center GPUs.
Is it possible for upper-layer applications (e.g. xNN libraries, frameworks, and so on) to use it?
Yes. Some HPC applications also use GDRCopy directly, especially when small-copy performance matters or when they cannot use cudaMemcpy, such as inside CUDA Graph host nodes.
Thanks for the reply. May I ask whether they report that the API is easy to use? They would probably pin and map during the init phase, and then check the copy size to decide which API to use for each copy. I think that means a non-trivial development cost for them.
"Easy" is subjective. But I agree that it will not be as simple as just calling cudaMalloc, gdr_pin, and gdr_map.
Pin and map are expensive. Usually, applications would not want to call these APIs in their critical paths.
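To make the pattern discussed here concrete (register buffers once at init, then pick a copy path per call based on size), here is a minimal host-only sketch in C. It is not GDRCopy code: the names `copy_ctx_t`, `ctx_register`, `ctx_copy`, and the `SMALL_COPY_LIMIT` cutoff are hypothetical, and both copy paths are simulated with plain `memcpy` so the dispatch logic can run without a GPU. In a real application, the registration step is where the expensive pin and map calls would go, the small path would copy through the GDRCopy mapping, and the bulk path would call cudaMemcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_REGIONS      16
#define SMALL_COPY_LIMIT 4096   /* hypothetical cutoff; measure on your own system */

typedef enum { PATH_MAPPED = 0, PATH_BULK = 1 } copy_path_t;

/* One buffer that was registered (pinned and mapped, in real code) at init. */
typedef struct {
    uintptr_t base;
    size_t    size;
} region_t;

typedef struct {
    region_t regions[MAX_REGIONS];
    size_t   count;
    size_t   calls[2];          /* number of copies taken on each path */
} copy_ctx_t;

/* Init phase: record a buffer. Real code would do the expensive pin and map
 * here, once, so the critical path never has to. Returns 0, or -1 if full. */
static int ctx_register(copy_ctx_t *ctx, void *base, size_t size)
{
    assert(ctx != NULL && base != NULL && size > 0);
    if (ctx->count == MAX_REGIONS)
        return -1;
    ctx->regions[ctx->count].base = (uintptr_t)base;
    ctx->regions[ctx->count].size = size;
    ctx->count++;
    return 0;
}

/* Returns 1 if [dst, dst + n) lies entirely inside a registered buffer. */
static int ctx_is_mapped(const copy_ctx_t *ctx, const void *dst, size_t n)
{
    uintptr_t p = (uintptr_t)dst;
    for (size_t i = 0; i < ctx->count; i++) {
        const region_t *r = &ctx->regions[i];
        if (p >= r->base && n <= r->size && p - r->base <= r->size - n)
            return 1;
    }
    return 0;
}

/* Critical path: small copies into registered buffers take the mapped path,
 * everything else falls back to the bulk path. Both paths are memcpy
 * stand-ins here; only the choice of path is the point of the sketch. */
static copy_path_t ctx_copy(copy_ctx_t *ctx, void *dst, const void *src, size_t n)
{
    copy_path_t path = (n <= SMALL_COPY_LIMIT && ctx_is_mapped(ctx, dst, n))
                           ? PATH_MAPPED
                           : PATH_BULK;
    memcpy(dst, src, n);
    ctx->calls[path]++;
    return path;
}
```

With a wrapper like this, the rest of the application calls a single copy function everywhere. Buffers that were never registered still work, they just always take the bulk path.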
Application code always wants to stay the same everywhere; that's the reality.
Hi there. May I ask, besides NCCL and NVSHMEM, what else is GDRCopy used for? In theory, it could be used anywhere that H2D/D2H copies involve small data sizes. In reality, though, GDRCopy may not be able to replace cudaMemcpy directly, because its API (open, pin, map) has to be called during the init phase, and an application cannot do that for all CUDA memory at init. What do you think?