NVIDIA / gdrcopy

A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology
MIT License

question: what else is gdrcopy used for? #259

Closed (hongbilu closed this 1 year ago)

hongbilu commented 1 year ago

Hi there. May I ask: besides NCCL and NVSHMEM, what else is gdrcopy used for? In theory it fits anywhere that needs H2D/D2H copies of small data. In practice, though, gdrcopy cannot directly replace cudaMemcpy, because its API (open, pin, map) has to be called during the init phase, and an application cannot pin and map all of its CUDA memory at init. What do you think?

pakmarkthub commented 1 year ago

Hi @hongbilu,

GDRCopy is being used in UCX (https://github.com/openucx/ucx). As some MPI implementations (OpenMPI, etc.) use UCX, you can say that GDRCopy is used there too.

GDRCopy excels at small data. For large data, cudaMemcpy is better. Usually, you map only the buffers that need small accesses. If you need flexibility, you can map all buffers, but doing so consumes GPU BAR1 space. You may want to check whether your GPU's BAR1 space is large enough; it generally is on data center GPUs.
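
To make the BAR1 point concrete, here is a rough sketch in C of how one might estimate the BAR1 space a set of mappings would consume before deciding to map everything. It assumes mappings are made at 64 KiB GPU-page granularity (the value of `GPU_PAGE_SIZE` in gdrcopy's `gdrapi.h`, worth verifying against your version). The names `bar1_footprint` and `fits_in_bar1` are hypothetical helpers, not part of GDRCopy, and the free BAR1 figure has to come from elsewhere, for example the BAR1 memory usage section of `nvidia-smi -q`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: mappings use 64 KiB GPU pages, matching GPU_PAGE_SIZE in
 * gdrcopy's gdrapi.h. Check the header of the version you build against. */
#define SKETCH_GPU_PAGE_SIZE ((uint64_t)64 * 1024)

/* Hypothetical helper: bytes of BAR1 needed to map [addr, addr + size).
 * The range is widened to whole GPU pages at both ends. */
uint64_t bar1_footprint(uint64_t addr, uint64_t size)
{
    const uint64_t mask = SKETCH_GPU_PAGE_SIZE - 1;
    if (size == 0)
        return 0;
    uint64_t start = addr & ~mask;               /* round start down */
    uint64_t end = (addr + size + mask) & ~mask; /* round end up */
    return end - start;
}

/* Hypothetical helper: returns 1 if mapping all n buffers stays within
 * bar1_free bytes, 0 otherwise. */
int fits_in_bar1(const uint64_t *addrs, const uint64_t *sizes, size_t n,
                 uint64_t bar1_free)
{
    assert(n == 0 || (addrs != NULL && sizes != NULL));
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += bar1_footprint(addrs[i], sizes[i]);
        if (total > bar1_free)
            return 0;
    }
    return 1;
}
```

Under this assumption even a 4-byte flag costs a full 64 KiB of BAR1, which is why mapping many tiny buffers can add up faster than their raw sizes suggest.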

hongbilu commented 1 year ago

Can it also be used by higher-level applications, e.g. xNN libraries, frameworks, and so on?

pakmarkthub commented 1 year ago

Yes. Some HPC applications also use GDRCopy directly, especially when small copies matter or when they cannot use cudaMemcpy, such as inside CUDA Graph host nodes.

hongbilu commented 1 year ago

Thanks for the reply. May I ask whether those users report that the API is easy to use? They presumably have to pin and map during the init phase, then check the copy size on every call to decide which API to use for the copy. I think that adds a non-trivial development cost for them.

pakmarkthub commented 1 year ago

"Easy" is subjective. But I agree that it will not be as simple as just calling cudaMalloc, gdr_pin, and gdr_map.

Pin and map are expensive. Usually, applications would not want to call these APIs in their critical paths.
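
One way to picture the pattern discussed in this thread (pin and map once at init, then choose a copy path per call based on size) is the C sketch below. Everything in it is illustrative: `h2d_buffer`, `h2d_policy`, `choose_path`, and `h2d_copy` are hypothetical names, the backends are plain function pointers that a real program would point at wrappers around `gdr_copy_to_mapping()` and `cudaMemcpy()`, and the threshold is a tuning knob to be found by benchmarking, not a value from GDRCopy. The `host_memcpy_backend` stand-in exists only so the dispatch logic can be dry-run without a GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-buffer state, filled in once during init. */
typedef struct {
    void *dev_ptr;    /* device address, e.g. from cudaMalloc */
    void *mapped_ptr; /* CPU mapping from init-time pin + map, or NULL */
} h2d_buffer;

/* Hypothetical backend signature: returns 0 on success. */
typedef int (*copy_fn)(void *dst, const void *src, size_t n);

typedef struct {
    copy_fn mapped_copy; /* would wrap gdr_copy_to_mapping() */
    copy_fn bulk_copy;   /* would wrap cudaMemcpy() */
    size_t threshold;    /* largest size sent down the mapped path */
} h2d_policy;

typedef enum { PATH_MAPPED, PATH_BULK } copy_path;

/* Small copies go through the mapping if the buffer has one. */
copy_path choose_path(const h2d_buffer *b, size_t n, size_t threshold)
{
    return (b->mapped_ptr != NULL && n <= threshold) ? PATH_MAPPED
                                                     : PATH_BULK;
}

/* Copy n bytes from host src into the buffer at the given byte offset.
 * No pin or map happens here; that cost was paid at init. */
int h2d_copy(const h2d_policy *p, const h2d_buffer *b, size_t offset,
             const void *src, size_t n)
{
    assert(p != NULL && b != NULL && p->bulk_copy != NULL);
    if (n == 0)
        return 0;
    if (p->mapped_copy != NULL &&
        choose_path(b, n, p->threshold) == PATH_MAPPED)
        return p->mapped_copy((char *)b->mapped_ptr + offset, src, n);
    return p->bulk_copy((char *)b->dev_ptr + offset, src, n);
}

/* Host-only stand-in backend for dry runs without a GPU. */
int host_memcpy_backend(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
    return 0;
}
```

With a wrapper like this, call sites stay uniform: buffers that were never mapped, or copies above the threshold, fall back to the bulk path, so the size check lives in one place instead of being repeated throughout the application.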

hongbilu commented 1 year ago

Application code always wants to stay the same everywhere; that's the reality.