NVIDIA / gdrcopy

A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology
MIT License

How to effectively test if gdrcopy is enabled using a real-world ML workload? #292

Open pandyamarut opened 9 months ago

pandyamarut commented 9 months ago

I have successfully installed gdrcopy on my host and completed its tests. Afterwards, I launched a container running my language model application, with a focus on profiling the loading of the model from the local disk. I am looking for methods to confirm whether gdrcopy is active when my application is running. Since I am new to this, I would appreciate any guidance on how to verify the operation of gdrcopy in this context.

pakmarkthub commented 9 months ago

Hi @pandyamarut, based on your question, my guess is that your application does not use GDRCopy directly. You probably want to confirm that a library (e.g., UCX or NCCL) is properly using GDRCopy. One way to do so is to export the environment variables below and rerun your application. If GDRCopy is used, you should see some output lines from GDRCopy.

export GDRCOPY_ENABLE_LOGGING=1
export GDRCOPY_LOG_LEVEL=1
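
As a rough sketch of the suggestion above, the two variables can be set for a single run and the application's output saved to a file so it can be searched afterwards. The helper name `run_with_gdrcopy_logging`, the log file name, and the `python my_app.py` command are hypothetical. The thread does not show what the GDRCopy log lines look like, so the search string `gdr` is only a guess.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GDRCopy): run a command with the two
# logging variables from this thread set for that command only, and save
# its stdout and stderr to a log file.
run_with_gdrcopy_logging() {
    local log_file="$1"
    shift
    GDRCOPY_ENABLE_LOGGING=1 GDRCOPY_LOG_LEVEL=1 "$@" >"$log_file" 2>&1
}

# Example usage (placeholder command and file name):
#   run_with_gdrcopy_logging app.log python my_app.py
#   grep -i 'gdr' app.log   # assumption: GDRCopy lines contain "gdr"
```

Since the application runs in a container, the variables need to be visible inside the container, not only on the host. With Docker, for example, they can be passed with `-e GDRCOPY_ENABLE_LOGGING=1 -e GDRCOPY_LOG_LEVEL=1`. An empty search result would not by itself prove that GDRCopy is unused, because the assumed search string may simply not match the real log format.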

dhayanesh commented 7 months ago

@pandyamarut were you able to verify whether your application is utilizing it?