NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

CatArrayBatchedCopy can't overlap with AllGather #1436

Open JuiceLemonLemon opened 2 months ago

JuiceLemonLemon commented 2 months ago

Hello, I have a question about AllGather overlap.

I am using Hugging Face training code. During the backward pass of training, I found that the AllGather kernel does not overlap with the CatArrayBatchedCopy kernel, and I don't know why.

stream20 AllGather ---------------------------ReduceScatter ---------------------------AllGather
stream24 ----------- CatArrayBatchedCopy---------------------------------------------------------CatArrayBatchedCopy

I expected the CatArrayBatchedCopy kernel to be able to overlap with AllGather on the GPU, but it does not. Do you have any idea why?

chenhongyu2048 commented 2 months ago

perhaps try CUDA_DEVICE_MAX_CONNECTIONS=1
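
For reference, a minimal sketch of one way to apply this setting from inside a Python training script. It assumes the variable is read when the CUDA context is created, so it has to be set before any CUDA work starts; setting it before importing the framework is the conservative choice. The helper name `set_max_connections` is hypothetical and only for illustration.

```python
import os


def set_max_connections(value: int = 1) -> str:
    """Set CUDA_DEVICE_MAX_CONNECTIONS for the current process.

    Assumption: the CUDA runtime reads this variable at initialization,
    so this must run before the first CUDA call in the process.
    """
    os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = str(value)
    return os.environ["CUDA_DEVICE_MAX_CONNECTIONS"]


# Call this at the very top of the training script.
set_max_connections(1)

# Only import the GPU framework after the variable is set, for example:
# import torch
```

Exporting the variable in the shell that launches the job (`CUDA_DEVICE_MAX_CONNECTIONS=1 python train.py`, where `train.py` stands in for the actual script) should have the same effect and also covers worker processes that inherit the environment.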

JuiceLemonLemon commented 2 months ago

> perhaps try CUDA_DEVICE_MAX_CONNECTIONS=1

Thank you for your reply. I tried this, but it didn't work.