Open JuiceLemonLemon opened 2 months ago
Hello, I have a question about AllGather overlap.
I am using the Hugging Face training code. During the backward pass, I found that the AllGather kernel does not overlap with the CatArrayBatchedCopy kernel, and I don't know why.
stream20 AllGather ---------------------------ReduceScatter ---------------------------AllGather
stream24 ----------- CatArrayBatchedCopy---------------------------------------------------------CatArrayBatchedCopy
I think the CatArrayBatchedCopy kernel should be able to overlap with AllGather on the GPU, but it doesn't. Do you have any idea why?
perhaps try CUDA_DEVICE_MAX_CONNECTIONS=1
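If it helps, here is a minimal sketch of setting the variable from inside a Python training script instead of the shell. As far as I know, the value is read when the CUDA context is created, so it has to be set before the process first touches the GPU. The helper name `set_cuda_max_connections` is made up for illustration and is not part of any library.

```python
import os


def set_cuda_max_connections(value: int = 1) -> str:
    """Set CUDA_DEVICE_MAX_CONNECTIONS for this process and return the stored string.

    Hypothetical helper for illustration. Call it before anything
    initializes CUDA (for example, before the first CUDA call in torch).
    """
    os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = str(value)
    return os.environ["CUDA_DEVICE_MAX_CONNECTIONS"]


# Set the variable at the very top of the training script.
set_cuda_max_connections(1)

# Assumption: frameworks are imported and CUDA is initialized only after this point.
# import torch
```

With a multi-process launcher, every rank needs the variable, so exporting it in the shell before launching is usually the simpler route.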
Thank you for your reply. I tried this, but it did not help: the kernels still do not overlap.