Closed: JingyuQian closed this issue 4 months ago.
Looks like a version or configuration mismatch between two ranks. I'd advise trying again with 2.22 first. Then check the log for environment variables that are set on one rank but not the other.
If that doesn't turn up anything, you may post the full log here and we can take a look.
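For example, a quick way to compare NCCL-related environment variables across ranks is to have every rank print them at startup and diff the output. This is just an illustrative helper (the function name and structure are my own, assuming torch.distributed is already initialized):

```python
import os
import torch.distributed as dist

def dump_nccl_env():
    """Print all NCCL_* environment variables on this rank so the
    output can be compared across ranks/nodes."""
    rank = dist.get_rank()
    nccl_vars = {k: v for k, v in os.environ.items() if k.startswith("NCCL_")}
    print(f"[rank {rank}] NCCL env: {nccl_vars}", flush=True)
```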
Probably something with my code. Haven't run into that since. I'll close it for now.
Hello,
I'm running into a problem with NCCL. Similar problems have been posted (like #626) and I've tried the suggestions there, but haven't found a solution so far.
Environment
Context
I tried multi-node training with model A and it worked fine. Then I tried the same setup with model B (same repo, different config) and hit this error.
It looks like opCount c is where the error starts to appear. I also tried NCCL_PROTO=SIMPLE, and then the program raises torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate more than 1EB memory. After looking into it, the master broadcasts a single-bool tensor of size 4 (Tensor([4], device='cuda')), but the broadcast received by the worker appears to be very large, and when the worker uses that value to initialize a tensor on the GPU, the error is raised (see the sketch after the logs below).

Rank 0 log
Rank 1 log
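Since the worker seems to allocate a buffer whose size comes straight from a broadcast value, here is a minimal sketch of the kind of pattern I suspect is involved. The function name and structure are my own guess for illustration, not the actual repo or library code; it assumes torch.distributed is initialized with the NCCL backend and that the source rank passes a CUDA tensor:

```python
import torch
import torch.distributed as dist

def broadcast_payload(payload=None, src=0):
    """Broadcast a tensor whose size is only known on the source rank."""
    rank = dist.get_rank()
    size = torch.empty(1, dtype=torch.long, device="cuda")
    if rank == src:
        size[0] = payload.numel()
    # Step 1: all ranks agree on the payload size.
    dist.broadcast(size, src=src)
    if rank != src:
        # Step 2: the worker allocates a receive buffer from the broadcast size.
        # If the ranks' collective calls are out of sync, the value read here can
        # be garbage, producing a "Tried to allocate more than 1EB memory" error.
        payload = torch.empty(int(size.item()), dtype=torch.uint8, device="cuda")
    # Step 3: broadcast the actual data into the (possibly huge) buffer.
    dist.broadcast(payload, src=src)
    return payload
```

If the two ranks ever disagree on which collective they are executing (for example, one rank skips a broadcast the other performs), the worker would interpret unrelated bytes as the size in step 2, which could explain the absurd allocation request.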
Any insight would be helpful. Thank you.