Open XHANYAO opened 1 year ago
Is it possible to initialize without having to obtain the uniqueID again?
It is not possible for now. Actually, the uniqueID is not tied to any communicator, so ncclCommDestroy should not touch anything related to the uniqueID.
We make it separate for each comm init so that we can reclaim everything from uniqueID when the init is done.
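As a rough analogy for the behaviour described above (plain Python sockets, not NCCL code), the sketch below assumes that the uniqueID essentially carries the address of a root listening socket, and that this listener is closed once one init has completed. The names `make_unique_id`, `init_rank`, and `demo` are hypothetical and exist only for illustration.

```python
import socket
import threading


def make_unique_id():
    """Hypothetical stand-in for ncclGetUniqueId: start a one-shot root listener."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # port 0: the OS picks a free port
    listener.listen(1)
    address = listener.getsockname()  # this (ip, port) pair plays the role of the uniqueID

    def root():
        # Serve exactly one init, then reclaim the listening socket.
        conn, _ = listener.accept()
        conn.close()
        listener.close()

    thread = threading.Thread(target=root)
    thread.start()
    return address, thread


def init_rank(address):
    """Hypothetical stand-in for the bootstrap connect step of a comm init."""
    try:
        with socket.create_connection(address, timeout=2):
            return True
    except OSError:
        # Nobody is listening on that address any more (or the connect timed out).
        return False


def demo():
    uid, root_thread = make_unique_id()
    first = init_rank(uid)
    root_thread.join()            # the root has closed its listener by now
    second = init_rank(uid)       # same ID again: the listener is gone

    fresh_uid, fresh_thread = make_unique_id()
    third = init_rank(fresh_uid)  # new ID, new listener
    fresh_thread.join()
    return first, second, third
```

In this sketch the second `init_rank` call fails not because the port is still occupied, but because nothing is listening on it any more; only a fresh ID brings up a new listener. If NCCL retries a refused connection, that could look like a hang in the connect step, though this is an assumption and not something the sketch verifies.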
Thank you very much for your reply. I see. You're right, I really can't find any use of uniqueID in ncclCommDestroy. But there's still something I don't understand. Could you be more specific about this:
"We make it separate for each comm init so that we can reclaim everything from uniqueID when the init is done."
In particular, could you tell me more about what "reclaim everything from uniqueID" means?
I also don't understand why the debug log shows the error in socketconnect, or why the error occurs there. Does this mean that even after ncclSocketClose and ncclCommDestroy have been executed, this port on this IP is still occupied?
As the title indicates, I found that for the program to work properly, I had to re-obtain the uniqueID each time I ran ncclCommInitRank. I only learned about NCCL a month ago, so I'm new here. I was recently experimenting with NCCL in a setup with a single GPU and one NIC, and I ran into a similar issue when I tried to use two communicators in that setting. The code follows:
As you can see, I attempted the second initialization with comm1 and s1 after finishing the first initialization with comm and s and releasing the resources with ncclCommDestroy. I also tried changing the code as follows to test the problem, using just the same comm:
The problem is the same. The DEBUG log is here:
The debug log shows that the program is stuck in bootstrapinit during the second initialization, specifically in socketconnect. This is where the issue appears. It seems to be using the same port, so I'm not sure why it fails. I apologize for asking what may be a very basic question. After obtaining a new uniqueID, I can confirm that everything works and that the port number has changed. However, I am confused about why obtaining a new uniqueID is required when ncclCommDestroy should suffice to release these resources. Is it possible to initialize without having to obtain the uniqueID again?
Finally, if there are any other issues with the code, please ignore them so we can focus on this one.