[Bugfix][NCCL] Release NCCL thread_local resources in destructor

Prior to this commit, allocations performed by ncclCommInitRank had no corresponding call to ncclCommDestroy. Although CCLThreadLocalContext::Clear does call ncclCommDestroy, nothing ever called the Clear method. On worker processes, the missing ncclCommDestroy call typically had little effect: any destruction would have occurred shortly before the process exited, and the OS reclaims the resources when the process terminates.

However, worker0 of a Disco session is a separate thread, rather than a separate process. While this allows it to easily receive data from the controller thread, resources allocated by worker0 are not reclaimed by the OS until the entire process terminates. As a result, the CCLThreadLocalContext leaked GPU memory, because the resources allocated by the ncclCommInitRank call at the start of each tvm.runtime.disco.ProcessSession were never released. GPU memory usage grew by about 1 gigabyte for each ProcessSession.

This commit updates CCLThreadLocalContext to have a destructor that calls the Clear method. For worker0, this is called when the thread is joined to the main thread.
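
The pattern can be illustrated with a minimal, self-contained sketch. This is not the TVM source: `ThreadLocalContextSketch`, `g_live_comms`, and `LiveCommsAfterWorkerThread` are hypothetical names, and a plain counter stands in for the ncclCommInitRank / ncclCommDestroy pair so that the example runs without NCCL or a GPU.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for the number of live NCCL communicators.
// Incrementing models ncclCommInitRank; decrementing models ncclCommDestroy.
static std::atomic<int> g_live_comms{0};

// Hypothetical analogue of CCLThreadLocalContext, reduced to the lifetime logic.
struct ThreadLocalContextSketch {
  bool initialized = false;

  void Init() {
    if (!initialized) {
      ++g_live_comms;  // models the allocation done at session start
      initialized = true;
    }
  }

  // Idempotent release, safe to call explicitly and again from the destructor.
  void Clear() {
    if (initialized) {
      --g_live_comms;  // models releasing the communicator
      initialized = false;
    }
  }

  // The fix described above: the destructor calls Clear, so the resource is
  // released when the owning thread exits rather than when the process exits.
  ~ThreadLocalContextSketch() { Clear(); }

  static ThreadLocalContextSketch* Get() {
    thread_local ThreadLocalContextSketch ctx;
    return &ctx;
  }
};

// Runs a worker as a thread (like worker0) and reports how many
// communicators are still live after the thread has been joined.
int LiveCommsAfterWorkerThread() {
  std::thread worker([] {
    ThreadLocalContextSketch::Get()->Init();
    assert(g_live_comms.load() == 1);  // resource is held while the thread runs
  });
  worker.join();  // thread_local destructors have run by the time join returns
  return g_live_comms.load();
}
```

With the destructor in place, the count returns to zero after each join. Without it, each call would leave one more live entry behind, which mirrors the per-ProcessSession growth described above.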

[Bugfix][NCCL] Release NCCL thread_local resources in destructor #17078