apache / tvm

Open deep learning compiler stack for cpu, gpu and specialized accelerators
https://tvm.apache.org/
Apache License 2.0

[Bugfix][NCCL] Release NCCL thread_local resources in destructor #17078

Closed Lunderberg closed 4 weeks ago

Lunderberg commented 1 month ago

Prior to this commit, allocations performed by ncclCommInitRank had no corresponding call to ncclCommDestroy. Although the CCLThreadLocalContext::Clear method does call ncclCommDestroy, nothing ever called Clear. On worker processes, the missing ncclCommDestroy call typically had little effect: any cleanup would have occurred shortly before the process exited anyway, and the OS reclaims the resources when the process terminates.

However, worker0 of a Disco session is a separate thread, rather than a separate process. While this allows it to easily receive data from the controller thread, resources allocated by worker0 are not reclaimed by the OS until the entire process terminates. As a result, the CCLThreadLocalContext leaked GPU memory, because the resources allocated by the ncclCommInitRank call at the start of each tvm.runtime.disco.ProcessSession were never released. GPU memory usage grew by about 1 gigabyte for each ProcessSession.

This commit updates CCLThreadLocalContext to have a destructor that calls the Clear method. For worker0, the destructor runs when the worker thread is joined to the main thread.
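
The pattern described above can be sketched in isolation. This is a minimal illustration, not the TVM implementation: `FakeComm`, `FakeCommInit`, `FakeCommDestroy`, `ThreadLocalContext`, and `LiveCommsAfterWorker` are hypothetical names, and a counter stands in for the GPU resources that ncclCommInitRank and ncclCommDestroy would manage. It assumes only standard C++ behavior: the destructor of a `thread_local` object runs when its owning thread exits.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for an NCCL communicator. The counter tracks how many
// communicators are currently alive, in place of real GPU allocations.
std::atomic<int> g_live_comms{0};

struct FakeComm {};

// Stands in for ncclCommInitRank (assumption: the real call allocates GPU memory).
FakeComm* FakeCommInit() {
  ++g_live_comms;
  return new FakeComm();
}

// Stands in for ncclCommDestroy.
void FakeCommDestroy(FakeComm* comm) {
  delete comm;
  --g_live_comms;
}

// Hypothetical analogue of a thread-local communication context.
struct ThreadLocalContext {
  FakeComm* comm = nullptr;

  void Init() {
    if (comm == nullptr) comm = FakeCommInit();
  }

  // Releases the communicator, if any. Safe to call more than once.
  void Clear() {
    if (comm != nullptr) {
      FakeCommDestroy(comm);
      comm = nullptr;
    }
  }

  // Without this destructor, Clear() is never invoked and the communicator
  // outlives the worker thread for as long as the process runs.
  ~ThreadLocalContext() { Clear(); }

  static ThreadLocalContext* Get() {
    thread_local ThreadLocalContext ctx;
    return &ctx;
  }
};

// Runs a worker thread that initializes its thread-local context, joins it,
// and reports how many communicators are still alive afterwards.
int LiveCommsAfterWorker() {
  std::thread worker([] {
    ThreadLocalContext::Get()->Init();
    assert(g_live_comms.load() == 1);  // alive while the worker is running
  });
  worker.join();  // the thread_local destructor has run by the time join returns
  return g_live_comms.load();
}
```

Calling `LiveCommsAfterWorker()` from `main` should report zero live communicators. If the destructor is removed, the count stays at one after the join, which mirrors the per-session leak described above.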