Open jjsjann123 opened 4 years ago
I'm a little skeptical about going rogue with `cudaMalloc` + `cudaMemset` instead of using the framework's caching allocator. It is a big hammer and will take available GPU memory away from the framework's memory pool.
The most important optimization IMO is coalescing the allocations into a single one (actually two: one for zeroed memory and one for uninitialized memory). Going directly to `cudaMalloc` would be an additional, but secondary, optimization.
🚀 Feature
codegen can require multiple global buffers with zero-initialization. The current implementation uses `aten::zero` to construct the buffers, which implies one kernel launch per buffer for the memset. We should be able to aggregate the buffers and reduce them to a single memset kernel.
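To make the aggregation concrete, here is a minimal host-side sketch (in Python, with hypothetical names; the real implementation would live in the C++ integration code) of how several buffer requests can be packed into one backing allocation, so a single allocation and a single zero-fill cover all of them:

```python
def plan_aggregate(buffer_sizes, alignment=256):
    """Assign each requested buffer a byte offset inside one backing allocation.

    Hypothetical helper: each buffer is padded up to `alignment` so that
    one allocation (and one zero-fill / memset) can back all of them.
    Returns (offsets, total_size).
    """
    offsets = []
    total = 0
    for size in buffer_sizes:
        offsets.append(total)
        # round each buffer's footprint up to the next alignment boundary
        total += (size + alignment - 1) // alignment * alignment
    return offsets, total

# Three buffers of 100, 300, and 50 bytes collapse into one 1024-byte
# allocation that needs only a single memset to zero everything.
offsets, total = plan_aggregate([100, 300, 50])
```

With this plan, the launcher would issue one allocation of `total` bytes and one memset over it, then hand each consumer its slice at `offsets[i]`, instead of N allocations and N memset launches.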
Alternatives
A few ways to handle this:
1. We could avoid `aten::zeros` and use `cudaMalloc` and `cudaMemset` directly, as suggested in https://github.com/csarofeen/pytorch/pull/326#discussion_r477587739. This implies that we maintain our own allocator, which adds complexity to the integration code but keeps the codegen side simple.
2. Codegen could aggregate the zero-initialized buffers into one and access the buffer via compensated indexing. This is the opposite of approach 1: it puts the complexity in codegen instead of integration.
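A rough sketch of what the compensated indexing in approach 2 means, using a plain `bytearray` as a stand-in for the single zero-filled device allocation (all names here are illustrative, not the actual codegen output):

```python
import struct

def write_f32(base, byte_offset, index, value):
    # After aggregation, buf_i[index] becomes base[offset_i + index]:
    # the per-buffer byte offset is folded into the generated index math.
    struct.pack_into("<f", base, byte_offset + 4 * index, value)

def read_f32(base, byte_offset, index):
    return struct.unpack_from("<f", base, byte_offset + 4 * index)[0]

# One backing allocation, zero-filled once (stands in for a single cudaMemset).
backing = bytearray(1024)
buf_a, buf_b = 0, 256  # byte offsets assigned by the aggregation pass

write_f32(backing, buf_b, 3, 1.5)
```

Reads through `buf_a` still see the zeroes from the single fill, while `buf_b`'s element 3 holds the written value, so correctness only depends on the offsets being computed once at planning time.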