csarofeen / pytorch

Reduce number of kernels to initialize zero-init buffer in codegen kernel run #332

Open jjsjann123 opened 4 years ago

jjsjann123 commented 4 years ago

🚀 Feature

Codegen can require multiple global buffers with zero-initialization. The current implementation uses aten::zeros to construct the buffers, which implies multiple kernel launches for the memset.

We should be able to aggregate the buffers and reduce them to a single memset kernel.
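For illustration, a minimal sketch of what that aggregation could look like on the integration side. This is a hypothetical helper (not existing fuser code), assuming all zero-init buffers share a dtype/device and ignoring any alignment padding between sub-buffers:

```cpp
#include <ATen/ATen.h>
#include <vector>

// Hypothetical illustration: coalesce all zero-initialized global buffers
// into one flat allocation so only a single memset-like kernel is launched,
// then hand out offset views for each logical buffer.
std::vector<at::Tensor> allocateZeroInitBuffers(
    const std::vector<std::vector<int64_t>>& shapes,
    const at::TensorOptions& options) {
  int64_t total = 0;
  std::vector<int64_t> numels;
  for (const auto& shape : shapes) {
    int64_t n = 1;
    for (auto d : shape) {
      n *= d;
    }
    numels.push_back(n);
    total += n;
  }

  // One allocation, one zero fill, instead of one aten::zeros per buffer.
  at::Tensor flat = at::empty({total}, options);
  flat.zero_();

  // Carve out each logical buffer as a view into the flat allocation.
  std::vector<at::Tensor> buffers;
  int64_t offset = 0;
  for (size_t i = 0; i < shapes.size(); ++i) {
    buffers.push_back(flat.narrow(0, offset, numels[i]).view(shapes[i]));
    offset += numels[i];
  }
  return buffers;
}
```

The same idea works whether the views are passed to the generated kernel as separate arguments or, as in alternative 2 below, folded into the kernel's own indexing.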

Alternatives

A few ways to handle this:

  1. We could avoid using aten::zeros and call cudaMalloc and cudaMemset directly, as suggested in https://github.com/csarofeen/pytorch/pull/326#discussion_r477587739. This implies that we maintain our own allocator, which adds complexity to the integration code but keeps the codegen side simple.

  2. Codegen could aggregate the zero-initialized buffers into a single allocation and access each buffer via compensated (offset) indexing; see the sketch after this list. This is the opposite of approach 1: it puts the complexity on codegen instead of integration.
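A rough illustration of what approach 2 could look like: the host side hands the kernel a single zero-initialized workspace pointer, and each logical buffer lives at a fixed offset that codegen bakes in. Everything below is a hypothetical sketch, not actual generated code.

```cpp
// Hypothetical shape of a generated kernel under approach 2. The host
// allocates (and memsets) a single workspace of 2 * n floats; the two
// logical zero-init buffers are just offsets into it.
__global__ void fused_kernel(
    const float* in,
    float* out,
    float* zeroed_workspace,  // one buffer, one memset on the host side
    int64_t n) {
  // Offsets the codegen would emit as compile-time constants.
  float* buf0 = zeroed_workspace;      // previously its own zeroed buffer
  float* buf1 = zeroed_workspace + n;  // previously another zeroed buffer

  int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
  if (i < n) {
    buf0[i] += in[i];         // accumulation relies on the zero init
    buf1[i] += 2.0f * in[i];
    out[i] = buf0[i] + buf1[i];
  }
}
```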

jjsjann123 commented 4 years ago

I'm a little skeptical about going rogue with cudaMalloc + cudaMemset instead of using the framework's caching allocator. It is a big hammer and will take away available GPU memory from the framework's memory pool.
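For what it's worth, one possible middle ground is to take the raw bytes from the caching allocator (so the memory stays in the framework's pool) and still collapse the zero fill into a single cudaMemsetAsync. A sketch, assuming the c10::cuda::CUDACachingAllocator::raw_alloc / raw_delete entry points are available to us:

```cpp
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/cuda/CUDAStream.h>
#include <cuda_runtime.h>

// Sketch: one allocation from the caching allocator and one async memset
// covering every zero-init buffer needed by the kernel run.
void* allocZeroedWorkspace(size_t total_bytes) {
  void* ptr = c10::cuda::CUDACachingAllocator::raw_alloc(total_bytes);
  auto stream = c10::cuda::getCurrentCUDAStream();
  cudaMemsetAsync(ptr, 0, total_bytes, stream.stream());
  return ptr;  // release later with c10::cuda::CUDACachingAllocator::raw_delete(ptr)
}
```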

tlemo commented 4 years ago

> I'm a little skeptical about going rogue with cudaMalloc + cudaMemset instead of using the framework's caching allocator. It is a big hammer and will take away available GPU memory from the framework's memory pool.

The most important optimization, IMO, is coalescing the allocations into a single one (in practice one for zeroed memory and one for uninitialized memory). Going directly to cudaMalloc would be an additional, but secondary, optimization.
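To make that concrete, a small sketch of the two-pool layout (hypothetical helper, with the same dtype/device and alignment caveats as the sketch in the issue description): every requested buffer is backed by either a zeroed pool or an uninitialized pool, so at most one zero-fill launch happens per kernel run.

```cpp
#include <ATen/ATen.h>

// Sketch of the two coalesced allocations: one pool for buffers that must
// start at zero (a single memset) and one uninitialized pool for scratch
// buffers whose contents get overwritten anyway. Individual buffers would
// be offset views into these pools.
struct Workspaces {
  at::Tensor zeroed;  // backs every zero-init buffer
  at::Tensor plain;   // backs every uninitialized buffer
};

Workspaces allocateWorkspaces(int64_t zeroed_numel,
                              int64_t plain_numel,
                              const at::TensorOptions& options) {
  Workspaces ws;
  ws.zeroed = at::empty({zeroed_numel}, options);
  ws.zeroed.zero_();  // the only zero-fill launch for the whole run
  ws.plain = at::empty({plain_numel}, options);
  return ws;
}
```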