coreylowman / dfdx

Deep learning in Rust, with shape checked tensors and neural networks

Zero allocation inference forward #672

Open coreylowman opened 1 year ago

coreylowman commented 1 year ago

I think it might be possible to do inference forwards without any allocations. This would require the following:

  1. Need to be able to compute the max output size of any given nn::Module.
  2. The input tensor would be allocated according to the max size needed. I.e. tensor allocations would have capacity & len, similar to Vec (see the sketch after this list). A given tensor may not use all of its capacity to store its data.
  3. All operations would be in place. This includes matmul/conv2d/pool2d/reshape/softmax, etc.
     a. This would require global synchronization before writing to the output. I think this could be possible in CUDA, at least with cooperative groups, i.e. all blocks/threads would need to be synced before updating their output element.
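To make point 2 concrete, here is one shape the capacity/len idea could take on the CUDA side; the names and layout below are hypothetical illustrations, not dfdx's actual types:

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical buffer for point 2: one device allocation sized to the largest
// intermediate the module can produce, with `len` tracking how many elements
// the tensor currently stored there actually uses (like Vec's capacity vs len).
struct DeviceBuffer {
    float* data;      // device pointer, allocated once up front
    size_t capacity;  // elements allocated = max output size across the module
    size_t len;       // elements used by the tensor currently stored here
};

// Allocate once before the forward pass; each layer would then write its
// output into `data` in place and only update `len`, never reallocating.
inline DeviceBuffer alloc_for_module(size_t max_output_elems) {
    DeviceBuffer buf{nullptr, max_output_elems, 0};
    cudaMalloc(&buf.data, max_output_elems * sizeof(float));
    return buf;
}
```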

Blocking questions

  1. Is this possible with no extra storage (workspace)?
  2. How much extra memory would be required to make it work?
  3. How much speedup do we get from 0 allocations/frees?
  4. How many cudnn/cublas operations support in-place execution?
  5. How much slowdown do we get from writing our own in-place kernels (e.g. if cublas doesn't have an in-place version, we'd have to write our own CUDA kernel)?
coreylowman commented 1 year ago

A potential avenue for in-place matrix multiplication in CUDA would be:

  1. Each thread computes exactly 1 element of the output (i.e. there are M * N threads split into different blocks).
     a. Each thread would compute the length-k dot product for its single output element.
  2. All threads/blocks/grids would need to synchronize with each other, and all at the same time would write the new value to global memory.
     a. Since each one is synchronized at the same point, and since each thread is writing to a different location, all writes could happen at the same time.

I think this could potentially be fast, because in step 1 all threads do exactly the same number of computations. However, each thread also has to access a lot of memory locations, since it reads a full row of one input and a full column of the other, so memory access may be the slow part.
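A rough sketch of what steps 1 and 2 might look like with cooperative groups; this is a hypothetical kernel, not dfdx code, and it assumes row-major layouts, that the m x n output overwrites `a`'s buffer (so n <= k for it to fit), and that m * n threads fit within the cooperative-launch residency limit:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// In-place matmul sketch: a is (m x k), b is (k x n), and the (m x n) result
// overwrites a's buffer. Must be launched with cudaLaunchCooperativeKernel so
// that grid.sync() is valid, which restricts the grid to co-resident blocks.
__global__ void matmul_inplace(float* a, const float* b, int m, int k, int n) {
    cg::grid_group grid = cg::this_grid();
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int row = idx / n;
    int col = idx % n;

    // Step 1: each thread computes its single output element (a length-k dot
    // product) into a register, reading a and b but writing nothing yet.
    float acc = 0.0f;
    if (idx < m * n) {
        for (int i = 0; i < k; ++i) {
            acc += a[row * k + i] * b[i * n + col];
        }
    }

    // Step 2: grid-wide barrier, so no thread overwrites a's buffer while any
    // other thread might still need to read the old values from it.
    grid.sync();

    if (idx < m * n) {
        a[row * n + col] = acc;
    }
}
```

As written, every thread streams a full row of `a` and a full column of `b` from global memory, which is the memory-access concern above; the cooperative launch also requires all blocks to be resident on the device at once, so this only covers grids small enough to satisfy that.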

nkoppel commented 1 year ago

> All threads/blocks/grids would need to synchronize with each other, and all at the same time would write the new value to global memory.

I do not think that doing this is possible in the CUDA programming model. By my understanding, blocks must be able to operate independently in any order, and there is no way to synchronize block execution. We should do "one to one" operations in place as much as possible, but matrix multiplication, convolutions, and reductions require two separate buffers.