coreylowman opened this issue 1 year ago · Open
A potential avenue for in-place matrix multiply in CUDA would be (a kernel sketch follows below):

1. Launch M * N threads (split into different blocks).
   a. Each thread would do the length-k dot product for a single output element.
2. All threads/blocks/grids would need to synchronize with each other.
3. All threads would then, at the same time, write their new value to global memory.

I think this could potentially be fast, because in step 1 all threads are doing the exact same number of computations. However, each thread also has to access a ton of memory locations, since it operates on the whole input matrix, so there may be memory access slowness.
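For what it's worth, here is a minimal sketch of steps 1 through 3 as a single kernel. It assumes square n x n matrices so the output can overwrite the input, and it relies on cooperative groups for the grid-wide barrier, which only works when the kernel is launched with cudaLaunchCooperativeKernel and the entire grid is resident on the device at once.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sketch of in-place A = A * B, with the result overwriting A.
// Assumes square n x n row-major matrices and one thread per output element.
// Must be launched with cudaLaunchCooperativeKernel, and the whole grid must
// fit on the device at once; otherwise grid.sync() is not legal.
__global__ void matmul_in_place(float* a, const float* b, int n) {
    cg::grid_group grid = cg::this_grid();
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Step 1: every thread computes the length-n dot product for its element,
    // reading only the original contents of `a`, accumulating in a register.
    float acc = 0.0f;
    if (row < n && col < n) {
        for (int i = 0; i < n; ++i) {
            acc += a[row * n + i] * b[i * n + col];
        }
    }

    // Step 2: grid-wide barrier so no thread writes before all reads finish.
    grid.sync();

    // Step 3: all threads write their value back into `a`.
    if (row < n && col < n) {
        a[row * n + col] = acc;
    }
}
```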
I do not think that doing this is possible in the CUDA programming model. By my understanding, blocks must be able to operate independently, in any order, and there is no way to synchronize block execution across the grid. We should do "one to one" operations in place as much as possible, but matrix multiplication, convolutions, and reductions require two separate buffers.
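For comparison, here is a sketch of the standard out-of-place kernel: with a separate output buffer there is no read/write hazard on the inputs, so no cross-block synchronization is needed at all.

```cuda
// Out-of-place C = A * B: A is m x k, B is k x n, C is m x n, all row-major.
// The output goes to a separate buffer `c`, so blocks never race with reads
// of `a` or `b` and can run in any order.
__global__ void matmul(const float* a, const float* b, float* c,
                       int m, int k, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m && col < n) {
        float acc = 0.0f;
        for (int i = 0; i < k; ++i) {
            acc += a[row * k + i] * b[i * n + col];
        }
        c[row * n + col] = acc;
    }
}
```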
I think it might be possible to do inference forwards without any allocations. This would require the following (a rough buffer-reuse sketch follows the list):
1. nn::Module
2. Vec
3. A given tensor may not be using all the capacity to store its data.
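As a rough illustration of the idea (not dfdx's actual API; the layer kernel and sizes below are hypothetical), here is a host-side CUDA sketch where two scratch buffers are allocated once and reused for every layer, so no allocation happens during the forward pass and a given tensor may use only part of its buffer's capacity.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for any out-of-place layer kernel (not a real dfdx op):
// each output element is some function of the whole input.
__global__ void layer_kernel(const float* in, float* out, int in_dim, int out_dim) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < out_dim) {
        float acc = 0.0f;
        for (int j = 0; j < in_dim; ++j) acc += in[j];  // placeholder math
        out[i] = acc;
    }
}

int main() {
    // Hypothetical layer widths; capacity covers the largest intermediate.
    const int dims[] = {1024, 512, 256, 10};
    const size_t capacity = 1024;

    // Allocate the two scratch buffers once, before inference starts.
    float* buf[2];
    cudaMalloc(&buf[0], capacity * sizeof(float));
    cudaMalloc(&buf[1], capacity * sizeof(float));

    // One forward pass: ping-pong between the buffers, no allocation inside.
    // After each layer the live tensor may use only part of its buffer's
    // capacity (e.g. 256 of the 1024 available elements).
    int src = 0;
    for (int layer = 0; layer < 3; ++layer) {
        int dst = 1 - src;
        int out_dim = dims[layer + 1];
        layer_kernel<<<(out_dim + 255) / 256, 256>>>(buf[src], buf[dst], dims[layer], out_dim);
        src = dst;
    }
    cudaDeviceSynchronize();

    cudaFree(buf[0]);
    cudaFree(buf[1]);
    return 0;
}
```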
Blocking questions