This is an experimental change that offloads matrix multiplication in Linear layers to the GPU.
It may be inefficient for large batch sizes, because the input and output have to be transferred between the host and the device, but it yields a decent speed-up for non-batched requests (more experiments are required).
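
To give a rough sense of the transfer cost mentioned above, here is a minimal sketch that counts the bytes copied per Linear call. It is not part of this change: the function name `linear_transfer_bytes`, the layer sizes, and the float32 element size are all illustrative assumptions, and it assumes the weights already reside on the GPU so that only the input and output cross the host/device boundary.

```python
def linear_transfer_bytes(batch_size, in_features, out_features, bytes_per_element=4):
    """Bytes moved between host and device for one offloaded Linear call.

    Assumptions (hypothetical, not taken from the change itself):
    - the weight matrix is already resident on the GPU,
    - the input is copied host -> device and the output device -> host,
    - elements are float32 (4 bytes) unless stated otherwise.
    """
    input_bytes = batch_size * in_features * bytes_per_element
    output_bytes = batch_size * out_features * bytes_per_element
    return input_bytes + output_bytes


# Hypothetical layer shape: 512 -> 2048.
single = linear_transfer_bytes(batch_size=1, in_features=512, out_features=2048)
batched = linear_transfer_bytes(batch_size=64, in_features=512, out_features=2048)
print(single, batched)
```

Under these assumptions the transferred volume grows linearly with the batch size, which is one reason large batches may benefit less from the offload. Actual timings depend on the hardware and still need to be measured.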