Open yzh119 opened 2 weeks ago
Faster for odd hidden dimensions, slower for hidden dimensions divisible by 4.
Maybe we should use a mixture of BlockLoad/BlockStore and the current solution.
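
A rough sketch of what such a mixture could look like on the host side. This assumes the BlockLoad/BlockStore path is the one that wins for odd hidden dimensions and the current solution is the one that wins when the hidden dimension is divisible by 4; the comment above does not spell that out, so treat it as my reading. `LoadStrategy` and `SelectLoadStrategy` are hypothetical names, not existing functions in the codebase, and the kernels themselves are not shown.

```cpp
#include <cstdint>

// Hypothetical tag for which load/store path a kernel launch should use.
enum class LoadStrategy {
  kBlockLoadStore,  // path built on BlockLoad/BlockStore
  kCurrentSolution  // existing path (assumed faster when hidden_dim % 4 == 0)
};

// Pick a path from the hidden dimension alone.
// Assumption: divisibility by 4 is the only switch; the real threshold
// should come from benchmarks, not from this sketch.
inline LoadStrategy SelectLoadStrategy(std::int64_t hidden_dim) {
  if (hidden_dim % 4 == 0) {
    return LoadStrategy::kCurrentSolution;
  }
  return LoadStrategy::kBlockLoadStore;
}
```

The launcher would then branch on the returned value and call the matching kernel. Dimensions that are even but not divisible by 4 fall into the BlockLoad/BlockStore branch here, which is a guess that would need measuring.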