This gives a massive performance boost for small matrices with large batch sizes. For large matrices, performance eventually matches that of batched_mul, whose strategy is good enough at that scale.
Features
[x] Introduces batched_matmul, which may call into batched_mul. On CPU in particular, this method is much faster than calling batched_mul directly.
[x] Polyester Backend using Octavian.matmul_serial!
[x] LoopVectorization (LV) backend
[x] Size checks
[x] Add rrules
[x] If LV.check_args fails, we fall back to @batch
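For reviewers, the semantics the backends above are expected to share can be sketched as a plain loop over the batch dimension. This is an illustrative reference only, not the implementation in this PR: the name naive_batched_matmul is hypothetical, and the rule that a batch dimension of size 1 is reused for every slice is an assumption modelled on how batched_mul behaves.

```julia
using LinearAlgebra

# Hypothetical reference helper for illustration; not part of this PR.
function naive_batched_matmul(x::AbstractArray{T, 3}, y::AbstractArray{T, 3}) where {T}
    # Size checks: inner dimensions must agree.
    size(x, 2) == size(y, 1) ||
        throw(DimensionMismatch("inner dimensions differ: $(size(x, 2)) vs $(size(y, 1))"))
    # Assumption: batch sizes must match, or one of them must be 1.
    bx, by = size(x, 3), size(y, 3)
    (bx == by || bx == 1 || by == 1) ||
        throw(DimensionMismatch("batch sizes differ: $bx vs $by"))

    z = similar(x, size(x, 1), size(y, 2), max(bx, by))
    for k in axes(z, 3)
        # A singleton batch dimension is reused for every slice.
        xk = view(x, :, :, bx == 1 ? 1 : k)
        yk = view(y, :, :, by == 1 ? 1 : k)
        # Write the k-th product in place to avoid a temporary per slice.
        mul!(view(z, :, :, k), xk, yk)
    end
    return z
end
```

A loop like this is handy as a correctness oracle in tests: any faster backend should agree with it up to floating-point tolerance.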
Fixes
[x] Fix the Tracker batched_mul gradient by specializing only on 3D arrays
[x] Patch for batched_mul of Complex Numbers on AMDGPU
Tests
[x] Adds the NNlib batched_mul tests, with wider coverage for GPU testing