test: improved batched matmul & LV handling

This gives a large performance boost for small matrices with large batches. For large matrices, performance eventually matches that of batched_mul, whose strategy is good enough at that scale.

Features

- [x] Introduces batched_matmul, which may call into batched_mul. On CPU in particular, this method is much faster than calling batched_mul directly.
  - [x] Polyester backend using Octavian.matmul_serial!
  - [x] LV (LoopVectorization) backend
  - [x] Size checks
- [x] Add rrules
- [x] If LV.check_args fails, fall back to @batch
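
For readers unfamiliar with the operation, here is a minimal reference sketch of what a batched matrix multiply with size checks computes, along with the pullback an rrule would need. It uses only the standard library. The names naive_batched_matmul and naive_batched_matmul_pullback are hypothetical and are not part of LuxLib. The rule that a batch size of 1 is reused for every batch is an assumption of this sketch, and the pullback assumes real-valued arrays with equal batch sizes. It does not reproduce the Polyester, Octavian, or LoopVectorization backends from this PR.

```julia
using LinearAlgebra: mul!

# Hypothetical reference implementation for illustration only; NOT LuxLib's code.
function naive_batched_matmul(A::AbstractArray{T,3}, B::AbstractArray{T,3}) where {T}
    # Size checks: inner dimensions must agree.
    size(A, 2) == size(B, 1) ||
        throw(DimensionMismatch("size(A, 2) = $(size(A, 2)) but size(B, 1) = $(size(B, 1))"))
    # Assumption of this sketch: batch sizes must match, or one of them must be 1.
    nA, nB = size(A, 3), size(B, 3)
    (nA == nB || nA == 1 || nB == 1) ||
        throw(DimensionMismatch("batch sizes $nA and $nB are incompatible"))

    C = similar(A, size(A, 1), size(B, 2), max(nA, nB))
    for k in axes(C, 3)
        # Views avoid copying each slice; mul! writes the product into C in place.
        # min(k, n) selects slice 1 for every k when that operand has batch size 1.
        mul!(view(C, :, :, k), view(A, :, :, min(k, nA)), view(B, :, :, min(k, nB)))
    end
    return C
end

# Pullback for C = A ⊠ B, assuming real entries and equal batch sizes:
# ∂A[:, :, k] = Δ[:, :, k] * B[:, :, k]' and ∂B[:, :, k] = A[:, :, k]' * Δ[:, :, k].
function naive_batched_matmul_pullback(Δ::AbstractArray{T,3}, A::AbstractArray{T,3},
        B::AbstractArray{T,3}) where {T<:Real}
    size(A, 3) == size(B, 3) ||
        throw(DimensionMismatch("this sketch only handles equal batch sizes"))
    ∂A = naive_batched_matmul(Δ, permutedims(B, (2, 1, 3)))
    ∂B = naive_batched_matmul(permutedims(A, (2, 1, 3)), Δ)
    return ∂A, ∂B
end

# Usage: many small matrices, the regime this PR targets.
A = rand(4, 3, 128)
B = rand(3, 5, 128)
C = naive_batched_matmul(A, B)                         # size (4, 5, 128)
∂A, ∂B = naive_batched_matmul_pullback(ones(size(C)), A, B)
```

A per-slice loop like this is only a correctness baseline. The speedup described above comes from the specialized backends, not from this loop.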

Fixes

- [x] Fix the Tracker batched_mul gradient by specializing only on 3D arrays
- [x] Patch batched_mul for complex numbers on AMDGPU

Tests

- [x] Add the NNlib batched_mul tests, with wider coverage for GPU testing
- [x] Update the tests to use batched_matmul
- [x] More tests for improved coverage

LuxDL / LuxLib.jl

test: improved batched matmul & LV handling #121
