Polish a fair benchmarking task (MNIST) for Zigrad vs PyTorch on CPU.
Support granular nanosecond-level timing and allocation analysis for optimization.
Provide a comparison feature.
Make small optimization tweaks.
Everything should be set up to run with a hermetic build chain, for later replication on clean hardware and public verification (to be actually implemented via the hermetic build PR).
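The fairness goal above implies both frameworks should be timed under an identical CPU budget with warmup and percentile reporting. A minimal sketch of such a harness follows; the `bench` name, warmup/iteration counts, and thread-pinning via `OMP_NUM_THREADS` are illustrative assumptions, not the project's actual scripts.

```python
import os
import statistics
import time

# Pin thread pools before importing any numerical library so both frameworks
# see the same CPU budget (an assumption about the fair-benchmark setup).
os.environ.setdefault("OMP_NUM_THREADS", "1")

def bench(fn, warmup=10, iters=100):
    """Time fn with perf_counter_ns; returns (median_ns, p99_ns)."""
    for _ in range(warmup):  # warm caches, allocators, and any lazy init
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - t0)
    samples.sort()
    return statistics.median(samples), samples[int(0.99 * (len(samples) - 1))]
```

Reporting medians and tail percentiles rather than means keeps one-off scheduler hiccups from skewing the cross-framework comparison.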
Tasks and Changes
[x] Faster FlattenLayer (it's still not what I wanted, but sufficient for timing)
[x] Tracy for granular ns benchmarking and statistical analysis
[x] Add support for gemm accumulation
[x] Batch matmul with accumulation
    [x] bmm accumulation forward
    [x] bmm accumulation backward
[x] Tests
    [x] add more matmul tests for edge cases
[x] Fair profiling scripts for torch and zigrad mnist
[x] Plotting torch vs zigrad
[x] Add more log points; remove debug logging calls that performed expensive computations
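The gemm/bmm-with-accumulation items above follow the BLAS convention C = beta*C + alpha*A@B applied per batch. A NumPy reference of the intended forward semantics is sketched below; the `bmm_acc` name and signature are assumptions for illustration, not Zigrad's API.

```python
import numpy as np

def bmm_acc(A, B, C, alpha=1.0, beta=1.0):
    """Batched matmul with accumulation: C[i] = beta * C[i] + alpha * A[i] @ B[i].

    Shapes: A (b, m, k), B (b, k, n), C (b, m, n). Accumulating directly
    into C avoids allocating a temporary product plus a separate add pass,
    which is the point of supporting accumulation in gemm. The same scheme
    serves the backward pass, where gradients accumulate into existing
    buffers (e.g. dA[i] += dC[i] @ B[i].T).
    """
    np.multiply(C, beta, out=C)             # scale the existing accumulator
    C += alpha * np.matmul(A, B)            # accumulate the batched product
    return C
```

Testing this reference against edge cases (beta=0, non-square batches) mirrors the "more matmul tests" item above.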
Changed components:
FlattenLayer (it's still not what I wanted, but sufficient for timing)
LinearLayer (better bias handling)
Build modifications
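For the comparison/plotting side, the two frameworks' raw per-step nanosecond traces need to be collapsed into comparable summary statistics before charting. A minimal sketch, with assumed function names:

```python
import statistics

def summarize(ns_samples):
    """Collapse raw per-step nanosecond samples into comparable stats."""
    s = sorted(ns_samples)
    return {
        "median_ns": statistics.median(s),
        "p99_ns": s[int(0.99 * (len(s) - 1))],
        "mean_ns": statistics.fmean(s),
    }

def speedup(torch_stats, zigrad_stats, key="median_ns"):
    """PyTorch time divided by Zigrad time; > 1.0 means Zigrad is faster."""
    return torch_stats[key] / zigrad_stats[key]
```

Plotting the per-step distributions (not just the means) keeps the comparison honest about tail latency.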