Significantly improve NNUE inference code quality.

This PR adds a generic SIMD abstraction for writing NNUE inference code, and uses it to improve the code quality and performance for viri's NNUE.
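
For readers unfamiliar with the pattern, here is a minimal sketch of what a generic SIMD abstraction for NNUE inference can look like. It is not the code from this PR: the trait `SimdI16`, the portable backend `Scalar8`, the function `crelu_dot`, and the choice of a clipped-ReLU activation are all hypothetical, chosen only for illustration. The idea is that inference code is written once against a trait, and each target (AVX2, AVX512, NEON, or a scalar fallback) supplies its own implementation.

```rust
// Hypothetical sketch: none of these names are taken from the PR.

/// A fixed-width vector of i16 lanes with the operations a layer needs.
pub trait SimdI16: Copy {
    /// Number of i16 lanes in one vector.
    const LANES: usize;
    /// Broadcast one value to every lane.
    fn splat(x: i16) -> Self;
    /// Load the first `LANES` values from `src`.
    fn load(src: &[i16]) -> Self;
    /// Clamp every lane to the range `[lo, hi]`.
    fn clamp(self, lo: Self, hi: Self) -> Self;
    /// Multiply lanes pairwise, widen to i32, and sum across lanes.
    fn dot(self, other: Self) -> i32;
}

/// Portable fallback backend: a plain array of eight lanes.
/// A real backend would wrap a hardware register type instead.
#[derive(Clone, Copy)]
pub struct Scalar8([i16; 8]);

impl SimdI16 for Scalar8 {
    const LANES: usize = 8;

    fn splat(x: i16) -> Self {
        Scalar8([x; 8])
    }

    fn load(src: &[i16]) -> Self {
        let mut out = [0i16; 8];
        out.copy_from_slice(&src[..8]);
        Scalar8(out)
    }

    fn clamp(self, lo: Self, hi: Self) -> Self {
        let mut out = self.0;
        for i in 0..8 {
            out[i] = out[i].clamp(lo.0[i], hi.0[i]);
        }
        Scalar8(out)
    }

    fn dot(self, other: Self) -> i32 {
        self.0
            .iter()
            .zip(other.0.iter())
            .map(|(&a, &b)| i32::from(a) * i32::from(b))
            .sum()
    }
}

/// Clipped ReLU on the accumulator followed by a dot product with the
/// output weights, written once for any backend `V`.
/// `qa` is the (assumed) upper clipping bound of the activation.
pub fn crelu_dot<V: SimdI16>(acc: &[i16], weights: &[i16], qa: i16) -> i32 {
    assert_eq!(acc.len(), weights.len());
    assert_eq!(acc.len() % V::LANES, 0);
    let (zero, max) = (V::splat(0), V::splat(qa));
    let mut sum = 0i32;
    for (a, w) in acc.chunks_exact(V::LANES).zip(weights.chunks_exact(V::LANES)) {
        sum += V::load(a).clamp(zero, max).dot(V::load(w));
    }
    sum
}
```

With this shape, a call such as `crelu_dot::<Scalar8>(&acc, &weights, 255)` picks the backend at compile time, so supporting another instruction set only requires another `impl SimdI16`.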

AVX2 performance is roughly unchanged, and the PR passes a non-regression SPRT:

Elo   | -0.16 +- 2.52 (95%)
SPRT  | 8.0+0.08s Threads=1 Hash=16MB
LLR   | 3.06 (-2.94, 2.94) [-5.00, 0.00]
Games | N: 34012 W: 7967 L: 7983 D: 18062
Penta | [130, 3648, 9490, 3584, 154]
https://chess.swehosting.se/test/6428/

AVX512 is ~28.9% faster than master. ARM NEON performance, however, has /cratered/: it is about 30% slower than before. I am working on a fix, and will create a development branch for it immediately after merging this one.

cosmobobak/viridithas #140