Implement an Efficiently Updatable Neural Network (NNUE).

This PR implements a Neural Network based evaluation. The idea is that currently, while being instrumentally fast, StockNemo's evaluation lacks many vital things and is too basic (or rigid). It misevaluates some key positions and blunders because of that.

The neural network architecture picked in this PR is: (768 -> 256)x2 -> 1. It's a half-network combined based on side-to-move, teaching the network tempo. Featuring 197377 trainable parameters, the network is overall small but decent-sized in the realm of NNUE. There's a ClippedReLU method between the Input Layer (768) and the Hidden Layer (256) clamping the values to the range [0, 255].

This neural network is updated efficiently using incremental updates (adding/subtracting) weights from the accumulator (a median between the input and activation function). This allows for fast updates to the neural network (roughly 90ns). The main benefit is that since it's incrementally updated, inference only has to run starting from pre-activation, leading to a super fast inference time of 90ns.

As part of this PR, an inference API with vectorization support has been added to StockNemo. Most of the code is cross-compatible with a significant speed boost for AVX and SSE supporting systems (almost all modern-day systems). Furthermore, there's an extra speed boost for systems supporting AVX-2 and SSE2.

ELO Difference

TC: 10s + 0.1s

Using UHO_XXL_+0.90_+1.19.epd:

ELO   | 341.37 +- 71.44 (95%)
SPRT  | 10.0+0.10s Threads=1 Hash=16MB
LLR   | 3.01 (-2.94, 2.94) [0.00, 5.00]
GAMES | N: 240 W: 202 L: 21 D: 17

TheBlackPlague / StockNemo

Implement an Efficiently Updatable Neural Network (NNUE). #53

ELO Difference

TC: 10s + 0.1s