Matrix update fast matrix update internals

Rework of the FastMatrix implementation that decouples it from FastVector.

This addresses the performance concerns raised in issue #62. The interface implementation is at the same point as before the merge, with perhaps a few unimportant exceptions.

It was decided to do this rework now rather than later because FastMatrix is still very early in its development; it is better to tackle performance concerns early than to let them fester and make a rework harder.

The changes here bring FastMatrix on par with the DirectXMath equivalents, removing the latency recorded in issue #62. This includes load-operate-store combinations, where we load a scalar matrix into SIMD registers, operate on it, and then store it back into a scalar matrix; these were previously significantly slower (although this appears to be unrecorded in the mentioned issue). On some runs FastMatrix comes out ahead fairly consistently, but this is likely just noise in the timings, since DirectXMath takes the same occasional lead.
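For readers unfamiliar with the load-operate-store pattern mentioned above, here is a minimal sketch of it. It is not the EmuLibs implementation: `ScalarMat4` and `load_add_store` are hypothetical names, the matrix is assumed to be 4x4 floats stored as four contiguous chunks, and SSE intrinsics are used only where the compiler reports support, with a scalar fallback otherwise.

```cpp
#include <array>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#endif

// Hypothetical scalar matrix: 4 major chunks of 4 contiguous floats each.
struct ScalarMat4 {
    std::array<float, 16> data{};
};

// Load each chunk into a register, add, and store the result back
// into a newly constructed scalar matrix.
inline ScalarMat4 load_add_store(const ScalarMat4& a, const ScalarMat4& b) {
    ScalarMat4 out;
    for (std::size_t major = 0; major < 4; ++major) {
        const float* pa = a.data.data() + major * 4;
        const float* pb = b.data.data() + major * 4;
        float* po = out.data.data() + major * 4;
#ifdef SKETCH_HAS_SSE
        const __m128 ra = _mm_loadu_ps(pa);      // load
        const __m128 rb = _mm_loadu_ps(pb);      // load
        _mm_storeu_ps(po, _mm_add_ps(ra, rb));   // operate, then store
#else
        // Scalar fallback for targets without SSE.
        for (std::size_t i = 0; i < 4; ++i) {
            po[i] = pa[i] + pb[i];
        }
#endif
    }
    return out;
}
```

The cost being compared against DirectXMath is the whole round trip (both loads, the operation, and the store), not the arithmetic alone.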

This also adds the following new functionality:

- Default arguments for fast_matrix_store when outputting a newly constructed scalar Matrix, which will always be the respective argument of the input FastMatrix.
- GetRegister function which retrieves the register from a major chunk, with the syntax <MajorIndex, RegisterIndex>, where RegisterIndex is the index within the chunk, not the overall Matrix.
  - RegisterIndex defaults to 0 to give a nicer interface in cases such as working with Matrices which have only 1 register per chunk (which is also the common case unless using 64-bit values).
- operator* to multiply two FastMatrix instances
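To illustrate the `<MajorIndex, RegisterIndex>` indexing described above, the sketch below uses toy stand-ins rather than the real EmuLibs types. `ToyRegister` and `ToyFastMatrix` are hypothetical, and the layout (an array of major chunks, each holding a fixed number of registers) is an assumption made only to show how a chunk-relative index with a default of 0 might behave.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a SIMD register holding `Lanes` values.
template<typename T, std::size_t Lanes>
struct ToyRegister {
    std::array<T, Lanes> lanes{};
};

// Hypothetical matrix layout: NumMajors major chunks, each split into
// RegistersPerMajor registers. This is an assumed layout, not EmuLibs'.
template<typename T, std::size_t NumMajors, std::size_t RegistersPerMajor, std::size_t Lanes>
struct ToyFastMatrix {
    using Register = ToyRegister<T, Lanes>;

    std::array<std::array<Register, RegistersPerMajor>, NumMajors> majors{};

    // RegisterIndex is relative to the chosen major chunk, not to the whole
    // matrix, and defaults to 0 for the one-register-per-chunk case.
    template<std::size_t MajorIndex, std::size_t RegisterIndex = 0>
    const Register& GetRegister() const {
        static_assert(MajorIndex < NumMajors, "major index out of range");
        static_assert(RegisterIndex < RegistersPerMajor, "register index out of range for a chunk");
        return majors[MajorIndex][RegisterIndex];
    }
};
```

With one register per chunk, `m.GetRegister<2>()` is enough. With a chunk of four 64-bit values split across two 2-lane registers, the second half of chunk 2 would be `m.GetRegister<2, 1>()`.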

BigUglySpider / EmuLibs

Matrix update fast matrix update internals #63