Use overlapping SIMD vectors

Inspired by https://github.com/dotnet/runtime/pull/83488

This relates to loops that use Vector where there are tail elements that don't completely fill a vector. Currently those loops all revert to scalar operations to handle the tail elements, but in some places it will be possible to apply one more Vector positioned over the final elements of the span, such that some elements are calculated twice (i.e. the tail vector overlaps the last vector slice from the loop).
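A minimal sketch of the idea, not code taken from Redzen or from the linked PR. The class and method names (`OverlapSketch`, `Sqrt`) are hypothetical, and it assumes `System.Numerics.Vector<float>` with separate, non-aliasing source and destination spans. The overlap is only safe when recomputing an element gives the same result, so it does not apply as-is to in-place updates or to reductions such as sums.

```csharp
using System;
using System.Numerics;

public static class OverlapSketch
{
    // Hypothetical helper: writes sqrt(src[i]) into dst[i] for every element.
    // Assumes src and dst do not alias; otherwise the overlapped elements
    // would have the operation applied twice.
    public static void Sqrt(ReadOnlySpan<float> src, Span<float> dst)
    {
        if (dst.Length < src.Length)
            throw new ArgumentException("dst is shorter than src.", nameof(dst));

        int width = Vector<float>.Count;
        int len = src.Length;

        // Too short for even one vector (or no SIMD support): plain scalar loop.
        if (!Vector.IsHardwareAccelerated || len < width)
        {
            for (int i = 0; i < len; i++)
                dst[i] = MathF.Sqrt(src[i]);
            return;
        }

        // Main loop over whole vector-sized slices.
        int idx = 0;
        for (; idx <= len - width; idx += width)
        {
            var v = new Vector<float>(src.Slice(idx));
            Vector.SquareRoot(v).CopyTo(dst.Slice(idx));
        }

        // Tail: rather than a scalar loop, process one final vector aligned
        // to the end of the span. It overlaps the last slice from the loop,
        // so (width - remaining) elements are simply calculated a second time.
        if (idx < len)
        {
            int last = len - width;
            var v = new Vector<float>(src.Slice(last));
            Vector.SquareRoot(v).CopyTo(dst.Slice(last));
        }
    }
}
```

The scalar path is still needed for inputs shorter than one vector, since there is no earlier slice to overlap with.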

It seems there are multiple examples of this pattern in the .NET runtime repo; see https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-8/#zeroing

colgreen / Redzen

Use overlapping SIMD vectors #21