For non-AVX2 targets delegate to 128bit subparts

Simplify the implementation of the 256-bit wide types by composing them of two 128-bit parts (`a`, `b`) when not compiling for AVX2. After inlining, this produces exactly the same code while significantly reducing complexity.
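To illustrate the pattern, here is a minimal sketch of the non-AVX2 path. The type names `F32x4Sketch` and `F32x8Sketch`, the array-backed 128-bit type, and the lane order (low lanes in `a`, high lanes in `b`) are assumptions made for this example and are not the crate's actual definitions.

```rust
use std::ops::Add;

/// Hypothetical stand-in for a 128-bit vector of four f32 lanes.
/// A real implementation would wrap a platform SIMD register instead.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct F32x4Sketch(pub [f32; 4]);

impl Add for F32x4Sketch {
    type Output = Self;

    #[inline]
    fn add(self, rhs: Self) -> Self {
        let mut out = [0.0f32; 4];
        for i in 0..4 {
            out[i] = self.0[i] + rhs.0[i];
        }
        F32x4Sketch(out)
    }
}

/// Hypothetical 256-bit type for non-AVX2 targets: two 128-bit halves.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct F32x8Sketch {
    pub a: F32x4Sketch, // lanes 0..4 (assumed order)
    pub b: F32x4Sketch, // lanes 4..8 (assumed order)
}

impl F32x8Sketch {
    /// Split eight lanes into the two 128-bit halves.
    pub fn new(v: [f32; 8]) -> Self {
        Self {
            a: F32x4Sketch([v[0], v[1], v[2], v[3]]),
            b: F32x4Sketch([v[4], v[5], v[6], v[7]]),
        }
    }

    /// Join the two halves back into eight lanes.
    pub fn to_array(self) -> [f32; 8] {
        let (a, b) = (self.a.0, self.b.0);
        [a[0], a[1], a[2], a[3], b[0], b[1], b[2], b[3]]
    }
}

impl Add for F32x8Sketch {
    type Output = Self;

    // The 256-bit operation just delegates to each 128-bit half, so no
    // per-target code has to be duplicated at this level.
    #[inline]
    fn add(self, rhs: Self) -> Self {
        Self { a: self.a + rhs.a, b: self.b + rhs.b }
    }
}
```

With this shape, every operation on the wider type is a one-line forward to the narrower one, and adding a new target only requires implementing the 128-bit type.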

This will make it easier to add support for more targets, since the 256-bit implementations no longer need to be copied and pasted for each one. The same pattern might also be useful for 512-bit implementations for AVX-512.

I separated this from the NEON support PR because keeping feature-neutral refactoring apart from feature additions makes regressions easier to catch.

Lokathor/wide #125