Simplify the implementation of the 256-bit wide types by composing them of two 128-bit parts (a, b) when not compiling for AVX2. After inlining, this produces exactly the same code, and it significantly reduces the complexity.
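For illustration, a minimal sketch of the composition pattern, assuming Rust. The names `F32x4`, `F32x8`, `new` and `to_array` are hypothetical and not taken from this PR, and the 128-bit type is a plain array so the example builds on any target. In real code the composed definition would presumably sit behind something like `#[cfg(not(target_feature = "avx2"))]`.

```rust
use std::ops::Add;

/// Hypothetical 128-bit type holding four f32 lanes.
/// A real implementation would wrap an SSE or NEON register instead of an array.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct F32x4(pub [f32; 4]);

impl Add for F32x4 {
    type Output = Self;

    #[inline(always)]
    fn add(self, rhs: Self) -> Self {
        let (x, y) = (self.0, rhs.0);
        F32x4([x[0] + y[0], x[1] + y[1], x[2] + y[2], x[3] + y[3]])
    }
}

/// Hypothetical 256-bit type for targets without AVX2:
/// two 128-bit halves, `a` (lanes 0..4) and `b` (lanes 4..8).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct F32x8 {
    pub a: F32x4,
    pub b: F32x4,
}

impl F32x8 {
    /// Split eight lanes into the two 128-bit halves.
    #[inline(always)]
    pub fn new(v: [f32; 8]) -> Self {
        F32x8 {
            a: F32x4([v[0], v[1], v[2], v[3]]),
            b: F32x4([v[4], v[5], v[6], v[7]]),
        }
    }

    /// Join the two halves back into eight lanes.
    #[inline(always)]
    pub fn to_array(self) -> [f32; 8] {
        let (a, b) = (self.a.0, self.b.0);
        [a[0], a[1], a[2], a[3], b[0], b[1], b[2], b[3]]
    }
}

impl Add for F32x8 {
    type Output = Self;

    /// Each 256-bit operation forwards to the 128-bit operation on both halves,
    /// so no per-target 256-bit code is needed.
    #[inline(always)]
    fn add(self, rhs: Self) -> Self {
        F32x8 {
            a: self.a + rhs.a,
            b: self.b + rhs.b,
        }
    }
}
```

Because every 256-bit operation only delegates to the 128-bit one, a new target needs only the 128-bit implementation in this sketch.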
This will make it easier to add support for more targets, since the 256-bit implementations no longer need to be copy/pasted for each one. The same pattern might also be useful for 512-bit implementations for AVX-512.
I separated this from the NEON support PR, since keeping feature-neutral refactoring apart from feature additions makes regressions easier to catch.