The zip macro was too low-level to count as an abstraction. It was only used in the 2x2 block upscaling, and there it made no difference to speed: with no arithmetic to perform, memory bandwidth was the bottleneck.
Optionally giving direct access to unusual instructions was the wrong decision overall, because at that point you might as well write directly for a specific target using platform-specific intrinsics. It also leaves too little leeway for optimization when future hardware provides slightly different instructions. On top of that, it doubles the testing work, since each exposed instruction needs a manual reference implementation that uses an entirely different algorithm. It is better to define higher-level mathematical compound operations, which lets the library choose the shortest path to solving the problem as a whole.
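The following minimal C++ sketch illustrates the difference. It assumes a hypothetical compound operation `upscale2x2` on an 8-bit single-channel image, and a hypothetical 4-lane `U8x4` type that stands in for a SIMD register. Neither is the library's actual API. The caller only asks for the upscaled image. The zip-style interleave stays a private detail that a backend could replace or drop. Every path is checked against the same mathematical definition: the output pixel at (x, y) equals the source pixel at (x / 2, y / 2).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace detail {
    // Hypothetical 4-lane byte vector, standing in for a SIMD register.
    struct U8x4 { std::uint8_t lane[4]; };

    // Low-level "zip": interleaves the lanes of a and b.
    // Kept private, so the library may swap it for whatever a target does best.
    inline void zip(const U8x4& a, const U8x4& b, U8x4& low, U8x4& high) {
        low  = U8x4{{a.lane[0], b.lane[0], a.lane[1], b.lane[1]}};
        high = U8x4{{a.lane[2], b.lane[2], a.lane[3], b.lane[3]}};
    }
}

// Hypothetical compound operation: nearest-neighbour 2x2 block upscaling of an
// 8-bit single-channel image stored row by row without padding.
std::vector<std::uint8_t> upscale2x2(const std::vector<std::uint8_t>& src,
                                     std::size_t width, std::size_t height) {
    const std::size_t outWidth = width * 2;
    std::vector<std::uint8_t> dst(outWidth * height * 2);
    for (std::size_t y = 0; y < height; y++) {
        const std::uint8_t* in = src.data() + y * width;
        std::uint8_t* top = dst.data() + (y * 2) * outWidth;
        std::size_t x = 0;
        // Vector path: zipping a vector with itself duplicates each lane horizontally.
        for (; x + 4 <= width; x += 4) {
            detail::U8x4 a{{in[x], in[x + 1], in[x + 2], in[x + 3]}};
            detail::U8x4 low, high;
            detail::zip(a, a, low, high);
            for (std::size_t i = 0; i < 4; i++) {
                top[x * 2 + i] = low.lane[i];
                top[x * 2 + 4 + i] = high.lane[i];
            }
        }
        // Scalar tail for widths that are not a multiple of 4.
        for (; x < width; x++) {
            top[x * 2] = in[x];
            top[x * 2 + 1] = in[x];
        }
        // Vertical doubling: the next output row is a plain copy of this one.
        std::copy(top, top + outWidth, top + outWidth);
    }
    return dst;
}
```

The zip step is hidden inside `upscale2x2`. A target with a different interleave instruction, or one where a plain byte copy is just as fast because memory is the bottleneck, can therefore change the implementation without touching callers or the reference test.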