Use a formulation that automatically produces the same result in all lanes, avoiding a separate broadcast step.
The same approach would work with floats in principle, but it's not guaranteed to give the same result in all lanes when NaNs are involved (due to the way MINPS/MAXPS are defined), so leave the float versions alone for now.
About 1% encode time reduction encoding a 8192x8192 test texture at 6x6 -thorough on a Ryzen 7950X3D.
Use a formulation that automatically produces the same result in all lanes, avoiding a separate broadcast step.
The same approach would work with floats in principle, but it's not guaranteed to give the same result in all lanes when NaNs are involved (due to the way MINPS/MAXPS are defined), so leave the float versions alone for now.
About 1% encode time reduction encoding a 8192x8192 test texture at 6x6 -thorough on a Ryzen 7950X3D.