Potential Performance gain by not using (min/max) functions on floating point scalars

unfinishedprogram commented 11 months ago

Currently, the implementation of min_element and max_element for floating point vec3 uses the built in min/max functions. These functions explicitly handle NaN values, and do not propagate them, as is explained here. This is potentially desired behavior, however, this is semantically different from the meaning of min in vectorized contexts where NaN values would be propagated.

The extra handling of NaNs in these functions incurs significant performance overhead. Benchmarking on my computer using a custom f32::min implementation, which does not check for NaN, was ~25% faster.

// Current implementation
pub fn min_element(self) -> f32 {
    self.x.min(self.y.min(self.z))
}

// Faster implementation
pub fn min_element(self) -> f32 {
    let min = |a, b| if a < b { a } else { b };
    min(self.x, min(self.y, self.z))
}

We can see why this is so much faster by comparing the generated assembly: https://godbolt.org/z/3Ph1hWx8e

If I'm missing something please let me know, but otherwise this seems like an easy change which could make a significant difference in performance for some use-cases. I didn't have the time to look into other uses of floating-point min throughout the library but I'd guess it's replacement could improve performance elsewhere as well.

bitshifter commented 10 months ago

Yeah I think that makes sense, especially if it means behaviour is the same with and without SIMD.

bitshifter commented 7 months ago

Turns out there are unstable https://doc.rust-lang.org/core/primitive.f32.html#method.maximum and https://doc.rust-lang.org/core/primitive.f32.html#method.minimum functions that are similar to what you suggest, although there are issues with stabilizing them https://github.com/rust-lang/rust/issues/91079.

unfinishedprogram commented 7 months ago

Interesting, I guess some digging on godbolt is required to determine if this manual "ternary style" replacement compiles to the expected instruction on all architectures.

Another thing I didn't consider with this proposal, is non-optimized code-gen. This method could be a substantial performance degradation for optimized builds. However, I'm not sure how important this is for the goals of this library or if this is even the case. It would need to be tested.

bitshifter / glam-rs

Potential Performance gain by not using (min/max) functions on floating point scalars #444