Lokathor / wide

A crate to help you go wide. By which I mean use SIMD stuff.
https://docs.rs/wide
zlib License
288 stars 24 forks

Casting masks between i32x8 and f32x8 #145

Closed ghost closed 9 months ago

ghost commented 9 months ago

Hi, thanks for this great library. I'm a bit new to SIMD programming, so apologies if the answer is obvious. I'm trying to port some possibly fishy intrinsics to wide; the operation is:

__m256i a;
__m256i b;
__m256 mask;
a = _mm256_castps_si256(
        _mm256_blendv_ps(
            _mm256_castsi256_ps(a),
            _mm256_castsi256_ps(b),
            mask));

Expressed in wide it looks roughly like:

let a: i32x8;
let b: i32x8;
let mask: f32x8; 
a = mask.blend(a,b);

Understandably, this doesn't type check. If I cast with bytemuck, it compiles, but I believe it fails because NaN doesn't get mapped to -1, so the integer mask doesn't work? Is this an operation that makes sense? I want to use a mask derived from a float operation to blend two integer vectors. What is the best way to convert between these types?

Thank you!

EDIT:

Larger context:

let sqrt_d =        _mm256_sqrt_ps(discr);
let tmin =          _mm256_sub_ps(neg_b, sqrt_d);   // -b - sqrt(discr)
let tmax =          _mm256_add_ps(neg_b, sqrt_d);   // -b + sqrt(discr) 

let tol =           _mm256_set1_ps(TOLERANCE);
let tmin_gt_mask =  _mm256_cmp_ps(tmin, tol, _CMP_GT_OQ);

let t =             _mm256_blendv_ps(tmax, tmin, tmin_gt_mask);

let t_gt_mask =     _mm256_cmp_ps(t, tol, _CMP_GT_OQ);

let t_lt_mask =     _mm256_cmp_ps(t, hit_dists, _CMP_LT_OQ);

let hitmask =       _mm256_and_ps(_mm256_and_ps(discrmask, t_gt_mask), t_lt_mask);

hit_ids =         _mm256_castps_si256(
                      _mm256_blendv_ps(
                          _mm256_castsi256_ps(hit_ids),
                          _mm256_castsi256_ps(iteration_ids),
                          hitmask));

hit_dists =         _mm256_blendv_ps(hit_dists, t, hitmask);

wide version

let sqrt_d =        discr.sqrt();
let tmin =          neg_b - sqrt_d;    // -b - sqrt(discr)
let tmax =          neg_b + sqrt_d;    // -b + sqrt(discr)

let tol =           f32x8::from(TOLERANCE);
let tmin_gt_mask =  tmin.cmp_gt(tol);

let t =             tmin_gt_mask.blend(tmax, tmin);

let t_gt_mask =     t.cmp_gt(tol);

let t_lt_mask =     t.cmp_lt(hit_dists);

let hitmask: f32x8 = discrmask & t_gt_mask & t_lt_mask;
  hit_ids =           hitmask.blend(hit_ids, iteration_ids);

  hit_dists =         hitmask.blend(hit_dists, t);
Lokathor commented 9 months ago

i think what you want is round_int or fast_round_int? if i follow

ghost commented 9 months ago

Yeah I think that might be it, maybe:

fn blender(a: i32x8, b: i32x8, mask: f32x8) -> i32x8 {
    mask.blend(a.round_float(), b.round_float()).round_int()
}
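One caveat with this round-trip: f32 only represents integers exactly up to 2^24, so pushing id lanes through floats can silently change large values. A minimal scalar sketch (plain Rust, not the wide API):

```rust
fn main() {
    // Routing an integer lane through f32 is a value conversion, so
    // integers above 2^24 can lose precision, unlike a bit cast which
    // preserves every bit.
    let big: i32 = 16_777_217; // 2^24 + 1: not exactly representable as f32
    let through_float = (big as f32) as i32;
    assert_ne!(through_float, big); // rounds to 16_777_216
    println!("ok");
}
```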

I'll try some bit bashing and see what comes out

another possibility:

fn blender(a: i32x8, b: i32x8, mask: f32x8) -> i32x8 {
    cast(mask.blend(cast(a), cast(b)))
}

as _mm256_cast* is just transmute under the hood
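A cast between float and integer views reinterprets bits without changing them, unlike a value conversion such as round_int. A minimal scalar sketch of that difference using std's `to_bits`/`from_bits`:

```rust
fn main() {
    let x: f32 = -1.5;

    // Bit reinterpretation (what _mm256_castps_si256 does per lane):
    // the 32 bits are unchanged, only the type changes.
    let bits: u32 = x.to_bits();
    assert_eq!(f32::from_bits(bits), -1.5);

    // Value conversion (`as` truncates toward zero): produces a
    // different bit pattern that encodes the integer -1.
    let converted = x as i32;
    assert_eq!(converted, -1);
    assert_ne!(converted as u32, bits);
    println!("ok");
}
```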

Lokathor commented 9 months ago

i32x8 has a blend method without converting to f32x8, probably will have better performance that way
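Putting the thread's pieces together as a scalar sketch (the helper names below are hypothetical; with the crates this would be a `bytemuck` cast of the mask plus `i32x8`'s blend method): a comparison produces all-ones or all-zeros per lane, the all-ones pattern reads as NaN in float view but as -1 in integer view, and a bitwise select with that mask blends integers without touching their values.

```rust
// Scalar per-lane sketch of the mask semantics (hypothetical helpers).
fn cmp_gt_mask(a: f32, b: f32) -> u32 {
    // A SIMD compare writes all-ones for true, all-zeros for false.
    if a > b { u32::MAX } else { 0 }
}

fn blend_i32(mask: u32, t: i32, f: i32) -> i32 {
    // Bitwise select: take bits of `t` where the mask is set, else `f`.
    ((t as u32 & mask) | (f as u32 & !mask)) as i32
}

fn main() {
    let mask = cmp_gt_mask(2.0, 1.0); // a "true" lane

    // All-ones reads as NaN in float view but as -1 in integer view,
    // so a bit-cast comparison mask is exactly what an integer blend wants.
    assert!(f32::from_bits(mask).is_nan());
    assert_eq!(mask as i32, -1);

    assert_eq!(blend_i32(mask, 10, 20), 10);
    assert_eq!(blend_i32(cmp_gt_mask(1.0, 2.0), 10, 20), 20);
    println!("ok");
}
```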