Open a74nh opened 1 month ago
SVE provides LD1RW to load a single 32-bit value from memory and broadcast it to all lanes in a vector.
I'm not quite sure why we didn't add this to the SVE API.
However, this can be done via:
Vector<uint> vec = Sve.DuplicateSelectedScalarToVector(Sve.LoadVector(Sve.CreateTrueMaskUInt32(), input), 0);
Which produces:
ptrue p0.s
ld1w { z17.s }, p0/z, [x7]
mov z17.s, s17
This could be optimised to:
ptrue p0.s
ld1rw { z17.s }, p0/z, [x7]
Regardless of whether an API method is added, the optimisation should be done.
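Until either lands, the workaround above could be wrapped in a small helper. This is only a sketch: the class and method names (`BroadcastSketch`, `LoadAndBroadcast`) are hypothetical, and the non-SVE fallback is my own assumption to keep the code usable on other hardware. It needs `AllowUnsafeBlocks`, and it assumes `Sve` is still flagged experimental under `SYSLIB5003`.

```csharp
#pragma warning disable SYSLIB5003 // Assumption: Sve is still marked experimental

using System.Numerics;
using System.Runtime.Intrinsics.Arm;

public static unsafe class BroadcastSketch
{
    // Hypothetical helper, not part of the SVE API.
    // Loads one 32-bit value from `input` and broadcasts it to every lane.
    public static Vector<uint> LoadAndBroadcast(uint* input)
    {
        if (Sve.IsSupported)
        {
            // Same expression as in the issue: full-vector load, then
            // duplicate lane 0. Currently emits ptrue + ld1w + mov;
            // ideally the JIT would fold this into ptrue + ld1rw.
            return Sve.DuplicateSelectedScalarToVector(
                Sve.LoadVector(Sve.CreateTrueMaskUInt32(), input), 0);
        }

        // Assumed fallback for hardware without SVE: scalar load + broadcast.
        return new Vector<uint>(*input);
    }
}
```

Note that the SVE path goes through `ld1w`, so it reads a full vector's worth of memory starting at `input`, not just one element. That is another reason to prefer `ld1rw`, which only touches the single 32-bit value.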