dotnet / runtime


More folding to vpternlogd? #107619

Open stephentoub opened 1 week ago

stephentoub commented 1 week ago

Based on the description in https://github.com/dotnet/runtime/pull/91227, I thought both of the following might compile down to a single vpternlogd:

static Vector512<int> Exp1(Vector512<int> a, Vector512<int> b, Vector512<int> c) =>
    Vector512.ConditionalSelect(a, b & c, b | c);

static Vector512<int> Exp2(Vector512<int> a, Vector512<int> b, Vector512<int> c) =>
    (a & (b & c)) | (~a & (b | c));

but they don't today. The first does produce a vpternlogd, but it's just the standard one that ConditionalSelect uses to choose between the two results, so the AND and the OR are still computed separately:

vmovups zmm0, zmmword ptr [r8]
vmovups zmm1, zmmword ptr [r9]
vpandd zmm2, zmm1, zmm0
vpord zmm0, zmm1, zmm0
vpternlogd zmm0, zmm2, zmmword ptr [rdx], -40
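
For reference, the -40 immediate above is the 8-bit truth table 0xD8 printed as a signed byte. A minimal sketch of how to check that, assuming the usual convention in which the first, second, and third vpternlogd operands stand for the constants 0xF0, 0xCC, and 0xAA (the class and helper names are hypothetical and are not JIT code):

```csharp
using System;

public static class TernlogDecode
{
    // Truth-table constants for the first, second, and third vpternlogd operands.
    public const byte A = 0xF0, B = 0xCC, C = 0xAA;

    // Select(mask, x, y) = (mask & x) | (~mask & y), evaluated on truth-table bytes.
    public static byte Select(byte mask, byte x, byte y) =>
        (byte)((mask & x) | (~mask & y));

    public static void Main()
    {
        // In the disassembly above: operand 1 is b | c (zmm0),
        // operand 2 is b & c (zmm2), and operand 3 is a ([rdx]).
        byte control = Select(C, B, A);
        Console.WriteLine($"0x{control:X2} = {(sbyte)control}"); // 0xD8 = -40
    }
}
```

So this immediate encodes only the select step, with the already-computed AND and OR results as its inputs.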

The second results in two vpternlogds that are then or'd together:

vmovups zmm0, zmmword ptr [rdx]
vmovups zmm1, zmmword ptr [r8]
vmovups zmm2, zmmword ptr [r9]
vmovaps zmm3, zmm0
vpternlogd zmm3, zmm2, zmm1, -128
vpternlogd zmm2, zmm1, zmm0, 84
vpord zmm0, zmm2, zmm3

rather than a single vpternlogd that handles the whole bitwise operation.
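
To make the expected result concrete, here is a rough sketch of the control byte a single vpternlogd would need for this expression, derived by evaluating it on the truth-table constants 0xF0, 0xCC, and 0xAA. It assumes .NET 8 or later for Vector512 and Avx512F.TernaryLogic; TernlogFold, Exp2Control, and Exp2Manual are hypothetical names for illustration and say nothing about how the JIT would implement the fold:

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class TernlogFold
{
    // Truth-table constants for the three vpternlogd operands.
    const byte A = 0xF0, B = 0xCC, C = 0xAA;

    // Control byte for (a & (b & c)) | (~a & (b | c)), obtained by evaluating
    // the expression on the truth-table constants.
    public const byte Exp2Control = unchecked((byte)((A & (B & C)) | (~A & (B | C)))); // 0x8E

    // Hypothetical hand-written equivalent of Exp2: one TernaryLogic call when
    // AVX-512F is available, otherwise the original expression.
    public static Vector512<int> Exp2Manual(Vector512<int> a, Vector512<int> b, Vector512<int> c) =>
        Avx512F.IsSupported
            ? Avx512F.TernaryLogic(a, b, c, Exp2Control)
            : (a & (b & c)) | (~a & (b | c));

    public static void Main()
    {
        Console.WriteLine($"0x{Exp2Control:X2}"); // 0x8E

        // These three inputs cover all eight (a, b, c) bit combinations.
        var a = Vector512.Create(0x0F0F0F0F);
        var b = Vector512.Create(0x33333333);
        var c = Vector512.Create(0x55555555);

        var expected = (a & (b & c)) | (~a & (b | c));
        Console.WriteLine(Exp2Manual(a, b, c) == expected);
    }
}
```

Printed as a signed byte the way the disassembly does, 0x8E would show up as -114. The same control byte applies to Exp1, since ConditionalSelect(a, b & c, b | c) is the same bitwise function.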

Is this just further opportunity? Or is there something preventing such optimization?

cc: @tannergooding, @EgorBo

dotnet-policy-service[bot] commented 1 week ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.