Open twest820 opened 10 months ago
Tagging subscribers to this area: @dotnet/area-system-numerics See info in area-owners.md if you want to be subscribed.
Author: | twest820 |
---|---|
Assignees: | - |
Labels: | `area-System.Numerics` |
Milestone: | - |
1) there's no a priori way of knowing the mask value, so mask cannot be declared const, 2) mask cannot be marked [ConstExpected]
The warning is because you're using Avx.Blend
which emits VBLENDPS
(or VBLENDPD
). The encoding of that is VBLENDPS xmm1, xmm2/m128, imm8
and so the actual hardware requires you to use a constant. The runtime is currently working around this limitation by emitting a call
to a fallback function that contains a 256 case switch table encoding each of the possible encodable constants.
3) mask is a value type consumed directly by blendps and similar instructions
It is not. APIs like Blend
require a constant and it is not expected for users to pass in the result of movemask
or similar. In some C/C++ compilers, this is straight up unsupported and won't even compile.
There is an alternative API, BlendVariable
, which doesn't require a constant. It emit VBLENDVPS
(or VBLENDVPD
) of which the encoding is VBLENDVPS xmm1, xmm2, xmm3/m128
(or BLENDVPS xmm1, xmm2/m128, <XMM0>
on pre AVX hardware).
The expected usage of this is:
Vector128<float> mask = Avx.CompareEqual(arg1, arg2));
Avx.BlendVariable(arg3, arg4, mask);
Doing this should give you a measurable performance increase and behave more as expected.
Class-level CA1857 suppressions are therefore tempting. I'd prefer a more graceful approach if one's available, though.
You should not suppress these. They are warning you of what is functionally "incorrect" code and where you should be using some other API or alternative mechanism in its place.
There are some places where the JIT will try to recognize and optimize particular patterns when the user doesn't pass in a constant, but these are not guaranteed and may not always be supported by other runtimes. There are likewise many places the JIT doesn't try to handle today and where the developer doing the "right thing" themselves is the best option.
This issue has been marked needs-author-action
and may be missing some important information.
There are notable approved, but not yet implemented, analyzers to help direct users away from problematic patterns and towards more correct alternatives: https://github.com/dotnet/runtime/issues/82488 and https://github.com/dotnet/runtime/issues/82486
Cases like users calling Blend
when they should have called BlendVariable
would be covered by such an analyzer.
Until the analyzer can be implemented and supported, I'm happy to help direct how different [ConstExpected]
warnings should be handled. Either by pointing to an equivalent API or an alternative sequence that will be better performing if there is no "variable" alternative.
D'oh, thanks Tanner! That all sounds great; sorry I missed the extra v
.
Do you have a suggestion for the Extract()
case? Since Permute()
and Shuffle()
are both also [ConstExpected]
(and thus routes through Avx.MoveScalar()
also trigger CA1857) the best I'm seeing at the moment is approaches like
static class AvxExtensions
{
public static float Extract(Vector128<float> value, byte element)
{
return element switch
{
0 => Avx.Extract(value, 0),
1 => Avx.Extract(value, 1),
2 => Avx.Extract(value, 2),
3 => Avx.Extract(value, 3),
_ => throw new ArgumentOutOfRangeException(nameof(element))
};
}
}
This clears the CA1857 but, since it feels like trying to second guess JIT despite the disclaimers above, I'm unsure if it's a better bet codegen-wise than
public static float Extract(Vector128<float> value, byte element)
{
return Avx.Extract(value, element); // suppress CA1857 in suppression file
}
For cases like extract
you should probably be using value[index]
or value.GetElement(index)
. Those do not prescribe a particular instruction being used and will do whatever is deemed "most efficient" for the underlying CPU.
In general, using the "cross platform" equivalent API (like x + y
instead of Sse.Add(x, y)
) is preferred where possible.
From a general "write portable code once" standpoint, I agree. Most of the SIMD code I work with goes like class FooAvx256
or Foo.BarAvx256()
, though, meaning typically it's not portable. Or, alternatively, a GetElement()
call's within CPU dispatched code paths where's probably simpler overall to stay consistent with the non-portable code layout—it's rare I write an inner loop body implementable with a shared ARM-x64 subset.
Another reason I have calls like Avx.Extract()
rather than Vector128<T>.[]
or Vector128<T>.GetElement()
is it's a way of getting VEX instructions into debug and release builds without the overhead of doing publishing and risk of relying on <PublishReadyToRunCrossgen2ExtraArgs>-instruction-set:avx,avx2,fma</PublishReadyToRunCrossgen2ExtraArgs>
, which is a build toolchain feature that took years to deliver and still seems undocumented outside of GitHub issues. It's also nice to elide the checking the CLR has to do in order to provide robust []
or GetElement()
implementations—not a big deal but, if deployed as a general pattern, leaning on application specific context and higher level Debug.Assert()
s does instead yield helpful reduction in runtimes.
I've had a chance in the past couple days to flip several SIMD-bearing repos to .NET 8.0 and am not seeing interestingly different variations from what's already been discussed. I think this could therefore be closed since it looks like all the extract options are awkward somewhere.
-instruction-set:avx,avx2,fma
Notably x86-x64-v3
is probably "better" as it targets the entire set of ISAs required by the spec'd v3 x64 machine.
still seems undocumented outside of GitHub issues
It's a fairly esoteric scenario that largely only impacts startup performance. It's typical that regular R2R performance is fine and any hot functions will be rejitted such that you get stable throughput within the first few seconds regardless.
You have a similar option, IlcInstructionSet
, that works for AOT scenarios.
I think this could therefore be closed since it looks like all the extract options are awkward somewhere.
Many hardware instructions can be fairly awkward to use or come with limitations. That's why we expose the xplat APIs to help cover core patterns where the important thing is to do that as fast as possible and not with very specific instructions.
It's a fairly esoteric scenario that largely only impacts startup performance.
Mmm, when I've looked at telling crossgen2 not to worry about legacy CPUs there've been meaningful runtime effects. Even with most of the runtime in compute kernels explicitly coded to System.Runtime.Intrinsics.X86.Avx
. Nothing huge but worth it you're going to kick off a two week run such. Since PublishReadyToRunCrossgen2ExtraArgs
seems to be .NET's only sort of documented equivalent of C++'s /arch:AVX2
IMO it's a mainline build case rather than anything esoteric.
Thanks for mentioning x86-x64-v3
. I can't get any search hits on it, though, so it's unclear if it's something which can be used externally at this point. Microsoft also seems to be pushing AOT for ASP.NET, which isn't among our use cases, and it's not an option in VS 17.8.2's publish settings. While IlcInstructionSet
has recently gotten low-level GitHub documentation it isn't in the official documentation and also lacks any obvious Visual Studio support. Maybe if one of those gets enough attention to become a feature?
What does work great is your System.Runtime.Intrinsics implementation—thank you for all the time saved in coding C#-C++/CLI-C++ stacks—and then doing our own CPU dispatch. There's some overhead there, sure, but it's been lower cost than trying to coax this aspect of .NET builds into happiness.
Description
Mostly I'm hitting
which is odd as there's 1) there's no a priori way of knowing the mask value, so
mask
cannot be declaredconst
, 2) mask cannot be marked[ConstExpected]
, and 3)mask
is a value type consumed directly byblendps
and similar instructions, so there doesn't appear to be an optimization for the compiler to take in response to the mask being constant.There's also at least one similar case trivially addressed at MSIL gen time by loop unrolling (depending on data marshalling requirements for serialization, actual use cases may have 50+
Extract()
s, meaning manual loop unrolling is unrealistic, IMO ).Configuration
Visual Studio 17.8.1.
Regression?
The
Blend()
disassembly I've been getting looks fine, so I'm inclined to say yes. Perhaps there are corner cases where poor codegen can occur, though.Data
This class may be a conveniently simple example for reference.
Analysis
One possibility is
[ConstExpected]
was added to response toconst int imm8
in the intrinsics' C signatures. From what I know, this doesn't appear to be a correct mapping between C/C++ notions of const, C# notions of const, the implications ofConstExpectedAttribute
, and how (E)VEX SIMD instructions access ymm (or zmm or mask) registers. So possibly the CA1857s here artifacts of the tool used for #80192.If it is a correct warning, it's unclear to me how SIMD intrinsic code for x64 processors can be modified in .NET 8.0 to clear it. Googling gives me plenty of hits on SIMD
[ConstExpected]
being added as an 8.0 feature but nothing on code migration paths. Class-level CA1857 suppressions are therefore tempting. I'd prefer a more graceful approach if one's available, though.