Performance Regression: UTF8 Span Methods 3x slower on AMD Platforms (NetCore2.1 -> 3.1)

jamescourtney commented 4 years ago

Performance of some UTF8 string methods seems to have regressed in a major way on AMD platforms in .NET Core 3.1. Others may be affected as well.

Test:

    [SimpleJob(BenchmarkDotNet.Jobs.RuntimeMoniker.NetCoreApp21)]
    [SimpleJob(BenchmarkDotNet.Jobs.RuntimeMoniker.NetCoreApp31)]
    public class Benchmark
    {
        const string testString = "hello world";
        private static readonly Encoding Utf8NoBom = new UTF8Encoding(false);

        private readonly byte[] writeBuffer = new byte[1024];
        private readonly byte[] readBuffer = Utf8NoBom.GetBytes(testString);

        [Benchmark]
        public void WriteString()
        {
            Utf8NoBom.GetBytes(testString.AsSpan(), this.writeBuffer.AsSpan());
        }

        [Benchmark]
        public void ReadString()
        {
            Utf8NoBom.GetString(this.readBuffer.AsSpan());
        }

        static void Main(string[] args)
        {
            BenchmarkRunner.Run<Benchmark>();
        }
    }

AMD system results:

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET Core SDK=3.1.101
  [Host]     : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  Job-RNGYVQ : .NET Core 2.1.15 (CoreCLR 4.6.28325.01, CoreFX 4.6.28327.02), X64 RyuJIT
  Job-ILXJIL : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT

|      Method |       Runtime |     Mean |    Error |   StdDev |
|------------ |-------------- |---------:|---------:|---------:|
| WriteString | .NET Core 2.1 | 24.68 ns | 0.187 ns | 0.175 ns |
|  ReadString | .NET Core 2.1 | 23.33 ns | 0.373 ns | 0.349 ns |
| WriteString | .NET Core 3.1 | 70.51 ns | 0.787 ns | 0.736 ns |
|  ReadString | .NET Core 3.1 | 86.46 ns | 0.153 ns | 0.143 ns |

Intel system results:

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.1.101
  [Host]     : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  Job-POUEZF : .NET Core 2.1.15 (CoreCLR 4.6.28325.01, CoreFX 4.6.28327.02), X64 RyuJIT
  Job-JZNDHJ : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT

|      Method |       Runtime |     Mean |    Error |   StdDev |
|------------ |-------------- |---------:|---------:|---------:|
| WriteString | .NET Core 2.1 | 34.38 ns | 0.709 ns | 1.204 ns |
|  ReadString | .NET Core 2.1 | 31.24 ns | 0.652 ns | 1.376 ns |
| WriteString | .NET Core 3.1 | 13.61 ns | 0.301 ns | 0.519 ns |
|  ReadString | .NET Core 3.1 | 29.86 ns | 0.623 ns | 1.340 ns |

scalablecory commented 4 years ago

@GrabYourPitchforks

GrabYourPitchforks commented 4 years ago

@tannergooding Is the Ryzen CPU listed here one of the models that suffers from poor pdep / pext performance? Those instructions can occur on the ASCII processing code paths.

I can investigate using some of the SIMD instructions here where possible (packuswb and the like), but it's almost certainly going to increase the latency in these code paths.

jamescourtney commented 4 years ago

That's really interesting, actually. I've learned something today!

Based on my reading, it looks as if this instruction is implemented in microcode on my chip, while Intel has a hardware implementation for it. Also understand why the CLR wouldn't want to be in the business of micro optimizations like this for different architectures.

Running a quick experiment using the Kernel32 APIs for MultiByteToWideChar were slightly quicker, but still not as fast as NetCore2.1.

Thanks for the responses here.

tannergooding commented 4 years ago

Yes. pdep/pext are implemented in Microcode for all zen microarchitectures.

You can see this in the x86 instruction scheduling model that AMD contributed for LLVM:

WriteMicrocoded is defaulted to Latency = 100: https://github.com/llvm/llvm-project/blob/master/llvm/lib/Target/X86/X86ScheduleZnver1.td#L487-L490. Nearby comments indicate "We set long latency so as to block the entire pipeline."

This might be a good first usecase for https://github.com/dotnet/runtime/issues/785. The savings on platforms with fast pdep/pext are significant, but so are the drawbacks for platforms with slow versions.

GrabYourPitchforks commented 4 years ago

@jamescourtney - Can you set the environment variable COMPlus_EnableBMI2=0 and see if that improves your performance? That will disable the BMI2-specific instructions that we use as an optimization.

(Tanner informs me that the environment variable processing logic is case-sensitive on Linux. It shouldn't matter on Windows.)

jamescourtney commented 4 years ago

Setting that environment variable resolves the issue completely. Write performance more than doubles over NetCore2.1:

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET Core SDK=3.1.101
  [Host]     : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  Job-YGYXSM : .NET Core 2.1.15 (CoreCLR 4.6.28325.01, CoreFX 4.6.28327.02), X64 RyuJIT
  Job-LLDHGN : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT

|      Method |       Runtime |      Mean |     Error |    StdDev |
|------------ |-------------- |----------:|----------:|----------:|
| WriteString | .NET Core 2.1 | 24.480 ns | 0.2870 ns | 0.2684 ns |
|  ReadString | .NET Core 2.1 | 24.004 ns | 0.2095 ns | 0.1960 ns |
| WriteString | .NET Core 3.1 |  9.832 ns | 0.0429 ns | 0.0358 ns |
|  ReadString | .NET Core 3.1 | 23.654 ns | 0.1812 ns | 0.1606 ns |

damageboy commented 4 years ago

Just adding a small something to this discussion. While it's early in my morning and I'm too groggy to actually look at the code calling PDEP... If we had something like CPUID and willingness to use specific code paths depending on CPUID @ runtime, some PDEPs can be converted rather efficiently to beat AMD's microcode:

https://github.com/zwegner/zp7

jkotas commented 4 years ago

CPUID and willingness to use specific code paths depending on CPUID

The model for the hardware intrinsic is complex already. This would make it even more complex than what it is today.

My suggestion for how to fix this issue would be to make Bmi2.IsSupported false on AMD64 processors. Once/if the AMD64 processors get this performance bug fixed, we can update the runtime to detect the versions with the fix and start returning true from this method again.

GrabYourPitchforks commented 4 years ago

I'm torn on whether the runtime lying about the intrinsic being supported is the correct approach. That seems put the runtime into the position of making a value judgment. Up until now the intrinsics have been only raw capabilities checks.

jkotas commented 4 years ago

This should be a point in time problem. I expect AMD to fix this performance bug and then all this will be legacy that I everybody will try to forget about.

The reason why I have proposed implementing the workaround in the runtime is that it will be done in just one central place instead of spread over all places that use this instruction. As with any workaround, there may be corner cases where it may not be perfect, but that's ok.

benaadams commented 4 years ago

That seems put the runtime into the position of making a value judgment.

Loose some non-flag altering instructions (probably not high usage); and BZHI which can be replaced by [src & (1 << inx)-1] which is the downside and doesn't seem too high?

However between 18x and 290x times slower for PDEP and PEXT makes them completely undesirable instructions to use?

Can provide a more general purpose fallback versions in BitOperations for .NET 5.0?

tannergooding commented 4 years ago

My suggestion for how to fix this issue would be to make Bmi2.IsSupported false on AMD64 processors

I would be strongly against this approach both in terms of how the hardware intrinsics have been exposed and as someone who has an AMD processor. Bmi2 exposes 10 APIs, only 4 of which have an issue. We shouldn't restrict the ability to use the other 6 APIs just because 4 of them have bad perf.

We have provided no guarantees that using a given hardware intrinsic will be "performant"; only that it will emit the instruction a given API is mapped to. It has always been up to the user to utilize them efficiently (which includes pipelining, dealing with micro-architectural differences, etc). I also don't think it will be a point in time issue. Not all customers upgrade their CPUs every 3-5 years and we'll be in the same position for the AMD Zen1/2 CPUs even if Zen3 ends up resolving the issue. Looking at the CPUs that Azure lists (https://azure.microsoft.com/en-us/pricing/details/virtual-machines/series/), the majority are "Broadwell" which is circa 2013; just to give a rough timeline on how long I think we would need to support the workaround if implemented.

The reason why I have proposed implementing the workaround in the runtime

I'm not convinced we need a workaround. Looking at the Intel numbers with BMI2 enabled and the AMD numbers with BMI2 disabled; they are actually similar:

Intel Write: 34.38 ns to 13.61 ns
Intel Read:  31.24 ns to 29.86 ns
AMD Write:   24.48 ns to  9.83 ns
AMD Read:    24.00 ns to 23.65 ns

We should likely, first and foremost, get some numbers on how much "gain" Intel is getting via PDEP/PEXT here. My initial speculation is that they are not actually providing that great of a benefit and we would likely minimally regress Intel (while significantly boosting AMD) by just dropping their usage in the framework (at least here where it is only used "once" per call; and not in a hot loop).

The model for the hardware intrinsic is complex already. This would make it even more complex than what it is today.

I don't believe it will be significantly more complex. We are already testing the various ISA combinations. We will continue to get full coverage of code paths even if we codify a few paths to do if (Isa.IsSupported && !Cpuid.IsAuthenticAMD) { } else { /* code path where Isa is not supported */ } (although, as per the above; I don't believe this is required).

jkotas commented 4 years ago

One of the key values of .NET platform is simplicity. Among other things, we hide number of OS bugs, quirks and performance issues by including workarounds for them in the platform, so that our users do not have to worry about them. These workarounds may stay in the runtime code for years, but they make the .NET platform better in the long run.

We shouldn't restrict the ability to use the other 6 APIs just because 4 of them have bad perf.

Yep, the workarounds are not perfect. As @benaadams explained, the other instructions do not seem to be that valuable, so we are not losing much.

dropping their usage in the framework (at least here where it is only used "once" per call; and not in a hot loop).

The effect of this is nearly identical to returning false from Bmi2.IsSupported, except it is spread to more places and everybody using these intrisics needs to worry about it.

if (Isa.IsSupported && !Cpuid.IsAuthenticAMD)

I do not think we would want to promote this pattern. It is not future proof. It will not work well once AMD fixes this performance bug.

tannergooding commented 4 years ago

One of the key values of .NET platform is simplicity. Among other things, we hide number of OS bugs, quirks and performance issues by including workarounds for them in the platform, so that our users do not have to worry about them. These workarounds may stay in the runtime code for years, but they make the .NET platform better in the long run.

There are exceptions to every rule and System.Runtime.Intrinsics violates most of the standard simplicity rules the framework has. It is ~1500 hardware specific APIs (and that is practically doubling with the introduction of the ARM intrinsics). It requires the user to write unsafe code, to code multiple similar algorithms for the various paths to ensure good codegen is possible on all paths and that a software fallback is available.

There can already be large performance differences for a given instruction depending on what microarchitecture you are comparing; it isn't just limited to PDEP/PEXT (which happens to be an extreme case). If you go back to pre-AVX machines; you will get severe penalties for using unaligned loads/stores. If you use non-temporal stores (even on modern hardware) for the wrong scenario, you can be looking at ~400 cycle latency rather than 1 cycle latency; but used in the right scenario (such as large memcpy), you can achieve over a 2x perf increase. If you compare "Broadwell" to the newer "Skylake-X" (https://www.agner.org/optimize/instruction_tables.pdf), there are several instructions that have gotten slower and several that have gotten faster (sometimes significantly). Developers using hardware intrinsics need to consider these scenarios; us included.

The other instructions do not seem to be that valuable, so we are not losing much.

I believe this is going to set a bad precedent for the hardware intrinsics. Especially if this happens to a larger ISA where a couple of instructions happen to behave poorly. For example, we may eventually add AVX-512 support. That support is relatively new on Intel and doesn't exist on AMD yet, but may exist there someday. AVX-512 is split into multiple feature ISAs and exposes ranges of instructions. It may be that one of the two (Intel or AMD) has significantly worse performance on an instruction as compared to the other. Someone will likewise log a bug and then this will be looked to as an example of how it should be handled.

In my mind, the correct thing is to just reiterate that this is an extremely low level feature that allows you to fine tune performance, but as part of that, the amount of code you need to write and what considerations you need to take are also greatly expanded (it is effectively writing pseudo-inline assembly after all, just without control over register allocation and spilling).

The effect of this is nearly identical to returning false from Bmi2.IsSupported, except it is spread to more places and everybody using these intrisics needs to worry about it.

Yes, which is something developers already need to be concerned about. They already need to consider manufacturer and microarchitecture differences because what works well on "X" may not work well on "Y". There is generally some pattern that works well on both, but that is not always the case/scenario.

It is not future proof. It will not work well once AMD fixes this performance bug.

Neither is disabling the functionality for AMD systems. We will still have the consideration of Znver1/Znver2 vs the new ZnverX that has it be faster. CPUID is meant to expose hardware specific information such as the manufacturer ID, vendor, and model numbers. It can therefore be used to limit code paths to manufacturers (IsGenuineIntel vs IsAuthenticAMD) or even specific CPU models (IsAuthenticAMD && (ExtendedModelId < value)). The difference is whether it is internalized to the JIT and prevents use of instructions on hardware that supports it or if it puts the responsibility on the developer to ensure the code path is efficient on all hardware they plan on targeting (which I believe is more inline with the principles of the feature and what developers already need to consider).

jkotas commented 4 years ago

System.Runtime.Intrinsics violates most of the standard simplicity rules the framework has

Right. So it is a good reason to not make it worse.

I believe this is going to set a bad precedent for the hardware intrinsics

I believe that this would be a good precedent for the hardware intrinsics. If there is a configuration with broken implementation of the hardware intrinsics (whether it is broken functionality or broken performance), we should pretend that the specific hardware intrinsics do not exist in the runtime. It would not be the first time we have done such thing. For example, we used to artificially disable vectorized Vector<T> in some configurations.

limit code paths to manufacturers (IsGenuineIntel vs IsAuthenticAMD) or even specific CPU models (IsAuthenticAMD && (ExtendedModelId < value))

Some of our important customers are not happy with dependency on JIT for these paths. We have plans to implement AOT for hardware intrinsics by pregenerating multiple variants of the method (it is runtime teams OKR). If you start allowing special casing like this, the matrix for this is going to get out of control.

the principles of the feature

I believe that underlying question of this discussion (and several other discussion in other issues) is whether we need to go out of our way to expose 100% of the hardware intrinsics potential, or whether it is good enough to keep things simpler and expose just 95%. I strongly believe that 95% is good enough.

tannergooding commented 4 years ago

we should pretend that the specific hardware intrinsics do not exist in the runtime

There is not an easy way to do that without disabling an entire ISA and that won't be viable for larger ISAs where performance problems exist.

If you start allowing special casing like this, the matrix for this is going to get out of control.

Unless I'm missing something, that is already going to be the case. Once AOT comes into play, you need to have a runtime or compile time consideration for this performance difference.

For the BMI2 code path, that means the AOT code needs to do one of:

Always go down BMI2 if hardware supports it (fast on Intel, slow on current AMD)
- Current state and default state for any new ISAs introduced
Have a check for AMD vs Intel to ensure both are fast and Intel can use BMI2
- The only state where ISA can be supported without degrading perf on one platform
Disable BMI2 everywhere (negligible perf degredation on Intel but still over 2x faster than 2.1; and fast on AMD)
- In this case, it doesn't appear to be a problem given that PDEP/PEXT aren't used in hot paths; only on the tail end of a hot loop. It would likely be better to just drop the usage of PDEP/PEXT here after measuring the actual impact (which I plan on measuring here shortly).

tannergooding commented 4 years ago

(sorry, hit a shortcut key by accident)

jkotas commented 4 years ago

that won't be viable for larger ISAs where performance problems exist.

I think it would be perfectly viable. Half-done hardware features are hard for the software to deal with. It is a problem for hardware manufactures to fix, not a problem for software to invent a perfect workarounds for.

you need to have a runtime or compile time consideration for this performance difference.

Right. The plan is to have a matrix. For example, if the method contains Bmi2.IsSupported check, we are going to have one copy of the code of the method compiled for when it returns true and one copy of the method for when it returns false. At runtime, we will pick the right one based on what Bmi2.IsSupported returns.

If you allow e.g. checks on ModelId versions, the complexity of how this is detected, encoded and propagated through the system goes through the roof.

tannergooding commented 4 years ago

The plan is to have a matrix. For example, if the method contains Bmi2.IsSupported check, we are going to have one copy of the code of the method compiled for when it returns true and one copy of the method for when it returns false. At runtime, we will pick the right one based on what Bmi2.IsSupported returns.

Yes, I understand this. It is very similar to the outer loop tests we already run where we cover all the valid ISA configurations using the COMPlus_Enable{ISA}=0 flags.

What I don't understand is how you plan on addressing the perf of PDEP/PEXT for Intel vs AMD here... Actually maybe something just partially clicked. When you said AMD64 were you referring to all x86 processors (both Intel and AMD; both 32-bit and 64-bit) or just AMD processors?

I'm still not clear on how this will be resolved in the future; where BMI2 is "fast" on some future AMD machine but is still slow on the Znver1/Znver2 machines (which will likely still be in use; in both developer machines and in the Azure VMs that utilize it for presumably the next 7-10 years).

Half-done hardware features are hard for the software to deal with

Having a particular instruction be slow doesn't make it half done. Both Intel and AMD have varying performance for instructions based on a number of factors, including what is believed to be "mainline" scenarios. There are some instructions (like MASKMOVDQU) that get slower from an older generation to a newer one because of this.

jkotas commented 4 years ago

What I don't understand is how you plan on addressing the perf of PDEP/PEXT for Intel vs AMD here

The right AOT method body to use will be picked on the information computed by EEJitManager::SetCpuInfo.

If we want with my proposed solution, we would add && !IsGenuinceAMD() here: https://github.com/dotnet/runtime/blob/fcd862e06413a000f9cafa9d2f359226c60b9b42/src/coreclr/src/vm/codeman.cpp#L1456 , and both the JIT and the logic that selects that right AOT method body would pick it up.

get slower from an older generation to a newer one because of this.

It is ok to get 2x slower - happens all the time. It is not ok to get 100x slower - it is what I consider broken.

tannergooding commented 4 years ago

If we want with my proposed solution, we would add && !IsGenuinceAMD() here

I don't see how this is significantly better than allowing the same check to be done on the managed side. If you do it in the JIT, then AMD machines can't use any BMI2 instructions and devs don't need to consider the platform differences. If you do it in managed land, devs can customize the scenario based on what they are doing but must take these differences into account (much as they already need to for other scenarios).

In either case, we will eventually need to either accept the perf loss on specific hardware or limit the disablement to specific microarchitectures (e.g. only disable on AMD Znver1/Znver2 but not on AMD FutureMicroarch). The latter (codifying it in managed land) case allows code to be revised independently of the JIT and allows devs to continue having control over the hardware intrinsics they are able to use.

The latter will also likely become possible if https://github.com/dotnet/runtime/issues/785 is reviewed and approved. Even if the CPUID call itself is done via calling the __cpuid C/C++ intrinsic via an FCALL; devs will be able to cache information like IsAuthenticAMD or the vendor/model information in a static readonly and the JIT should end up properly optimizing it (as it already does with other similar scenarios). It would not be a lot of effort to just have the JIT cache that information as part of the existing CPUID checks we do in the VM and to have them be explicit constants.

jkotas commented 4 years ago

I don't see how this is significantly better than allowing the same check to be done on the managed side.

It believe that it is better because of it is much simpler. It is the question that I have said above - whether we need to go out of our way to expose 100% of the hardware intrinsics potential, or whether it is good enough to keep things simpler and expose just 95%.

GrabYourPitchforks commented 4 years ago

Would it be worthwhile for me to experiment with removing BMI2 from the transcoding code paths? Where it's not in a tight loop we can remove intrinsics entirely, and where it is in a tight loop I might be able to substitute a SIMD implementation. It might regress Intel performance slightly but it should substantially improve AMD performance.

tannergooding commented 4 years ago

My belief is that it goes against the principles of the hardware intrinsics. They were designed to give developers the ability to access the functionality available in the underlying hardware. While one of the main focuses is performance, there is no possible way for us to "cleanly" expose only things which are always performant (as it greatly depends on the use-case etc). Even for PDEP/PEXT (on AMD); there are some cases where it takes as few as 18 cycles and some cases where it takes as many as almost 300.

Developers already have to code support for vastly different CPUs (ARM vs x86) and for different hardware levels (SSE/SSE2 as a baseline vs SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, FMA, BMI1, BMI2, LZCNT, POPCNT etc; and mixtures there-in). They also already need to consider microarchitectural differences to an extent (instructions may perform significantly worse on some older hardware; the cost of using unaligned vector loads/stores may be significantly greater or lesser; etc). Allowing them to explicitly check for the manufacturer and model/family isn't a great jump from there and is something that has already been requested by external customers.

jkotas commented 4 years ago

Would it be worthwhile for me to experiment with removing BMI2 from the transcoding code paths?

Yes, I think we should consider this fix for .NET Core 3.1 servicing.

jkotas commented 4 years ago

And the other variant of the fix to consider to servicing is to disable Bmi2 for AMD processors.

Each one of these fixes comes with a different risk profile and trade-offs. You can let servicing shiproom to pick which one they like better.

damageboy commented 4 years ago

@jkotas I do understand your point of view, but ultimately, once the intrinsics cat is out of the bag, a world of complexity comes with it. Sure, you could postpone supporting/dealing with this to some extent, but ultimately this PDEP/BMI2 is an early/light example of what is to come in the future:

Cache-lines will cease being homogenous as they were since 2003 (it seems like 128 byte cache-lines are coming down the pipe)
If/When AVX512 is supported, developers will want to detect and be able to take advantage / provide code paths that use the 16 extra architectural registers that the processor supports.
There are, except for PDEP, as @tannergooding pointed out, a wide variety of quirks associated with this world of bleeding edge instructions and different models from the same manufacturer exhibit wildly different results when instructed to execute specific instructions.

It seems very arbitrary to draw a line around what sort of complex view developers "get to see" accepting partial solutions like the IsSupported boolean property, while rejecting CPUID and its likes for the general case.

As an example, Let's assume that at some future point in time, CoreCLR will support AVX512 to some extent, if only to be able to support VNNI extensions for ML.NET or any of the 20 or so instruction families that come in future processors, simply because you will not be able to withstand the pressure to provide the associated performance/functionality that comes with it: Hell

I assume we'll all agree (?) that AVX512 support will be exposed through an imaginary AVX512F.IsSupported property, much like AVX2 is exposed today? What would prevent some future developer from detecting such AVX512 support through AVX512F.IsSupported property, using that as a "signal" to provide a different AVX2 code-path, under the assumption that the full set of 32 ymm0-31 registers would be supported by the JIT during code-generation? This is essentially an abuse of support you would have to provide in order to expose code-paths that you never intended to support (e.g. multiple code paths per # of AVX2 registers available on a given CPU). This is essentially equivalent to exposing the actual data through CPUID in a clear and complete way allowing developers to query the CPU how many registers are supported in an explicit way.

Querying the CPU manfacturer, through CPUID, as per #785, to detect an AMD processor, or querying cache latency / line sizes / structure shouldn't be considered as radically different than getting partial CPUID "bits" leaked out through the IsSupported flags.

I guess that my argument is, and perhaps this is also the problem with it, that you either accept your fate and the need to support these exotic instructions that come with their complexities because you wish to be able to play that game and compete with the programming language that do, or you don't. I personally don't see so much of a middle ground that makes sense. And I can easily see a variaty of ways where over-simplifying will ultimately fail.

Then again, I'm not the one tasked to provide and maintain this complexity, so ╮( ˘ ､ ˘ )╭

tannergooding commented 4 years ago

I worked with @GrabYourPitchforks and got some numbers on both AMD and Intel for the current code with BMI2 On vs Off.

For the first scenario, I ran the benchmark given in the original post, but covered three input strings:

The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog 😃
😃 The quick brown fox jumps over the lazy dog

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.200-preview-014883
  [Host]     : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT
  Job-AMFEXG : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT

Bmi2.IsSupported=False
|         Method |     Mean |    Error |   StdDev |
|--------------- |---------:|---------:|---------:|
|   WriteString1 | 15.25 ns | 0.368 ns | 0.410 ns |
|   WriteString2 | 19.16 ns | 0.080 ns | 0.075 ns |
|   WriteString3 | 29.95 ns | 0.166 ns | 0.156 ns |
|    ReadString1 | 31.41 ns | 0.282 ns | 0.264 ns |
|    ReadString2 | 42.04 ns | 0.282 ns | 0.264 ns |
|    ReadString3 | 50.46 ns | 0.320 ns | 0.300 ns |

Bmi2.IsSupported=True
|         Method |     Mean |    Error |   StdDev |
|--------------- |---------:|---------:|---------:|
|   WriteString1 | 14.18 ns | 0.065 ns | 0.061 ns |
|   WriteString2 | 19.35 ns | 0.082 ns | 0.068 ns |
|   WriteString3 | 25.60 ns | 0.098 ns | 0.091 ns |
|    ReadString1 | 31.67 ns | 0.235 ns | 0.220 ns |
|    ReadString2 | 41.63 ns | 0.811 ns | 0.719 ns |
|    ReadString3 | 46.27 ns | 0.436 ns | 0.364 ns |

vs

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
AMD Ryzen 7 1800X, 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.100-alpha1-015753
  [Host]     : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT  [AttachedDebugger]
  Job-PEKDQT : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT

Runtime=.NET Core 3.1

Bmi2.IsSupported=False
|         Method |     Mean |    Error |   StdDev |
|--------------- |---------:|---------:|---------:|
|   WriteString1 | 14.37 ns | 0.104 ns | 0.093 ns |
|   WriteString2 | 30.72 ns | 0.059 ns | 0.056 ns |
|   WriteString3 | 31.78 ns | 0.424 ns | 0.396 ns |
|    ReadString1 | 38.52 ns | 0.422 ns | 0.352 ns |
|    ReadString2 | 59.28 ns | 0.523 ns | 0.464 ns |
|    ReadString3 | 70.88 ns | 0.558 ns | 0.495 ns |

Bmi2.IsSupported=True
|         Method |      Mean |    Error |   StdDev |
|--------------- |----------:|---------:|---------:|
|   WriteString1 |  93.32 ns | 0.312 ns | 0.291 ns |
|   WriteString2 |  84.74 ns | 0.315 ns | 0.295 ns |
|   WriteString3 | 424.22 ns | 2.307 ns | 2.158 ns |
|    ReadString1 |  95.77 ns | 0.941 ns | 0.880 ns |
|    ReadString2 | 137.21 ns | 0.931 ns | 0.777 ns |
|    ReadString3 | 509.57 ns | 2.685 ns | 2.512 ns |

tannergooding commented 4 years ago

Then, we also tested a more "real-world" scenario, where the benchmark data comes from: https://github.com/GrabYourPitchforks/fast-utf8/tree/master/FastUtf8Tester/SampleTexts

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Buffers;
using System.IO;
using System.Reflection;
using System.Text;

namespace ConsoleAppBenchmark
{
    public class TranscodingRunner
    {
        private static readonly string SampleTextsFolder = Path.GetDirectoryName(Assembly.GetEntryAssembly().Location);

        private byte[] _dataAsUtf8;
        private char[] _dataAsUtf16;

        [Params("11.txt", "11-0.txt", "25249-0.txt")]
        public string Corpus;

        [GlobalSetup]
        public void Setup()
        {
            _dataAsUtf8 = File.ReadAllBytes(Path.Combine(SampleTextsFolder, Corpus));

            if (_dataAsUtf8.AsSpan().StartsWith(Encoding.UTF8.Preamble))
            {
                _dataAsUtf8 = _dataAsUtf8[Encoding.UTF8.Preamble.Length..];
            }

            _dataAsUtf16 = Encoding.UTF8.GetChars(_dataAsUtf8);
        }

        [Benchmark]
        public void TranscodeUtf16ToUtf8()
        {
            byte[] rented = ArrayPool<byte>.Shared.Rent(_dataAsUtf8.Length);
            Encoding.UTF8.GetBytes(_dataAsUtf16, rented);
            ArrayPool<byte>.Shared.Return(rented);
        }

        [Benchmark]
        public void TranscodeUtf8ToUtf16()
        {
            char[] rented = ArrayPool<char>.Shared.Rent(_dataAsUtf16.Length);
            Encoding.UTF8.GetChars(_dataAsUtf8, rented);
            ArrayPool<char>.Shared.Return(rented);
        }

        static void Main(string[] args)
        {
            BenchmarkRunner.Run<TranscodingRunner>();
        }
    }
}

tannergooding commented 4 years ago

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.200-preview-014883
  [Host]     : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT
  Job-AMFEXG : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT

Bmi2.IsSupported=False
|               Method |      Corpus |      Mean |    Error |   StdDev |
|--------------------- |------------ |----------:|---------:|---------:|
| TranscodeUtf16ToUtf8 |    11-0.txt |  86.82 us | 1.037 us | 0.970 us |
| TranscodeUtf8ToUtf16 |    11-0.txt |  88.18 us | 0.510 us | 0.426 us |
| TranscodeUtf16ToUtf8 |      11.txt |  11.69 us | 0.229 us | 0.306 us |
| TranscodeUtf8ToUtf16 |      11.txt |  11.42 us | 0.133 us | 0.125 us |
| TranscodeUtf16ToUtf8 | 25249-0.txt | 111.47 us | 0.460 us | 0.384 us |
| TranscodeUtf8ToUtf16 | 25249-0.txt | 129.61 us | 0.418 us | 0.391 us |

Bmi2.IsSupported=True
|               Method |      Corpus |      Mean |    Error |   StdDev |
|--------------------- |------------ |----------:|---------:|---------:|
| TranscodeUtf16ToUtf8 |    11-0.txt |  78.98 us | 0.623 us | 0.552 us |
| TranscodeUtf8ToUtf16 |    11-0.txt |  77.29 us | 0.429 us | 0.358 us |
| TranscodeUtf16ToUtf8 |      11.txt |  11.75 us | 0.234 us | 0.336 us |
| TranscodeUtf8ToUtf16 |      11.txt |  11.66 us | 0.117 us | 0.110 us |
| TranscodeUtf16ToUtf8 | 25249-0.txt |  98.70 us | 0.358 us | 0.317 us |
| TranscodeUtf8ToUtf16 | 25249-0.txt | 113.68 us | 0.604 us | 0.565 us |

vs

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
AMD Ryzen 7 1800X, 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.100-alpha1-015753
  [Host]     : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT  [AttachedDebugger]
  Job-PEKDQT : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT

Bmi2.IsSupported=False
|               Method |      Corpus |       Mean |     Error |    StdDev |
|--------------------- |------------ |-----------:|----------:|----------:|
| TranscodeUtf16ToUtf8 |    11-0.txt |  80.670 us | 0.3441 us | 0.3050 us |
| TranscodeUtf8ToUtf16 |    11-0.txt |  82.702 us | 1.5214 us | 1.3487 us |
| TranscodeUtf16ToUtf8 |      11.txt |   8.739 us | 0.0231 us | 0.0193 us |
| TranscodeUtf8ToUtf16 |      11.txt |   9.061 us | 0.0592 us | 0.0525 us |
| TranscodeUtf16ToUtf8 | 25249-0.txt |  91.262 us | 0.4225 us | 0.3952 us |
| TranscodeUtf8ToUtf16 | 25249-0.txt | 113.486 us | 0.7630 us | 0.7137 us |

Bmi2.IsSupported=True
|               Method |      Corpus |         Mean |     Error |    StdDev |
|--------------------- |------------ |-------------:|----------:|----------:|
| TranscodeUtf16ToUtf8 |    11-0.txt | 1,450.936 us | 7.2278 us | 6.4073 us |
| TranscodeUtf8ToUtf16 |    11-0.txt | 1,571.052 us | 6.3881 us | 5.9754 us |
| TranscodeUtf16ToUtf8 |      11.txt |     7.969 us | 0.0263 us | 0.0220 us |
| TranscodeUtf8ToUtf16 |      11.txt |     9.045 us | 0.0578 us | 0.0482 us |
| TranscodeUtf16ToUtf8 | 25249-0.txt |   263.277 us | 1.1028 us | 0.9209 us |
| TranscodeUtf8ToUtf16 | 25249-0.txt | 1,270.980 us | 2.5484 us | 2.2591 us |

tannergooding commented 4 years ago

I can also provide numbers against a AMD Ryzen 9 3900X 1 CPU, 24 logical and 12 physical cores if needed; but I suspect the numbers will be similar.

jkotas commented 4 years ago

the need to support these exotic instructions that come with their complexities

We have dealt with our own share of complexities to deal with the varying performance characteristics of primitives, like the pause instruction. We ended up measuring how it behaves at runtime and adjusting the runtime behavior based on that. In theory, you may be able to do that by looking at CPUIDs and what not. In practice, we found that it is impossible to get it right for all possible variants.

The SIMD intrinsics are even harder to get right. I believe that there is negative value in trying to teach people how to conditionalize those using CPUIDs. There is no way these condition would be right, and then we would just have end up with a pile of problems every time a new hardware generation shows up.

benaadams commented 4 years ago

There is currently no IsAuthenticAMD api or an ExtendedModelId api so end users cannot use this in their code with an BMI2.IsSupported check; so that currently isn't a possibility anyway?

jkotas commented 4 years ago

There is currently no IsAuthenticAMD api or an ExtendedModelId api so end users cannot use this in their code with an BMI2.IsSupported check

Right. And I do not believe that we want to add APIs like this to the core platform. If somebody believes that they really need them, they can build them on their own.

GrabYourPitchforks commented 4 years ago

I removed the BMI2 calls from the hot path and replaced them with SIMD instructions instead. On my Intel box it results in some small perf wins.

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-alpha1-015755
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.19.56303, CoreFX 5.0.19.56306), X64 RyuJIT
  Job-PDISED : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
  Job-PIYPPK : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT

Method	Toolchain	Corpus	Mean	Error	StdDev	Ratio	RatioSD
TranscodeUtf16ToUtf8	master	11-0.txt	93.44 us	2.017 us	1.788 us	1.00	0.00
TranscodeUtf16ToUtf8	proto	11-0.txt	87.64 us	0.750 us	0.702 us	0.94	0.02

I haven't yet run the full test battery over it, so I don't know how it will respond for smaller payloads. And I don't have access to an AMD box for testing right now. But nevertheless it's promising.

Edit: The particular change that gave these numbers is https://github.com/GrabYourPitchforks/runtime/commit/f35512820e0e5efbecb0943224dfe11bdcf249c9.

damageboy commented 4 years ago

I feel bad for doing my share in hijacking this issue, so I'll answer @jkotas on #785 Apologies to @GrabYourPitchforks and others.

EgorBo commented 3 years ago

Looks like PDEP/PEXT aren't a problem on new AMD processors (Zen3) ? https://twitter.com/0x22h/status/1319177288039104517

dotnet / runtime

Performance Regression: UTF8 Span Methods 3x slower on AMD Platforms (NetCore2.1 -> 3.1) #2251