CRT Pow function has bad performance on Windows

fiigii commented 6 years ago

During benchmarking AoS/SoA ray-tracer https://github.com/dotnet/coreclr/pull/18839, we found that the Vector3 benchmark (RayTracer) is much slower on Windows than Linux.

Execution time	Windows	Linux
Baseline (RayTracer )	6.00s	4.13s
PacketTracer	1.20s	1.35s
Performance Gains	5.00x	3.06x

According to VTune analysis, this gap is caused by the CRT math library, which RayTracer uses Math.Pow at https://github.com/dotnet/coreclr/blob/master/tests/src/JIT/Performance/CodeQuality/SIMD/RayTracer/Raytracer.cs#L153

Windows

Linux

On the left side (AoS means RayTracer), we can see ucrtbase.dll on Windows has much more time consuming and instruction retired than libm-2.23.so on Linux.

The data is collected on Core i9 + VS2017, but Core i7+ VS2015 has the same performance gap.

tannergooding commented 6 years ago

@fiigii, do you have any metrics for what time is spent in ucrtbase.dll (as well as RayTracer.dll and libm-2.23.so)? I wonder if the metrics are partially "skewed" due to different inlining characteristics of the math libraries/etc.

fiigii commented 6 years ago

@tannergooding Here is the VTune data of CRT

fiigii commented 6 years ago

The CPI of pow on Windows looks healthy, the issue may be from its algorithm/implementation.

fiigii commented 6 years ago

cc @AndyAyersMS @CarolEidt @jkotas

tannergooding commented 6 years ago

the issue may be from its algorithm/implementation.

@fiigii, that would be a bit surprising. I'm looking at the implementation and it is some fairly heavily optimized FMA3 code (there is also an SSE2 code path, but you shouldn't hit that).

Unfortunately, that implementation is closed source, so I can't share it here.

tannergooding commented 6 years ago

I'm trying to collect a trace locally as well, to see if I get the same.

fiigii commented 6 years ago

is some fairly heavily optimized FMA3 code

Right, I saw it also from disasm.

tannergooding commented 6 years ago

@fiigii, how was CoreCLR compiled for you?

I'm testing on only a i7-7700 @ 3.6 GHz and the run completes in just 1.231s (as compared to the nearly 6s the above shows) -- This is for Baseline (AoS))

fiigii commented 6 years ago

I'm testing on only a i7-7700 @ 3.6 GHz and the run completes in just 1.231s (as compared to the nearly 6s the above shows) -- This is for Baseline (AoS))

I changed the image size from 250x250 to 2480x2480 and just rendered one image (not using RenderLoop) to avoid the collection pool imapct for the profling. https://github.com/dotnet/coreclr/blob/master/tests/src/JIT/Performance/CodeQuality/SIMD/RayTracer/RayTracerBench.cs#L33-L34 Sorry for the unclarity.

tannergooding commented 6 years ago

just rendered one image (not using RenderLoop) to avoid the collection pool imapct for the profling.

So, to clarify, you changed just the following?:

-private const int Width = 250;
-private const int Height = 250;
-private const int Iterations = 7;
+private const int Width = 2480;
+private const int Height = 2480;
+private const int Iterations = 1;

fiigii commented 6 years ago

tannergooding commented 6 years ago

Looks like glibc recently (07 AUG 2017) made a few changes: https://sourceware.org/git/?p=glibc.git;a=commit;h=57a72fa3502673754d14707da02c7c44e83b8d20

Namely, they still use the IBM Accurate Mathematical Library as their root source code, however, they now have some new logic which additionally compiles that code with the -mfma and -mavx2 flags, which provides some automatic transformations/optimizations (it looks like they do a cached CPUID check at runtime and jump to the appropriate code).

Additionally, it looks like, since the calling conventions map up, they generally end up calling libm-2.27.so~__pow directly, rather than having an intermediate call through COMDouble::Pow.

CC. @CarolEidt, @AndyAyersMS, @jkotas

roterdam commented 6 years ago

Does CoreCLR not JIT methods with the platform calling convention?

tannergooding commented 6 years ago

@roterdam, it does. However, the backing implementations for most System.Math functions aren't managed code, they are "FCALLs" to the underlying C Runtime implementation (through the COMDouble and COMSingle classes).

AaronRobinsonMSFT commented 5 years ago

@tannergooding Were you able to reproduce this issue? Is this something we can do or do we need to loop in the VC++ team?

tannergooding commented 5 years ago

@AaronRobinsonMSFT. Yes, I was able to reproduce this.

I believe this is already tracked by one of the C++ bugs I logged internally, but I will double-check and log a new one if not.

This is also part of a bigger picture with System.Math/MathF that is being tracked internally. I can share more details offline if necessary.

AaronRobinsonMSFT commented 5 years ago

@tannergooding Not necessary. My main goal was simply to set milestones for issues tagged with VM. If this is something that is post-3.0, then feel free to tag as needed. If 3.0, do we have a plan to deliver it?

ghost commented 4 years ago

Tagging subscribers to this area: @tannergooding Notify danmosemsft if you want to be subscribed.

dotnet / runtime

CRT Pow function has bad performance on Windows #10798

Windows

Linux