Open fiigii opened 6 years ago
@fiigii, do you have any metrics for what time is spent in ucrtbase.dll
(as well as RayTracer.dll
and libm-2.23.so
)? I wonder if the metrics are partially "skewed" due to different inlining characteristics of the math libraries/etc.
@tannergooding Here is the VTune data of CRT
The CPI of pow
on Windows looks healthy, the issue may be from its algorithm/implementation.
cc @AndyAyersMS @CarolEidt @jkotas
the issue may be from its algorithm/implementation.
@fiigii, that would be a bit surprising. I'm looking at the implementation and it is some fairly heavily optimized FMA3 code (there is also an SSE2 code path, but you shouldn't hit that).
Unfortunately, that implementation is closed source, so I can't share it here.
I'm trying to collect a trace locally as well, to see if I get the same.
is some fairly heavily optimized FMA3 code
Right, I saw it also from disasm.
@fiigii, how was CoreCLR compiled for you?
I'm testing on only a i7-7700 @ 3.6 GHz
and the run completes in just 1.231s (as compared to the nearly 6s the above shows) -- This is for Baseline (AoS))
I'm testing on only a i7-7700 @ 3.6 GHz and the run completes in just 1.231s (as compared to the nearly 6s the above shows) -- This is for Baseline (AoS))
I changed the image size from 250x250 to 2480x2480 and just rendered one image (not using RenderLoop) to avoid the collection pool imapct for the profling. https://github.com/dotnet/coreclr/blob/master/tests/src/JIT/Performance/CodeQuality/SIMD/RayTracer/RayTracerBench.cs#L33-L34 Sorry for the unclarity.
just rendered one image (not using RenderLoop) to avoid the collection pool imapct for the profling.
So, to clarify, you changed just the following?:
-private const int Width = 250;
-private const int Height = 250;
-private const int Iterations = 7;
+private const int Width = 2480;
+private const int Height = 2480;
+private const int Iterations = 1;
Similar, but I did not use ObjectPool
https://github.com/fiigii/PacketTracer/blob/master/baseline/RayTracer/RayTracerBench.cs#L120-L140
Looks like glibc recently (07 AUG 2017) made a few changes: https://sourceware.org/git/?p=glibc.git;a=commit;h=57a72fa3502673754d14707da02c7c44e83b8d20
Namely, they still use the IBM Accurate Mathematical Library
as their root source code, however, they now have some new logic which additionally compiles that code with the -mfma
and -mavx2
flags, which provides some automatic transformations/optimizations (it looks like they do a cached CPUID check at runtime and jump to the appropriate code).
Additionally, it looks like, since the calling conventions map up, they generally end up calling libm-2.27.so~__pow
directly, rather than having an intermediate call through COMDouble::Pow
.
CC. @CarolEidt, @AndyAyersMS, @jkotas
Does CoreCLR not JIT methods with the platform calling convention?
@roterdam, it does. However, the backing implementations for most System.Math
functions aren't managed code, they are "FCALLs" to the underlying C Runtime implementation (through the COMDouble
and COMSingle
classes).
@tannergooding Were you able to reproduce this issue? Is this something we can do or do we need to loop in the VC++ team?
@AaronRobinsonMSFT. Yes, I was able to reproduce this.
I believe this is already tracked by one of the C++ bugs I logged internally, but I will double-check and log a new one if not.
This is also part of a bigger picture with System.Math/MathF that is being tracked internally. I can share more details offline if necessary.
@tannergooding Not necessary. My main goal was simply to set milestones for issues tagged with VM. If this is something that is post-3.0, then feel free to tag as needed. If 3.0, do we have a plan to deliver it?
Tagging subscribers to this area: @tannergooding Notify danmosemsft if you want to be subscribed.
During benchmarking AoS/SoA ray-tracer https://github.com/dotnet/coreclr/pull/18839, we found that the
Vector3
benchmark (RayTracer) is much slower on Windows than Linux.According to VTune analysis, this gap is caused by the CRT math library, which RayTracer uses
Math.Pow
at https://github.com/dotnet/coreclr/blob/master/tests/src/JIT/Performance/CodeQuality/SIMD/RayTracer/Raytracer.cs#L153Windows
Linux
On the left side (AoS means RayTracer), we can see
ucrtbase.dll
on Windows has much more time consuming and instruction retired thanlibm-2.23.so
on Linux.The data is collected on Core i9 + VS2017, but Core i7+ VS2015 has the same performance gap.