dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/

decimal performance regression on x86 for netcore 3 #12286

Open Daniel-Svensson opened 5 years ago

Daniel-Svensson commented 5 years ago

It seems decimal arithmetic performance on x86 has decreased significantly (~50%-80% slower in simple cases, and more in others) since the move to the managed implementation (I believe the relevant PR is https://github.com/dotnet/coreclr/pull/18948). x64 performance, on the other hand, seems to have increased almost as much as with the improved code proposed some time back.

The benchmark is available at https://github.com/Daniel-Svensson/ClrExperiments/tree/master/ClrDecimal/Benchmarks/Scenarios/InterestBenchmark.cs

The first approach, which used too-large numbers, is available at https://github.com/Daniel-Svensson/ClrExperiments/blob/978343a887c305de931a2caef442af89297a7362/ClrDecimal/Benchmarks/Scenarios/InterestBenchmark.cs, with results here.

It is a simplified version of part of real-world code from a financial application. In essence, it just increases the interest on an account based on the current interest rate, similar to the code below, for 100 000 items.

```csharp
var dayFactor = 1m / array[i].DaysInYear;
array[i].NewInterest = array[i].CurrentInterest + array[i].Amount * (array[i].CurrentInterestRate * dayFactor);
```
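
For context, here is a self-contained sketch of that workload; the `Account` struct and its field names are inferred from the snippet above, not taken from the real benchmark in the linked repository:

```csharp
// Hypothetical Account struct, inferred from the snippet above.
struct Account
{
    public decimal Amount;
    public decimal CurrentInterest;
    public decimal CurrentInterestRate;
    public decimal NewInterest;
    public int DaysInYear;
}

static class InterestCalculator
{
    // The benchmark runs this over 100 000 items.
    public static void AccrueInterest(Account[] array)
    {
        for (int i = 0; i < array.Length; i++)
        {
            // 1m / 365 already occupies all 28 fractional digits, so every
            // operation after this point runs on "full precision" decimals.
            var dayFactor = 1m / array[i].DaysInYear;
            array[i].NewInterest = array[i].CurrentInterest
                + array[i].Amount * (array[i].CurrentInterestRate * dayFactor);
        }
    }
}
```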

The actual workload does valuation of different financial instruments (loans, etc.) and needs to use the forecasted interest rate (which in turn is interpolated between different values, so the forecasted/expected interest rate will not be rounded). The actual workload performs many more operations on the result of the "interest" calculation, all of which execute at "full precision" in the large majority of cases.

Measurements

Results for different runtimes can be found under https://github.com/Daniel-Svensson/ClrExperiments/tree/master/ClrDecimal/Benchmarks/results

Performance in netcore 3

Updated 2018-06-13


BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.379 (1809/October2018Update/Redstone5)
Intel Core i5-2500K CPU 3.30GHz (Sandy Bridge), 1 CPU, 4 logical and 4 physical cores
Frequency=14318180 Hz, Resolution=69.8413 ns, Timer=HPET
.NET Core SDK=3.0.100-preview3-010431
  [Host] : .NET Core ? (CoreCLR 4.6.27422.72, CoreFX 4.7.19.12807), 32bit RyuJIT

Job=ShortRun  Toolchain=InProcessToolchain  IterationCount=3  
LaunchCount=1  WarmupCount=3  
| Method | RoundedAmounts | SmallDivisor | Count | Mean | Error | StdDev |
|----------------------|-------|-------|--------|----------|------------|-----------|
| System.Decimal | False | False | 100000 | 50.42 ms | 11.8249 ms | 0.6482 ms |
| 'P/Invoke New C++' | False | False | 100000 | 33.27 ms | 8.6233 ms | 0.4727 ms |
| 'P/Invoke oleauto32' | False | False | 100000 | 30.25 ms | 0.9500 ms | 0.0521 ms |
| System.Decimal | False | True | 100000 | 46.80 ms | 13.5972 ms | 0.7453 ms |
| 'P/Invoke New C++' | False | True | 100000 | 30.32 ms | 0.1360 ms | 0.0075 ms |
| 'P/Invoke oleauto32' | False | True | 100000 | 28.78 ms | 1.9610 ms | 0.1075 ms |
| System.Decimal | True | False | 100000 | 37.30 ms | 10.4275 ms | 0.5716 ms |
| 'P/Invoke New C++' | True | False | 100000 | 29.47 ms | 0.1789 ms | 0.0098 ms |
| 'P/Invoke oleauto32' | True | False | 100000 | 26.36 ms | 6.1196 ms | 0.3354 ms |
| System.Decimal | True | True | 100000 | 32.52 ms | 5.5799 ms | 0.3059 ms |
| 'P/Invoke New C++' | True | True | 100000 | 26.54 ms | 0.1219 ms | 0.0067 ms |
| 'P/Invoke oleauto32' | True | True | 100000 | 24.07 ms | 0.0689 ms | 0.0038 ms |

Some clarifications

Net core 2.2

Important: these results are from the first version of the benchmark, so the scales differ.

Performance in .NET Core 2.2 is very similar to .NET Framework.


BenchmarkDotNet=v0.11.4, OS=Windows 10.0.17763.379 (1809/October2018Update/Redstone5)
Intel Core i5-2500K CPU 3.30GHz (Sandy Bridge), 1 CPU, 4 logical and 4 physical cores
Frequency=14318180 Hz, Resolution=69.8413 ns, Timer=HPET
.NET Core SDK=3.0.100-preview3-010431
  [Host] : .NET Core ? (CoreCLR 4.6.27414.05, CoreFX 4.6.27414.05), 32bit RyuJIT

Job=MediumRun  Toolchain=InProcessToolchain  IterationCount=15  
LaunchCount=2  WarmupCount=10  
| Method | RoundedAmounts | SmallDivisor | Count | Mean | Error | StdDev | Median |
|----------------|-------|-------|--------|-----------|-----------|-----------|-----------|
| System.Decimal | False | False | 100000 | 38.041 ms | 0.2255 ms | 0.3305 ms | 38.072 ms |
| Oleauto32 | False | False | 100000 | 30.787 ms | 0.1140 ms | 0.1635 ms | 30.746 ms |
| System.Decimal | False | True | 100000 | 35.373 ms | 0.2204 ms | 0.3090 ms | 35.327 ms |
| Oleauto32 | False | True | 100000 | 29.285 ms | 0.2590 ms | 0.3631 ms | 29.084 ms |
| System.Decimal | True | False | 100000 | 34.526 ms | 0.1504 ms | 0.2205 ms | 34.426 ms |
| System.Decimal | True | True | 100000 | 31.612 ms | 0.2084 ms | 0.3119 ms | 31.511 ms |
| New | True | True | 100000 | 29.316 ms | 0.1713 ms | 0.2286 ms | 29.147 ms |
Daniel-Svensson commented 5 years ago

@jkotas , @pentp you were part of the discussion in dotnet/runtime#7778

As I tried to communicate there, I would be happy to port my performance improvements from the C++ code (with up to over 300% improvement for x64 and over 100% improvement for x86 worst-case performance) to C#, but would need some intrinsic support for

pentp commented 5 years ago

Is the NetFramework result the one for .NET Core?

The test data looks a bit contrived (interest calculation for 100 trillion, maybe in Venezuela only?). I focused optimizations on more realistic numbers (weighted towards smaller numbers with fewer decimal places).

The setup method Random does not return a value between min/max (it should return rnd * range + min).
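
A minimal sketch of the fix being suggested; the helper name and signature are assumptions, only the `rnd * range + min` formula comes from this comment:

```csharp
using System;

static class BenchmarkSetup
{
    // Hypothetical range helper: scale and offset Random.NextDouble()'s
    // [0, 1) output into [min, max).
    public static decimal NextDecimal(Random rnd, decimal min, decimal max)
    {
        decimal range = max - min;
        return (decimal)rnd.NextDouble() * range + min;  // rnd * range + min
    }
}
```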

Daniel-Svensson commented 5 years ago

> Is the NetFramework result the one for .NET Core?

Yes, I will try to change it to System.Decimal when I'm at a computer.

> The test data looks a bit contrived (interest calculation for 100 trillion, maybe in Venezuela only?). I focused optimizations on more realistic numbers (weighted towards smaller numbers with fewer decimal places).

The input can probably be divided by 100 for a slightly more realistic case; millions and sometimes billions are not so uncommon for a large corporation (in non-USD/EUR currencies). But if I remember correctly it will not make any large difference, since the day factor produces "full precision" numbers and the example is really just the starting point for many other calculations.

> The setup method Random does not return a value between min/max (it should return rnd * range + min).

Thanks for the input, I will look into that at a later point.

danmoseley commented 5 years ago

@tannergooding can you take a look at this problem and share your thoughts?

pentp commented 5 years ago

I think the benchmark needs some adjustments:

Currently this is testing the most expensive paths for decimal: all 96 bits / 28 decimal places used, plus very high scaling (20+).
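
To illustrate (my example, not code from the thread): a single division by the day count is enough to hit that path, which `decimal.GetBits` makes visible:

```csharp
using System;

decimal dayFactor = 1m / 365;           // 0.0027397260273972602739726027
int[] bits = decimal.GetBits(dayFactor);
int scale = (bits[3] >> 16) & 0xFF;     // the scale lives in bits 16-23 of the flags word
Console.WriteLine(dayFactor);           // all 28 fractional digits are in use
Console.WriteLine(scale);               // prints 28
```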

danmoseley commented 5 years ago

@tannergooding thoughts? Is this needed for 3.0?

tannergooding commented 5 years ago

I would agree with @pentp. The benchmark doesn't appear to be representative of likely real-world scenarios (the range of the inputs/interest rates is too large), and I would like to see it updated and new numbers posted.

I don't think there is anything particularly pressing here for 3.0, although it would be good to validate that we haven't egregiously regressed x86 performance.

danmoseley commented 5 years ago

@tannergooding wrt regressions, the key thing is to make sure there are appropriate microbenchmarks in the dotnet/performance repo. If so, we will automatically catch any regressions in our next pass, even if the benchmark didn't exist in the last release timeframe. (And perhaps this can be closed.)
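
For illustration, a hedged sketch of what such a microbenchmark could look like; the class name, method, and constants are hypothetical, not actual dotnet/performance code:

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class DecimalFullPrecisionBenchmarks
{
    private decimal _amount = 1_234_567.89m;
    private decimal _rate = 0.0425m;

    [Benchmark]
    public decimal FullPrecisionMulAdd()
    {
        // Dividing by the day count forces full-precision operands,
        // the case this issue reports as regressed on x86.
        decimal dayFactor = 1m / 365;
        return _amount + _amount * (_rate * dayFactor);
    }
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<DecimalFullPrecisionBenchmarks>();
}
```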

pentp commented 5 years ago

The change to managed decimal was done with careful performance measurements and some regressions for 32-bit were deemed acceptable (the unmanaged implementation was optimized for 32-bit). For 64-bit there are only a couple of odd corner cases that are slower, but real world use cases should actually see significant improvement.

Daniel-Svensson commented 5 years ago

I did some updates to the benchmarks some time ago, but it seems my update post did not make it here. I have updated the description with the results from the benchmarks with much smaller numbers.

The important thing about this simplified example is that it is far from the worst-case scenario, yet it still triggers "bad" performance.

As soon as you start dividing numbers or taking input from doubles, you will very quickly get "full precision" numbers and performance regressions, as the sketch below illustrates. There will certainly be many types of workloads where full-precision numbers are the majority, not the exception.
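
An illustration of that claim (my example, an assumption rather than code from the thread): double-to-decimal conversion rounds to 15 significant digits, and scales add on multiplication, so two operations are enough to reach the 28-digit cap.

```csharp
using System;

decimal a = (decimal)(1.0 / 3.0);   // 0.333333333333333 (scale 15)
decimal b = a * a;                  // scales add on multiply, capped at 28

// Read the scale from bits 16-23 of the flags word returned by GetBits.
Console.WriteLine((decimal.GetBits(a)[3] >> 16) & 0xFF);  // 15
Console.WriteLine((decimal.GetBits(b)[3] >> 16) & 0xFF);  // 28
```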

Since I believe that worst/bad-case performance is also important, I really think you should measure those cases too and not assume that almost all input is limited to 32-bit precision.

ghost commented 4 years ago

Tagging subscribers to this area: @tannergooding. Notify danmosemsft if you want to be subscribed.