dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.97k stars 4.65k forks

Improve System.Decimal performance for x64 platform #7778

Closed Daniel-Svensson closed 4 years ago

Daniel-Svensson commented 7 years ago

The current implementation forwards calls to the Windows implementation of "decimal" (VarDecAdd, VarDecSub, VarDecMul) or to the port of these functions found under palrt (as far as I understand the code).

These methods are optimised for 32-bit platforms; by using 64-bit instructions it is possible to improve performance significantly.

I have written a small proof of concept to illustrate the gains. I am looking forward to integrating the code with coreclr, but I have some questions and want feedback on how best to integrate it before submitting a PR with those methods.

Questions:

Proof of Concept

I have created x64-aware methods for the arithmetic operations Add, Sub, Mul and Div based on the current code in coreclr. There are no real changes to the algorithms or other logic apart from changing 32-bit arithmetic to 64-bit (and the consequences of that), and using some intrinsics for bit scanning and carry propagation.
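The kinds of 64-bit operations involved can be sketched as follows. This is an illustrative sketch, not code from the actual PoC; it uses GCC/Clang's `unsigned __int128` and builtins as portable stand-ins for the MSVC intrinsics such as `_umul128`, `_addcarry_u64` and `_BitScanReverse64` that an x64 build would typically use:

```cpp
#include <cstdint>

// 64x64 -> 128-bit multiply. MSVC exposes this as the _umul128 intrinsic;
// GCC/Clang lower unsigned __int128 to a single MUL instruction on x64.
static inline uint64_t Mul64x64To128(uint64_t a, uint64_t b, uint64_t* hi)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *hi = (uint64_t)(p >> 64);
    return (uint64_t)p;
}

// Add with carry out: the pattern the ADC instruction / _addcarry_u64 covers.
static inline uint64_t AddCarry64(uint64_t a, uint64_t b, int* carryOut)
{
    uint64_t sum;
    *carryOut = __builtin_add_overflow(a, b, &sum);
    return sum;
}

// Index of the highest set bit (what _BitScanReverse64 / BSR computes).
// Undefined for v == 0, just like the hardware instruction.
static inline int HighestSetBit(uint64_t v)
{
    return 63 - __builtin_clzll(v);
}
```

Working on full 64-bit words like this halves the number of limbs a 96-bit decimal mantissa needs, which is where most of the algorithmic savings come from.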

This is a summary of the measurements from the example project, which can be found at: https://github.com/Daniel-Svensson/ClrExperiments/tree/master/ClrDecimal/coreclrtesting

Measurement

https://github.com/Daniel-Svensson/ClrExperiments/blob/master/ClrDecimal/coreclrtesting/main.cpp

In short, the program generates a number of "semi random" inputs where different numbers of bits are set. It then calls the method under test for all combinations of the inputs.
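That style of input generation can be sketched roughly like this (a hypothetical illustration, not the actual harness, which is in the main.cpp linked above): pick a bit width, then fill the low bits randomly while pinning the top bit so every width from 1 to 64 bits is actually exercised.

```cpp
#include <cstdint>
#include <random>

// Illustrative sketch only: produce a "semi random" 64-bit value whose
// highest set bit is exactly at position bits-1, so test inputs sweep all
// operand widths instead of clustering around full-width values.
static uint64_t RandomWithBitWidth(std::mt19937_64& rng, int bits)
{
    if (bits <= 0)
        return 0;
    uint64_t mask = (bits >= 64) ? ~0ULL : ((1ULL << bits) - 1);
    return (rng() & mask) | (1ULL << (bits - 1)); // pin the top bit
}
```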

Results:

See the results folder (https://github.com/Daniel-Svensson/ClrExperiments/tree/master/ClrDecimal/coreclrtesting/results) for complete output results. I have tried to summarize the results for both a Core i5 2500K and an i7 6700K below. The i5 results were produced with a few minor changes but otherwise the same code; the biggest performance spread is in some of the division tests.

The results below are against the oleaut32 implementation; when compared against the implementations in palrt the results are very similar, since those timings are within a few percent of the oleaut32 timings.

Multiply

Add / Sub

Div

Speedup range: 10-270%. For mixed input (all 00...111 bitpatterns, with all scales and signs): ~100%

| Measurement | Speedup |
| --- | --- |
| 32 x 32 bit | >50% |
| 32 x 32 bit with scale | ~37% |
| 64 x 64 bit no scale | 109-118% |
| 64 x 64 bit varying scale | 94- |
| 96 x 96 bit | >102% |
danmoseley commented 7 years ago

@janvorli @mellinoe

janvorli commented 7 years ago

@Daniel-Svensson thank you for looking into this! The numbers from your experiments look really great.

As for your questions:

Can these improvements make their way to the desktop CLR?

It would first have to be tested whether these changes improve desktop as well. Since on desktop we are calling into OLEAUT32, it is possible that the code in there has already been optimized for 64 bits. @jkotas, I guess that if the perf gains are similar on Windows, we would want to port the change to desktop, right?

Which methods are most common? I would assume that, apart from the basic arithmetic operations, conversions to/from text as well as double/int can be quite common.

It really depends on the actual applications. Decimals are quite special.

Which methods have the most to gain from this? I have focused on +, -, / and * for now, but there might be some other low-hanging fruit.

I would stay with these four operations. If you look at the list of methods on decimal implemented by the native runtime (https://github.com/dotnet/coreclr/blob/master/src/vm/ecalllist.h#L834-L849), you'll see that they all use the VarDecXXX functions that you've already optimized.

How to best integrate it into the coreclr code? I am thinking along the lines of keeping an _x64 suffix on these methods (VarDecAdd_x64) and only including the code for x64 platforms. This would be coupled with a macro to redefine VarDecAdd as VarDecAdd_x64 to route all calls to the x64 implementation.

I wonder if we could somehow unify the code for 32 and 64 bit. What if we implemented the 64 bit intrinsics for 32 bit just in C and see if the perf stays the same for 32 bits. Just from looking at a sample of your functions, it looks like it might be possible. If we cannot do that, I would recommend simply creating a new file called e.g. decarith64.cpp and put all the implementation in there. Then the build would use decarith.cpp for 32 bit builds and decarith64.cpp for 64 bit builds. If there are common functions that share implementation, then we could have three of these (e.g. decarith.cpp, decarith32.cpp and decarith64.cpp).
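The idea of expressing the 64-bit "intrinsic" in plain C so the same source compiles on 32-bit targets could look roughly like the sketch below (the function name is made up for illustration; this is the classic schoolbook decomposition into 32x32 partial products, not code from the actual patch):

```cpp
#include <cstdint>

// Portable 64x64 -> 128-bit multiply built only from 32x32 -> 64-bit
// multiplies, so it compiles for 32-bit targets; on x64 a compiler is
// free to collapse it, or the intrinsic version can be substituted.
static uint64_t Mul64By64To128Portable(uint64_t a, uint64_t b, uint64_t* hi)
{
    uint64_t aLo = (uint32_t)a, aHi = a >> 32;
    uint64_t bLo = (uint32_t)b, bHi = b >> 32;

    uint64_t ll = aLo * bLo;   // four 32x32 -> 64-bit partial products
    uint64_t lh = aLo * bHi;
    uint64_t hl = aHi * bLo;
    uint64_t hh = aHi * bHi;

    // middle column: cannot overflow 64 bits (at most three 32-bit terms)
    uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;

    *hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    return (mid << 32) | (uint32_t)ll;
}
```

Measuring this variant on x86 against the existing hand-tuned 32-bit paths would show whether unification costs anything there.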

I am not really proficient with cmake, so if I run into problems with the integration it would be great if someone was willing to help.

I'd be happy to give you guidance on the cmake stuff.

janvorli commented 7 years ago

I also have a question - will the intrinsics used in your changes work for Unix ARM64?

jkotas commented 7 years ago

@jkotas, I guess that if the perf gains are similar on Windows, we would want to port the change to desktop, right?

Desktop is calling the implementation of these methods in oleaut32. I doubt that fixing an oleaut32 perf problem by duplicating its implementation would fly for desktop. It would most likely turn into a suggestion to Windows to fix the perf problem in oleaut32.

Ideally, we would have fully managed implementations of these methods in C# in .NET Core. There is a C# implementation of these methods in CoreRT: https://github.com/dotnet/corert/blob/master/src/System.Private.CoreLib/src/System/Decimal.DecCalc.cs . I would recommend focusing the optimization effort on this C# implementation, and once it is clearly better than what we have in CoreCLR today, switching CoreCLR to use it as well.

janvorli commented 7 years ago

Ah ok, I hadn't realized we already have a managed implementation in CoreRT. So I agree with @jkotas that we should focus on improving the managed implementation. @Daniel-Svensson, your measurements would give a good estimate of how far we could possibly get.

Daniel-Svensson commented 7 years ago

Just some quick replies; I will try to get some time this weekend to address most of the other comments.

Coreclr and managed implementations

@janvorli @jkotas

Regarding focusing on the managed version: being able to use the full width of 64-bit instructions, that is 64-bit multiply with a 128-bit result and division of 128 bits by 64 bits, was key to the performance improvements. That allowed most of the reduction in logic and cut the expected number of loop iterations in half. Without that, I expect it is possible to get at most double-digit improvements.

It should be possible to get a few percentage points of improvement if we somehow get access to bit-scanning operations, but without 128-bit results I see no way to come anywhere close to most of these numbers.

ARM64

I don't think the intrinsics work as they are right now for ARM64, but most of them should hopefully be easy to fix. There are currently a few other assumptions embedded in the code, such as being able to cast DWORD64 to DWORD32 and back, and having index 2 (and 0, etc.) in a 32-bit array refer to the lower 32 bits of the corresponding 64-bit word.

That said, the code can probably be used as a basis for ARM64. The corresponding instructions (add, add with carry, etc.) as well as multiplication of two unsigned 64-bit words producing a 128-bit result (two instructions: UMULL, UMULH) seem to be available. I am however unsure whether there is an efficient way to do a "128 by 64" divide on ARM64 (a 64-bit divide with "carry").
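Where no hardware 128-by-64 divide exists, a portable fallback would be something like the classic restoring shift-subtract loop below. This is an illustrative sketch under the usual precondition `hi < divisor` (so the 64-bit quotient cannot overflow), not code from the prototype; production implementations typically use faster reciprocal-based routines instead of one iteration per bit.

```cpp
#include <cstdint>

// Divide the 128-bit value (hi:lo) by a 64-bit divisor, one quotient bit
// per iteration. Requires hi < divisor so the quotient fits in 64 bits.
static uint64_t Div128By64(uint64_t hi, uint64_t lo, uint64_t divisor, uint64_t* rem)
{
    uint64_t quotient = 0;
    uint64_t r = hi;                    // invariant: r < divisor
    for (int i = 63; i >= 0; --i) {
        int carry = (int)(r >> 63);     // bit shifted out past 64 bits
        r = (r << 1) | ((lo >> i) & 1); // bring down the next dividend bit
        quotient <<= 1;
        if (carry || r >= divisor) {    // carry means the true r exceeds 2^64
            r -= divisor;               // wrapping subtract is still correct
            quotient |= 1;
        }
    }
    *rem = r;
    return quotient;
}
```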

source: https://www.element14.com/community/servlet/JiveServlet/previewBody/41836-102-1-229511/ARM.Reference_Manual.pdf

Unifying 32 and 64bit

A reasonable amount of the code can probably be shared, such as some of the helper methods, but the impact in 32-bit mode must be measured. While it would be possible to just implement some of the instructions as functions on 32-bit/(non-x86) platforms, it is quite hard to tell beforehand what the performance impact would be.

The original implementation seems quite optimized for 32-bit, with more short-circuits and special cases than required in the prototype. I do however think it might be worth measuring the impact of unifying some of the methods; they will not be optimal for x86, but it is possible that some may improve or have only a very small impact.

Mixed comments

Since on desktop, we are calling into OLEAUT32, it is possible that the code in there was already optimized for 64 bits.

My numbers are primarily measurements against the oleaut32 implementations on 64-bit Windows (with coreclr/palrt as separate measurements against the code in decimal/decarith.cpp). I was assuming that https://github.com/dotnet/coreclr/blob/master/src/classlibnative/bcltype/decimal.cpp is the code that https://github.com/dotnet/coreclr/blob/master/src/vm/ecalllist.h#L834-L849 refers to. So it does not look like they are optimized in Windows 10 Anniversary, and I doubt that will change soon.

Desktop is calling the implementation of these methods in oleaut32. I doubt that fixing a oleaut32 perf problem by duplicating its implementation would fly for desktop. It would most likely turn into suggestion to Windows to fix the perf problem in oleaut32.

I would love to have the code integrated to get the benefits, but Windows is not exactly OSS, so that seems like a dead end to me. I have seen a few requests to have the performance tuned for x64, with maybe the first dating back to the days of Windows Vista. However, having the code in Windows would only improve the Windows desktop while leaving the other platforms behind.

Where to insert the code (if anywhere), and additional questions

  1. Where to update: I guessed that the correct place would be to update classlibnative/bcltype; I see no other way for the changes to take effect on Windows otherwise. palrt and decarith*.cpp sound like they are only used for some platforms, and I did not see any project reference those files when compiling with VS 2017 on Windows.
  2. Decimal divide: Currently there are two variants of division in decimal.cpp (DoDivideThrow and DoDivide) as well as VarDecDiv in decarith.cpp (palrt). a) Do you know why there are so many different implementations (the first two do not seem to perform any better than VarDecDiv)? b) What about rewriting DoDivideThrow and DoDivide so that they both utilize VarDecDiv instead?
Daniel-Svensson commented 7 years ago

I wonder if we could somehow unify the code for 32 and 64 bit. What if we implemented the 64 bit intrinsics for 32 bit just in C and see if the perf stays the same for 32 bits.

@janvorli I tried the approach and it seems to work fine after some minor tweaking. I now have mostly unified code for x64 and 32bit (with and without x86 asm).

I also have a question - will the intrinsics used in your changes work for Unix ARM64?

Not directly. It should be possible to implement at least most of them quite efficiently for both ARM64 and ARM32.

I would love to have the code integrated to get the benefits, but Windows is not exactly OSS so that seems like a dead end to me.

Feel free to contact me if anybody is interested in using my work to improve the Windows/"oleaut" implementation.

Updated measurements

I have made very few tweaks to the 32-bit implementation, so there might be some additional easy things to do to increase performance. These measurements were made on an Intel Core i5 2500K; other architectures will probably show slightly different numbers.

| Mul | x64 - previous version | x64 | x86 | w32 - no asm |
| --- | --- | --- | --- | --- |
| compared against | oleaut | oleaut | coreclr | coreclr |
| 32 x 32 bit | >17% | 14% | -1% | -2% |
| 32 x 32 bit with scale | >127% | 126% | 4% | -5% |
| 64 x 64 -> 64 bit result | >195% | >195% | 55% | 43% |
| 64 x 64 -> 128 bit result | 144% | 137% | 26% | 25% |
| 64 x 64 -> 128 bit result with rescale | 124% | 158% | 67% | 11% |
| 96 x 96 bit with overflow | >270% | 219% | 5% | 6% |
| 96 x 96 bit no overflow | >210% | 279% | 117% | 10% |
| Mixed input (all 00...111 bitpatterns, with all scales and signs) | ~160% | 226% | 99% | 8% |

| Add | x64 - previous version | x64 | x86 | w32 - no asm |
| --- | --- | --- | --- | --- |
| 32 x 32 bit | 11% | ~9% | 20% | 17% |
| 32 x 32 bit with scale | 21% | ~30% | 6% | 7% |
| 64 + 32 bit varying scale | 60% | 82% | 5% | -3% |
| 64 x 64 bit no scale | 31% | 31% | 20% | 17% |
| 64 x 64 bit varying scale | 42% | 36% | 8% | 9% |
| 96 x 96 bit | 67-90% | 105-128% | 70-78% | 17% |
| Mixed input (all 00...111 bitpatterns, with all scales and signs) | 110% | 153% | 70% | 14% |

| Sub | x64 - previous version | x64 | x86 | w32 - no asm |
| --- | --- | --- | --- | --- |
| 32 x 32 bit | 1% | 4% | 3% | 0% |
| 32 x 32 bit with scale | 20% | 30% | 5% | 6% |
| 64 + 32 bit varying scale | 64% | 85% | 4% | -2% |
| 64 x 64 bit no scale | 1% | 4% | 2% | 0% |
| 64 x 64 bit varying scale | 36% | 37% | 6% | 8% |
| 96 x 96 bit | 70-97% | 110-132% | 65-75% | 18% |
| Mixed input (all 00...111 bitpatterns, with all scales and signs) | 110% | 153% | 69% | 14% |

Where to go from here?

I would prefer to try to update the current implementation and see what the final perf impact would be.

janvorli commented 7 years ago

@Daniel-Svensson I am really sorry for not responding for such a long time. I have seen your update and wanted to discuss it further with @jkotas, but I have completely forgotten about it. These improvements are huge. @jkotas can you please take a look at the updated results above? My thinking here is that as a next step it would be great to measure the performance of the managed implementation @jkotas has referred to and add it to the tables above to get an idea on how it stands and if it looks like there is a chance to get close to these numbers with managed implementation.

Daniel-Svensson commented 7 years ago

@janvorli @jkotas I am in no hurry with this issue, but I am hoping to finish it this year. There is some minor cleanup left to do, but I am not working much on this at the moment. I have however updated the divide implementation to also support 32-bit code and might post updated performance numbers within the next few weeks, given a rainy day.

It should be possible to add the managed implementation to the benchmark project in order to compare it.

I've added a managed C++ wrapper for the new code and added a few simple benchmarks as a start.

New divide measurements

| Div | x64 - previous | x64 | x86 | x86 - no asm |
| --- | --- | --- | --- | --- |
| compared against | oleaut | oleaut | coreclr | coreclr |
| 32 x 32 bit | 50% | 84% | 49% | 3% |
| 32 x 32 bit with scale | 37% | 70% | 40% | 0% |
| 64 x 32 mix and varying scale | - | 144% | 58% | 15% |
| 64 x 64 bit no scale | 110-118% | 117% | 36% | 1% |
| 64 x 64 bit varying scale | 94% | 110% | 34% | 0% |
| 96 x 96 bit small scale | >=102% | 154% | 45% | 11% |
| 96 x 96 bit large scale | >=102% | 149% | 44% | 10% |
| Mixed input (all 00...111 bitpatterns, with all scales and signs) | 100% | 132% | 34% | 7% |

Both x86 implementations are significantly slower than the Windows version in oleaut32, which should be used for RyuJIT x86 (it seems to use the coreclr version).

BenchmarkDotNet measurements for a single invocation

Update: removed benchmarks, since updated values can be found in the next post. "Method" explanation:

- NetFramework: Managed method in the C++ assembly that calls the System.Decimal methods
- Native: The new code
- Ole32: Calls the oleaut implementation (the current implementation in Windows). This should be the "same" code as the "NetFramework" implementation, but without error handling and with additional overhead for managed/native interop.

Summary for managed benchmarks

Daniel-Svensson commented 7 years ago

New Measurements for C# Benchmark

I've managed to add the decimal implementation from CoreRT as well as compile a custom coreclr build (commit 6747fa19a30992f9efbf3eab5130e25b80e79686), both with and without the code changes to the bclnative library.

Below follows measurements for

"Method" explanation:

The benchmarks being run are:

The C# benchmarks are currently nowhere near as exhaustive as the C++ variants, since they only measure a single invocation with constant values. The C# benchmarks can be made much better (feel free to add more tests :) ), but I think they at least highlight that the new native code would result in measurable performance gains.

Custom CoreCLR build (x64)

Both the benchmark dll and the CoreRT dll with the CoreRT decimal type have been crossgen'ed.


BenchmarkDotNet=v0.10.9, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i5-2500K CPU 3.30GHz (Sandy Bridge), ProcessorCount=4
Frequency=14318180 Hz, Resolution=69.8413 ns, Timer=HPET
.NET Core SDK=2.0.0-preview2-006497
  [Host] : .NET Core ? (Framework 4.6.00001.0), 64bit RyuJIT

Toolchain=InProcessToolchain  

Mul 96by96

| Method | Mean | Error | StdDev | Scaled | Build |
| --- | --- | --- | --- | --- | --- |
| NetFramework | 71.22 ns | 0.2254 ns | 0.2109 ns | 0.39 | modified code |
| Native | 82.22 ns | 0.2050 ns | 0.1918 ns | 0.45 | modified code |
| Ole32 | 181.63 ns | 0.9097 ns | 0.8509 ns | 1.00 | modified code |
| CoreCRTManaged | 174.09 ns | 0.4658 ns | 0.3890 ns | 0.96 | modified code |
| NetFramework | 176.53 ns | 0.1816 ns | 0.1699 ns | 0.97 | original code |
| Native | 83.01 ns | 0.0913 ns | 0.0854 ns | 0.46 | original code |
| Ole32 | 181.45 ns | 0.0797 ns | 0.0622 ns | 1.00 | original code |
| CoreCRTManaged | 175.01 ns | 0.2057 ns | 0.1823 ns | 0.96 | original code |

Div 96by96

| Method | Mean | Error | StdDev | Scaled | Build |
| --- | --- | --- | --- | --- | --- |
| NetFramework | 101.21 ns | 0.2884 ns | 0.2698 ns | 0.47 | modified code |
| Native | 95.39 ns | 0.0210 ns | 0.0164 ns | 0.44 | modified code |
| Ole32 | 215.59 ns | 0.8496 ns | 0.7947 ns | 1.00 | modified code |
| CoreCRTManaged | 281.16 ns | 1.4220 ns | 1.2605 ns | 1.30 | modified code |
| NetFramework | 207.28 ns | 0.1473 ns | 0.1377 ns | 0.95 | original code |
| Native | 98.60 ns | 0.0551 ns | 0.0515 ns | 0.45 | original code |
| Ole32 | 217.55 ns | 0.4650 ns | 0.4349 ns | 1.00 | original code |
| CoreCRTManaged | 281.43 ns | 2.7054 ns | 2.5306 ns | 1.29 | original code |

Note: it seems strange that the PInvoke method is faster than the NetFramework implementation when both perform the same amount of work; it might be that the file is compiled with different optimization settings in coreclr.

Add 96by96

| Method | Mean | Error | StdDev | Scaled | Build |
| --- | --- | --- | --- | --- | --- |
| NetFramework | 23.592 ns | 0.1448 ns | 0.1283 ns | 0.65 | modified code |
| Native | 36.217 ns | 0.0114 ns | 0.0095 ns | 1.00 | modified code |
| Ole32 | 36.098 ns | 0.3180 ns | 0.2975 ns | 1.00 | modified code |
| CoreCRTManaged | 65.480 ns | 0.4731 ns | 0.4425 ns | 1.81 | modified code |
| NetFramework | 20.111 ns | 0.0618 ns | 0.0578 ns | 0.56 | original code |
| Native | 35.983 ns | 0.0361 ns | 0.0338 ns | 0.99 | original code |
| Ole32 | 36.180 ns | 0.0254 ns | 0.0238 ns | 1.00 | original code |
| CoreCRTManaged | 65.792 ns | 0.1370 ns | 0.1214 ns | 1.82 | original code |

Note: it seems like there could be a perf regression for this specific pair; a larger set of numbers, such as those in the C++ project, would give more reliable measurements.

64bit results .NET 4.7


BenchmarkDotNet=v0.10.9, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i5-2500K CPU 3.30GHz (Sandy Bridge), ProcessorCount=4
Frequency=14318180 Hz, Resolution=69.8413 ns, Timer=HPET
  [Host]       : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2101.1
  Platform=X64  Runtime=Clr  

Mul 96by96

| Method | Mean | Error | StdDev | Scaled | Gen 0 | Allocated |
| --- | --- | --- | --- | --- | --- | --- |
| NetFramework | 174.16 ns | 0.0349 ns | 0.0326 ns | 0.97 | - | 0 B |
| Native | 74.06 ns | 0.0134 ns | 0.0125 ns | 0.41 | - | 0 B |
| Ole32 | 179.76 ns | 0.0462 ns | 0.0386 ns | 1.00 | - | 0 B |
| CoreCRTManaged | 165.21 ns | 2.3580 ns | 2.2057 ns | 0.92 | 0.0150 | 48 B |
| CoreCRTManaged2 | 163.85 ns | 0.0568 ns | 0.0503 ns | 0.91 | - | 0 B |

Div 96by96

| Method | Mean | Error | StdDev | Scaled | Gen 0 | Allocated |
| --- | --- | --- | --- | --- | --- | --- |
| NetFramework | 204.50 ns | 0.0171 ns | 0.0143 ns | 0.95 | - | 0 B |
| Native | 96.87 ns | 0.0309 ns | 0.0258 ns | 0.45 | - | 0 B |
| Ole32 | 215.84 ns | 0.0189 ns | 0.0147 ns | 1.00 | - | 0 B |
| CoreCRTManaged | 237.05 ns | 0.2486 ns | 0.2203 ns | 1.10 | 0.0505 | 160 B |
| CoreCRTManaged2 | 205.71 ns | 0.1383 ns | 0.1294 ns | 0.95 | - | 0 B |

Add 96by96

| Method | Mean | Error | StdDev | Scaled | Gen 0 | Allocated |
| --- | --- | --- | --- | --- | --- | --- |
| NetFramework | 15.789 ns | 0.0040 ns | 0.0036 ns | 0.53 | - | 0 B |
| Native | 26.448 ns | 0.0079 ns | 0.0066 ns | 0.90 | - | 0 B |
| Ole32 | 29.515 ns | 0.0017 ns | 0.0014 ns | 1.00 | - | 0 B |
| CoreCRTManaged | 65.021 ns | 0.1426 ns | 0.1264 ns | 2.20 | 0.0279 | 88 B |
| CoreCRTManaged2 | 63.226 ns | 0.1408 ns | 0.1248 ns | 2.14 | 0.0126 | 40 B |

32bit results


BenchmarkDotNet=v0.10.9, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i5-2500K CPU 3.30GHz (Sandy Bridge), ProcessorCount=4
Frequency=14318180 Hz, Resolution=69.8413 ns, Timer=HPET
  [Host]     : .NET Framework 4.7 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.7.2101.1
  DefaultJob : .NET Framework 4.7 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.7.2101.1

Mul 96by96

| Method | Mean | Error | StdDev | Scaled | ScaledSD |
| --- | --- | --- | --- | --- | --- |
| NetFramework | 62.82 ns | 0.5111 ns | 0.4781 ns | 0.86 | 0.01 |
| Native | 77.69 ns | 0.2698 ns | 0.2392 ns | 1.06 | 0.01 |
| Ole32 | 73.06 ns | 0.3531 ns | 0.3131 ns | 1.00 | 0.00 |
| CoreCRTManaged | 242.85 ns | 1.8053 ns | 1.6004 ns | 3.32 | 0.03 |

Div 96by96

| Method | Mean | Error | StdDev | Scaled | ScaledSD |
| --- | --- | --- | --- | --- | --- |
| NetFramework | 182.3 ns | 0.4134 ns | 0.3665 ns | 1.82 | 0.01 |
| Native | 139.4 ns | 1.4004 ns | 1.3099 ns | 1.39 | 0.02 |
| Ole32 | 100.1 ns | 0.7030 ns | 0.6576 ns | 1.00 | 0.00 |
| CoreCRTManaged | 363.6 ns | 2.5959 ns | 2.4282 ns | 3.63 | 0.03 |

Add 96by96

| Method | Mean | Error | StdDev | Scaled |
| --- | --- | --- | --- | --- |
| NetFramework | 20.111 ns | 0.0618 ns | 0.0578 ns | 0.56 |
| Native | 35.983 ns | 0.0361 ns | 0.0338 ns | 0.99 |
| Ole32 | 36.180 ns | 0.0254 ns | 0.0238 ns | 1.00 |
| CoreCRTManaged | 65.792 ns | 0.1370 ns | 0.1214 ns | 1.82 |

Summary/Conclusions

CoreRT managed implementation

pentp commented 7 years ago

Using only two constants for performance testing isn't sufficient - the CPU branch predictor will skip large parts of the code.

The current CoreRT implementation is a straightforward port from oleaut32 optimised for correctness and can definitely be improved. It allocates memory on every add/mul/div/round operation, uses 32bit math, etc. Like the proposed native variant, it could also benefit from runtime/JIT support for 64/128bit mul/div operations (maybe as private intrinsics).

Daniel-Svensson commented 7 years ago

@pentp thanks for the tip about optimizing the managed implementation. I've updated most of the implementation to remove allocations (but I did not find all of them) and updated the results for .NET 4.7.

As for using only constants: the intention was to see if the performance from the more rigorous C++ benchmarks translates all the way to C# code. It should be fairly straightforward to add new tests which, just as the C++ benchmarks, process many more numbers.

janvorli commented 7 years ago

@Daniel-Svensson by "decimal.cpp from coreRT" you meant "decimal.cs from coreRT", right?

Daniel-Svensson commented 7 years ago

Thanks for spotting that @janvorli, I've edited the comment.

pentp commented 7 years ago

Because I like (fast) managed code I made some basic optimisations to the CoreRT decimal code (no significant changes in algorithm structure) and measured using a weighted random distribution of decimals that might better represent real world use (gist is sorted for better overview). A total of 440^2=~190K combinations were tested (excluding overflows/div-by-0).


BenchmarkDotNet=v0.10.9, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i7-7700K CPU 4.20GHz (Kaby Lake), ProcessorCount=8
.NET Core SDK=2.0.0-preview2-006127
  [Host] : .NET Core ? (Framework 4.6.00001.0), 64bit RyuJIT

Toolchain=InProcessToolchain  
| ADD | Mean | Error | StdDev | Scaled | Gen 0 | Allocated |
| --- | --- | --- | --- | --- | --- | --- |
| Native | 28.41 ns | 0.0029 ns | 0.0027 ns | 1.00 | - | 0 B |
| CoreRT | 49.82 ns | 0.0632 ns | 0.0591 ns | 1.75 | 0.0114 | 48 B |
| CoreRT2 | 31.33 ns | 0.0121 ns | 0.0107 ns | 1.10 | - | 0 B |

| MUL | Mean | Error | StdDev | Scaled | Gen 0 | Allocated |
| --- | --- | --- | --- | --- | --- | --- |
| Native | 29.97 ns | 0.0053 ns | 0.0050 ns | 1.00 | - | 0 B |
| CoreRT | 31.71 ns | 0.0456 ns | 0.0427 ns | 1.06 | 0.0114 | 48 B |
| CoreRT2 | 23.19 ns | 0.0157 ns | 0.0139 ns | 0.77 | - | 0 B |

| DIV | Mean | Error | StdDev | Scaled | Gen 0 | Allocated |
| --- | --- | --- | --- | --- | --- | --- |
| Native | 116.17 ns | 0.0433 ns | 0.0405 ns | 1.00 | - | 0 B |
| CoreRT | 182.84 ns | 0.2681 ns | 0.2508 ns | 1.57 | 0.0397 | 167 B |
| CoreRT2 | 84.65 ns | 0.0400 ns | 0.0355 ns | 0.73 | - | 0 B |

Obstacles to further optimisation are dotnet/runtime#4155 / dotnet/runtime#5213 and lack of 128bit DIV/MUL intrinsics. I suspect both of these issues depend on JIT supporting multiple output registers in IR, although 128bit DIV could partially work with only a single output.

@jkotas Should I start integrating these optimisations into the CoreRT repo? When it's faster than the current native implementation in CoreCLR then Decimal can be moved to the shared partition and the native part removed from CoreCLR like you suggested.

And should I open a new issue for 128bit DIV/MUL support in Math class (together with unsigned Math.DivRem and Math.BigMul)? It looks like a better fit in Math than the new platform specific intrinsics. Could be useful for BigInteger also.

Also found a probable bug in VarDecDiv: it looks like the shift with carry is done here using > instead of >=, but it is very unlikely to be hit (rgulRem[0] must be exactly 0x80000000). Same line in the original C++ code.

jkotas commented 7 years ago

@jkotas Should I start integrating these optimisations into the CoreRT repo?

Yes, we would love to have these optimizations in CoreRT repo. Thank you for doing the work!

jkotas commented 7 years ago

And should I open a new issue for 128bit DIV/MUL support

https://github.com/dotnet/corefx/issues/17147 is about 128bit integer support in general. Not sure whether we need a separate issue for just DIV/MUL support at this point.

pentp commented 7 years ago

The Int128 proposal is interesting. However I feel like it should be built on top of some new intrinsics in Math (somewhat similar to the proposal of moving Vector<T> implementation to use new platform specific intrinsics underneath). Using Int128 directly for Decimal (or BigInteger) math requires working against all the extra abstractions to get to the basic operations (similar problems can be observed in the current decimal code, e.g. functions like DivMod64by32 that return quotient in Lo32, remainder in Hi32, while a caller wants to use them separately).
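The DivMod64by32 packing described above can be sketched as follows. This is a behavioural illustration of the convention pentp describes (quotient in the low 32 bits, remainder in the high 32 bits of one 64-bit value), not the actual helper's source:

```cpp
#include <cstdint>

// Returns quotient in Lo32 and remainder in Hi32 of a single uint64_t,
// matching the convention described for the decimal helper. Callers that
// want the two values separately have to shift/mask them back out.
// Assumes the quotient fits in 32 bits (i.e. num / den < 2^32).
static uint64_t DivMod64by32(uint64_t num, uint32_t den)
{
    uint32_t q = (uint32_t)(num / den);
    uint32_t r = (uint32_t)(num % den);
    return ((uint64_t)r << 32) | q;
}
```

The awkwardness is visible at the call site: a caller wanting just the remainder pays for the packing and then immediately undoes it with a shift, which is the kind of abstraction friction the comment is pointing at.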

Daniel-Svensson commented 4 years ago

Closing, since x64 perf is at least a little better now, and most of the relevant performance improvements cannot be ported to C# without the help of intrinsics for basic x64 instructions, which do not seem to be coming in a reasonable time frame.