dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.97k stars 4.65k forks

Improve System.Decimal performance for x64 platform #7778

Closed Daniel-Svensson closed 4 years ago

Daniel-Svensson commented 7 years ago

The current implementation forwards calls to the Windows implementation of "decimal" (VarDecAdd, VarDecSub, VarDecMul) or to the port of these functions found under palrt (as far as I understand the code).

These methods are optimised for 32-bit platforms; by using 64-bit instructions it is possible to improve performance significantly.

I have written a small proof of concept to illustrate the gains. I am looking forward to integrating the code with coreclr, but I have some questions and want feedback on how best to integrate it before submitting a PR with those methods.

Questions:

Proof of Concept

I have created x64-aware methods for the arithmetic operations Add, Sub, Mul and Div based on the current code in coreclr. There are no real changes to the algorithms or other logic apart from changing 32-bit arithmetic to 64-bit (and the consequences of that), and using some intrinsics for bit scanning and carry propagation.
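The kinds of 64-bit operations involved can be sketched as follows. This is an illustrative sketch, not code from the actual PoC; it uses GCC/Clang's `unsigned __int128` and builtins as portable stand-ins for the MSVC intrinsics such as `_umul128`, `_addcarry_u64` and `_BitScanReverse64` that an x64 build would typically use:

```cpp
#include <cstdint>

// 64x64 -> 128-bit multiply. MSVC exposes this as the _umul128 intrinsic;
// GCC/Clang lower unsigned __int128 to a single MUL instruction on x64.
static inline uint64_t Mul64x64To128(uint64_t a, uint64_t b, uint64_t* hi)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *hi = (uint64_t)(p >> 64);
    return (uint64_t)p;
}

// Add with carry out: the pattern the ADC instruction / _addcarry_u64 covers.
static inline uint64_t AddCarry64(uint64_t a, uint64_t b, int* carryOut)
{
    uint64_t sum;
    *carryOut = __builtin_add_overflow(a, b, &sum);
    return sum;
}

// Index of the highest set bit (what _BitScanReverse64 / BSR computes).
// Undefined for v == 0, just like the hardware instruction.
static inline int HighestSetBit(uint64_t v)
{
    return 63 - __builtin_clzll(v);
}
```

Working on full 64-bit words like this halves the number of limbs a 96-bit decimal mantissa needs, which is where most of the algorithmic savings come from.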

This is a summary of the measurements from the example project, which can be found at: https://github.com/Daniel-Svensson/ClrExperiments/tree/master/ClrDecimal/coreclrtesting

Measurement

https://github.com/Daniel-Svensson/ClrExperiments/blob/master/ClrDecimal/coreclrtesting/main.cpp

In short, the program generates a number of "semi random" inputs where different numbers of bits are set. It then calls the method under test for all combinations of the inputs.
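That style of input generation can be sketched roughly like this (a hypothetical illustration, not the actual harness, which is in the main.cpp linked above): pick a bit width, then fill the low bits randomly while pinning the top bit so every width from 1 to 64 bits is actually exercised.

```cpp
#include <cstdint>
#include <random>

// Illustrative sketch only: produce a "semi random" 64-bit value whose
// highest set bit is exactly at position bits-1, so test inputs sweep all
// operand widths instead of clustering around full-width values.
static uint64_t RandomWithBitWidth(std::mt19937_64& rng, int bits)
{
    if (bits <= 0)
        return 0;
    uint64_t mask = (bits >= 64) ? ~0ULL : ((1ULL << bits) - 1);
    return (rng() & mask) | (1ULL << (bits - 1)); // pin the top bit
}
```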

Results:

See the results folder (https://github.com/Daniel-Svensson/ClrExperiments/tree/master/ClrDecimal/coreclrtesting/results) for complete output results. I have tried to summarize the results for both a Core i5 2500K and an i7 6700K below. The i5 results were produced with a few minor changes but otherwise the same code; the biggest performance spread is in some of the division tests.

The results below are against the oleaut32 implementation; when compared against the implementations in palrt the results are very similar, since those timings are within a few percent of the oleaut32 timings.

Multiply

Add / Sub

Div

Speedup range: 10-270%. For mixed input (all 00...111 bitpatterns, with all scales and signs): ~100%

| Measurement | Speedup |
| --- | --- |
| 32 x 32 bit | >50% |
| 32 x 32 bit with scale | ~37% |
| 64 x 64 bit no scale | 109-118% |
| 64 x 64 bit varying scale | 94- |
| 96 x 96 bit | >102% |
danmoseley commented 7 years ago

@janvorli @mellinoe

janvorli commented 7 years ago

@Daniel-Svensson thank you for looking into this! The numbers from your experiments look really great.

As for your questions:

Can these improvements make their way to the desktop CLR?

It would first have to be tested whether these changes improve desktop as well. Since on desktop we are calling into OLEAUT32, it is possible that the code in there has already been optimized for 64 bits. @jkotas, I guess that if the perf gains are similar on Windows, we would want to port the change to desktop, right?

Which methods are most common? I would assume that, apart from the basic arithmetic operations, conversions to/from text as well as double/int can be quite common.

It really depends on the actual applications. Decimals are quite special.

Which methods have the most to gain from this? I have focused on +, -, / and * for now, but there might be some other low-hanging fruit.

I would stay with these four operations. If you look at the list of methods on decimal implemented by the native runtime (https://github.com/dotnet/coreclr/blob/master/src/vm/ecalllist.h#L834-L849), you'll see that they all use the VarDecXXX functions that you've already optimized.

How to best integrate it into the coreclr code? I am thinking along the lines of keeping an _x64 suffix on these methods (VarDecAdd_x64) and only including the code for x64 platforms. This would be coupled with a macro to redefine VarDecAdd as VarDecAdd_x64 to route all calls to the x64 implementation.

I wonder if we could somehow unify the code for 32 and 64 bit. What if we implemented the 64 bit intrinsics for 32 bit just in C and see if the perf stays the same for 32 bits. Just from looking at a sample of your functions, it looks like it might be possible. If we cannot do that, I would recommend simply creating a new file called e.g. decarith64.cpp and put all the implementation in there. Then the build would use decarith.cpp for 32 bit builds and decarith64.cpp for 64 bit builds. If there are common functions that share implementation, then we could have three of these (e.g. decarith.cpp, decarith32.cpp and decarith64.cpp).
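The idea of expressing the 64-bit "intrinsic" in plain C so the same source compiles on 32-bit targets could look roughly like the sketch below (the function name is made up for illustration; this is the classic schoolbook decomposition into 32x32 partial products, not code from the actual patch):

```cpp
#include <cstdint>

// Portable 64x64 -> 128-bit multiply built only from 32x32 -> 64-bit
// multiplies, so it compiles for 32-bit targets; on x64 a compiler is
// free to collapse it, or the intrinsic version can be substituted.
static uint64_t Mul64By64To128Portable(uint64_t a, uint64_t b, uint64_t* hi)
{
    uint64_t aLo = (uint32_t)a, aHi = a >> 32;
    uint64_t bLo = (uint32_t)b, bHi = b >> 32;

    uint64_t ll = aLo * bLo;   // four 32x32 -> 64-bit partial products
    uint64_t lh = aLo * bHi;
    uint64_t hl = aHi * bLo;
    uint64_t hh = aHi * bHi;

    // middle column: cannot overflow 64 bits (at most three 32-bit terms)
    uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;

    *hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    return (mid << 32) | (uint32_t)ll;
}
```

Measuring this variant on x86 against the existing hand-tuned 32-bit paths would show whether unification costs anything there.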

I am not really proficient with cmake, so if I run into problems with the integration it would be great if someone was willing to help.

I'd be happy to give you guidance on the cmake stuff.

janvorli commented 7 years ago

I also have a question - will the intrinsics used in your changes work for Unix ARM64?

jkotas commented 7 years ago

@jkotas, I guess that if the perf gains are similar on Windows, we would want to port the change to desktop, right?

Desktop is calling the implementation of these methods in oleaut32. I doubt that fixing an oleaut32 perf problem by duplicating its implementation would fly for desktop. It would most likely turn into a suggestion to Windows to fix the perf problem in oleaut32.

Ideally, we would have fully managed implementations of these methods in C# in .NET Core. There is a C# implementation of these methods in CoreRT: https://github.com/dotnet/corert/blob/master/src/System.Private.CoreLib/src/System/Decimal.DecCalc.cs . I would recommend focusing the optimization effort on this C# implementation, and once it is clearly better than what we have in CoreCLR today, switching CoreCLR to use it as well.

janvorli commented 7 years ago

Ah ok, I hadn't realized we already have a managed implementation in CoreRT. So I agree with @jkotas that we should focus on improving the managed implementation. @Daniel-Svensson, your measurements would give a good estimate of how far we could possibly get.

Daniel-Svensson commented 7 years ago

Just some quick replies; I will try to get some time this weekend to address most of the other comments.

Coreclr and managed implementations

@janvorli @jkotas

Regarding focusing on the managed version: being able to use the full width of 64-bit instructions, that is 64-bit multiply with a 128-bit result and division of 128 bits by 64 bits, was key to the performance improvements. That allowed most of the reduction in logic and cut the expected number of loop iterations in half. Without that, I expect it is possible to get at most double-digit improvements.

It should be possible to get a few percentage points of improvement if we somehow get access to bit-scanning operations, but without 128-bit results I see no way to come anywhere close to most of these numbers.

ARM64

I don't think the intrinsics work as they are right now for ARM64, but most of them should hopefully be easy to fix. There are currently a few other assumptions embedded in the code, such as being able to cast DWORD64 to DWORD32 and back, and having index 2 (and 0, etc.) in a 32-bit array refer to the lower 32 bits of the corresponding 64-bit word.

That said, the code can probably be used as a basis for ARM64. The corresponding instructions (add, add with carry, etc.) as well as multiplication of two unsigned 64-bit words producing a 128-bit result (two instructions: UMULL, UMULH) seem to be available. I am however unsure whether there is an efficient way to do a "128 by 64" divide on ARM64 (a 64-bit divide with "carry").
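Where no hardware 128-by-64 divide exists, a portable fallback would be something like the classic restoring shift-subtract loop below. This is an illustrative sketch under the usual precondition `hi < divisor` (so the 64-bit quotient cannot overflow), not code from the prototype; production implementations typically use faster reciprocal-based routines instead of one iteration per bit.

```cpp
#include <cstdint>

// Divide the 128-bit value (hi:lo) by a 64-bit divisor, one quotient bit
// per iteration. Requires hi < divisor so the quotient fits in 64 bits.
static uint64_t Div128By64(uint64_t hi, uint64_t lo, uint64_t divisor, uint64_t* rem)
{
    uint64_t quotient = 0;
    uint64_t r = hi;                    // invariant: r < divisor
    for (int i = 63; i >= 0; --i) {
        int carry = (int)(r >> 63);     // bit shifted out past 64 bits
        r = (r << 1) | ((lo >> i) & 1); // bring down the next dividend bit
        quotient <<= 1;
        if (carry || r >= divisor) {    // carry means the true r exceeds 2^64
            r -= divisor;               // wrapping subtract is still correct
            quotient |= 1;
        }
    }
    *rem = r;
    return quotient;
}
```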

source: https://www.element14.com/community/servlet/JiveServlet/previewBody/41836-102-1-229511/ARM.Reference_Manual.pdf

Unifying 32 and 64bit

A reasonable amount of the code can probably be shared, such as some of the helper methods, but the impact in 32-bit mode must be measured. While it would be possible to just implement some of the instructions as functions on 32-bit/(non-x86) platforms, it is quite hard to tell beforehand what the performance impact would be.

The original implementation seems quite optimized for 32-bit, with more short-circuits and special cases than required in the prototype. I do however think it might be worth measuring the impact of unifying some of the methods; they will not be optimal for x86, but it is possible that some may improve or have only a very small impact.

Mixed comments

Since on desktop, we are calling into OLEAUT32, it is possible that the code in there was already optimized for 64 bits.

My numbers are primarily measurements against the oleaut32 implementations on 64-bit Windows (with coreclr/palrt as separate measurements against the code in decimal/decarith.cpp). I was assuming that https://github.com/dotnet/coreclr/blob/master/src/classlibnative/bcltype/decimal.cpp is the code that https://github.com/dotnet/coreclr/blob/master/src/vm/ecalllist.h#L834-L849 refers to. So it does not look like they are optimized in Windows 10 Anniversary, and I doubt that will change soon.

Desktop is calling the implementation of these methods in oleaut32. I doubt that fixing a oleaut32 perf problem by duplicating its implementation would fly for desktop. It would most likely turn into suggestion to Windows to fix the perf problem in oleaut32.

I would love to have the code integrated to get the benefits, but Windows is not exactly OSS, so that seems like a dead end to me. I have seen a few requests to have the performance tuned for x64, with maybe the first dating back to the days of Windows Vista. However, having the code in Windows would only improve the Windows desktop while leaving the other platforms behind.

Where to insert the code (if anywhere), and additional questions

  1. Where to update: I guessed that the correct place would be to update classlibnative/bcltype; I see no other way for the changes to take effect on Windows otherwise. palrt and decarith*.cpp sound like they are only used for some platforms, and I did not see any project reference those files when compiling with VS 2017 on Windows.
  2. Decimal divide: Currently there are two variants of division in decimal.cpp (DoDivideThrow and DoDivide) as well as VarDecDiv in decarith.cpp (palrt). a) Do you know why there are so many different implementations (the first two do not seem to perform any better than VarDecDiv)? b) What about rewriting DoDivideThrow and DoDivide so that they both utilize VarDecDiv instead?
Daniel-Svensson commented 7 years ago

I wonder if we could somehow unify the code for 32 and 64 bit. What if we implemented the 64 bit intrinsics for 32 bit just in C and see if the perf stays the same for 32 bits.

@janvorli I tried the approach and it seems to work fine after some minor tweaking. I now have mostly unified code for x64 and 32bit (with and without x86 asm).

I also have a question - will the intrinsics used in your changes work for Unix ARM64?

Not directly. It should be possible to implement at least most of them quite efficiently for both ARM64 and ARM32.

I would love to have the code integrated to get the benefits, but Windows is not exactly OSS so that seems like a dead end to me.

Feel free to contact me if anybody is interested in using my work to improve the Windows/"oleaut" implementation.

Updated measurements

I have made very few tweaks to the 32-bit implementation, so there might be some additional easy things to do to increase performance. These measurements were made on an Intel Core i5 2500K; other architectures will probably show slightly different numbers.

| Mul | x64 - previous version | x64 | x86 | w32 - no asm |
| --- | --- | --- | --- | --- |
| compared against | oleaut | oleaut | coreclr | coreclr |
| 32 x 32 bit | >17% | 14% | -1% | -2% |
| 32 x 32 bit with scale | >127% | 126% | 4% | -5% |
| 64 x 64 -> 64 bit result | >195% | >195% | 55% | 43% |
| 64 x 64 -> 128 bit result | 144% | 137% | 26% | 25% |
| 64 x 64 -> 128 bit result with rescale | 124% | 158% | 67% | 11% |
| 96 x 96 bit with overflow | >270% | 219% | 5% | 6% |
| 96 x 96 bit no overflow | >210% | 279% | 117% | 10% |
| Mixed input (all 00...111 bitpatterns, with all scales and signs) | ~160% | 226% | 99% | 8% |

| Add | x64 - previous version | x64 | x86 | w32 - no asm |
| --- | --- | --- | --- | --- |
| 32 x 32 bit | 11% | ~9% | 20% | 17% |
| 32 x 32 bit with scale | 21% | ~30% | 6% | 7% |
| 64 + 32 bit varying scale | 60% | 82% | 5% | -3% |
| 64 x 64 bit no scale | 31% | 31% | 20% | 17% |
| 64 x 64 bit varying scale | 42% | 36% | 8% | 9% |
| 96 x 96 bit | 67-90% | 105-128% | 70-78% | 17% |
| Mixed input (all 00...111 bitpatterns, with all scales and signs) | 110% | 153% | 70% | 14% |

| Sub | x64 - previous version | x64 | x86 | w32 - no asm |
| --- | --- | --- | --- | --- |
| 32 x 32 bit | 1% | 4% | 3% | 0% |
| 32 x 32 bit with scale | 20% | 30% | 5% | 6% |
| 64 + 32 bit varying scale | 64% | 85% | 4% | -2% |
| 64 x 64 bit no scale | 1% | 4% | 2% | 0% |
| 64 x 64 bit varying scale | 36% | 37% | 6% | 8% |
| 96 x 96 bit | 70-97% | 110-132% | 65-75% | 18% |
| Mixed input (all 00...111 bitpatterns, with all scales and signs) | 110% | 153% | 69% | 14% |

Where to go from here?

I would prefer to try to update the current implementation and see what the final perf impact would be.

janvorli commented 7 years ago

@Daniel-Svensson I am really sorry for not responding for such a long time. I have seen your update and wanted to discuss it further with @jkotas, but I have completely forgotten about it. These improvements are huge. @jkotas can you please take a look at the updated results above? My thinking here is that as a next step it would be great to measure the performance of the managed implementation @jkotas has referred to and add it to the tables above to get an idea on how it stands and if it looks like there is a chance to get close to these numbers with managed implementation.

Daniel-Svensson commented 7 years ago

@janvorli @jkotas I am in no hurry with this issue, but I am hoping to finish it this year. There is some minor cleanup left to do, but I am not working much on this at the moment. I have however updated the divide implementation to also support 32-bit code and might post updated performance numbers within the next few weeks, given a rainy day.

It should be possible to add the managed implementation to the benchmark project in order to compare it.

I've added a managed C++ wrapper for the new code and added a few simple benchmarks as a start.

New divide measurements

| Div | x64 - previous | x64 | x86 | x86 - no asm |
| --- | --- | --- | --- | --- |
| compared against | oleaut | oleaut | coreclr | coreclr |
| 32 x 32 bit | 50% | 84% | 49% | 3% |
| 32 x 32 bit with scale | 37% | 70% | 40% | 0% |
| 64 x 32 mix and varying scale | - | 144% | 58% | 15% |
| 64 x 64 bit no scale | 110-118% | 117% | 36% | 1% |
| 64 x 64 bit varying scale | 94% | 110% | 34% | 0% |
| 96 x 96 bit small scale | >=102% | 154% | 45% | 11% |
| 96 x 96 bit large scale | >=102% | 149% | 44% | 10% |
| Mixed input (all 00...111 bitpatterns, with all scales and signs) | 100% | 132% | 34% | 7% |

Both x86 implementations are significantly slower than the Windows version in oleaut32, which should be used for RyuJIT x86 (it seems to use the coreclr version).

BenchmarkDotNet measurements for a single invocation

Update: removed benchmarks, since updated values can be found in the next post. "Method" explanation:

- NetFramework: Managed method in the C++ assembly that calls the System.Decimal methods
- Native: The new code
- Ole32: Calls the oleaut implementation (the current implementation in Windows). This should be the "same" code as the "NetFramework" implementation, but without error handling and with additional overhead for managed/native interop.

Summary for managed benchmarks

Daniel-Svensson commented 7 years ago

New Measurements for C# Benchmark

I've managed to add the decimal implementation from CoreRT as well as compile a custom coreclr build (commit 6747fa19a30992f9efbf3eab5130e25b80e79686), both with and without the code changes to the bclnative library.

Below follows measurements for

"Method" explanation:

The benchmarks being run are:

The C# benchmarks are currently nowhere near as exhaustive as the C++ variants, since they only measure a single invocation with constant values. The C# benchmarks can be made much better (feel free to add more tests :) ), but I think they at least highlight that the new native code would result in measurable performance gains.

Custom CoreCLR build (x64)

Both the benchmark dll and the CoreRT dll with the CoreRT decimal type have been crossgen'ed.


BenchmarkDotNet=v0.10.9, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i5-2500K CPU 3.30GHz (Sandy Bridge), ProcessorCount=4
Frequency=14318180 Hz, Resolution=69.8413 ns, Timer=HPET
.NET Core SDK=2.0.0-preview2-006497
  [Host] : .NET Core ? (Framework 4.6.00001.0), 64bit RyuJIT

Toolchain=InProcessToolchain  

Mul 96by96

| Method | Mean | Error | StdDev | Scaled | Build |
| --- | --- | --- | --- | --- | --- |
| NetFramework | 71.22 ns | 0.2254 ns | 0.2109 ns | 0.39 | modified code |
| Native | 82.22 ns | 0.2050 ns | 0.1918 ns | 0.45 | modified code |
| Ole32 | 181.63 ns | 0.9097 ns | 0.8509 ns | 1.00 | modified code |
| CoreCRTManaged | 174.09 ns | 0.4658 ns | 0.3890 ns | 0.96 | modified code |
| NetFramework | 176.53 ns | 0.1816 ns | 0.1699 ns | 0.97 | original code |
| Native | 83.01 ns | 0.0913 ns | 0.0854 ns | 0.46 | original code |
| Ole32 | 181.45 ns | 0.0797 ns | 0.0622 ns | 1.00 | original code |
| CoreCRTManaged | 175.01 ns | 0.2057 ns | 0.1823 ns | 0.96 | original code |

Div 96by96

| Method | Mean | Error | StdDev | Scaled | Build |
| --- | --- | --- | --- | --- | --- |
| NetFramework | 101.21 ns | 0.2884 ns | 0.2698 ns | 0.47 | modified code |
| Native | 95.39 ns | 0.0210 ns | 0.0164 ns | 0.44 | modified code |
| Ole32 | 215.59 ns | 0.8496 ns | 0.7947 ns | 1.00 | modified code |
| CoreCRTManaged | 281.16 ns | 1.4220 ns | 1.2605 ns | 1.30 | modified code |
| NetFramework | 207.28 ns | 0.1473 ns | 0.1377 ns | 0.95 | original code |
| Native | 98.60 ns | 0.0551 ns | 0.0515 ns | 0.45 | original code |
| Ole32 | 217.55 ns | 0.4650 ns | 0.4349 ns | 1.00 | original code |
| CoreCRTManaged | 281.43 ns | 2.7054 ns | 2.5306 ns | 1.29 | original code |

Note: it seems strange that the PInvoke method is faster than the NetFramework implementation when both perform the same amount of work; it might be that the file is compiled with different optimization settings in coreclr.

Add 96by96

| Method | Mean | Error | StdDev | Scaled | Build |
| --- | --- | --- | --- | --- | --- |
| NetFramework | 23.592 ns | 0.1448 ns | 0.1283 ns | 0.65 | modified code |
| Native | 36.217 ns | 0.0114 ns | 0.0095 ns | 1.00 | modified code |
| Ole32 | 36.098 ns | 0.3180 ns | 0.2975 ns | 1.00 | modified code |
| CoreCRTManaged | 65.480 ns | 0.4731 ns | 0.4425 ns | 1.81 | modified code |
| NetFramework | 20.111 ns | 0.0618 ns | 0.0578 ns | 0.56 | original code |
| Native | 35.983 ns | 0.0361 ns | 0.0338 ns | 0.99 | original code |
| Ole32 | 36.180 ns | 0.0254 ns | 0.0238 ns | 1.00 | original code |
| CoreCRTManaged | 65.792 ns | 0.1370 ns | 0.1214 ns | 1.82 | original code |

Note: it seems like there could be a perf regression for this specific pair; a larger set of numbers, such as those in the C++ project, would give more reliable measurements.

64bit results .NET 4.7


BenchmarkDotNet=v0.10.9, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i5-2500K CPU 3.30GHz (Sandy Bridge), ProcessorCount=4
Frequency=14318180 Hz, Resolution=69.8413 ns, Timer=HPET
  [Host]       : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2101.1
  Platform=X64  Runtime=Clr  

Mul 96by96

| Method | Mean | Error | StdDev | Scaled | Gen 0 | Allocated |
| --- | --- | --- | --- | --- | --- | --- |
| NetFramework | 174.16 ns | 0.0349 ns | 0.0326 ns | 0.97 | - | 0 B |
| Native | 74.06 ns | 0.0134 ns | 0.0125 ns | 0.41 | - | 0 B |
| Ole32 | 179.76 ns | 0.0462 ns | 0.0386 ns | 1.00 | - | 0 B |
| CoreCRTManaged | 165.21 ns | 2.3580 ns | 2.2057 ns | 0.92 | 0.0150 | 48 B |
| CoreCRTManaged2 | 163.85 ns | 0.0568 ns | 0.0503 ns | 0.91 | - | 0 B |

Div 96by96

| Method | Mean | Error | StdDev | Scaled | Gen 0 | Allocated |
| --- | --- | --- | --- | --- | --- | --- |
| NetFramework | 204.50 ns | 0.0171 ns | 0.0143 ns | 0.95 | - | 0 B |
| Native | 96.87 ns | 0.0309 ns | 0.0258 ns | 0.45 | - | 0 B |
| Ole32 | 215.84 ns | 0.0189 ns | 0.0147 ns | 1.00 | - | 0 B |
| CoreCRTManaged | 237.05 ns | 0.2486 ns | 0.2203 ns | 1.10 | 0.0505 | 160 B |
| CoreCRTManaged2 | 205.71 ns | 0.1383 ns | 0.1294 ns | 0.95 | - | 0 B |

Add 96by96

| Method | Mean | Error | StdDev | Scaled | Gen 0 | Allocated |
| --- | --- | --- | --- | --- | --- | --- |
| NetFramework | 15.789 ns | 0.0040 ns | 0.0036 ns | 0.53 | - | 0 B |
| Native | 26.448 ns | 0.0079 ns | 0.0066 ns | 0.90 | - | 0 B |
| Ole32 | 29.515 ns | 0.0017 ns | 0.0014 ns | 1.00 | - | 0 B |
| CoreCRTManaged | 65.021 ns | 0.1426 ns | 0.1264 ns | 2.20 | 0.0279 | 88 B |
| CoreCRTManaged2 | 63.226 ns | 0.1408 ns | 0.1248 ns | 2.14 | 0.0126 | 40 B |

32bit results


BenchmarkDotNet=v0.10.9, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i5-2500K CPU 3.30GHz (Sandy Bridge), ProcessorCount=4
Frequency=14318180 Hz, Resolution=69.8413 ns, Timer=HPET
  [Host]     : .NET Framework 4.7 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.7.2101.1
  DefaultJob : .NET Framework 4.7 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.7.2101.1

Mul 96by96

| Method | Mean | Error | StdDev | Scaled | ScaledSD |
| --- | --- | --- | --- | --- | --- |
| NetFramework | 62.82 ns | 0.5111 ns | 0.4781 ns | 0.86 | 0.01 |
| Native | 77.69 ns | 0.2698 ns | 0.2392 ns | 1.06 | 0.01 |
| Ole32 | 73.06 ns | 0.3531 ns | 0.3131 ns | 1.00 | 0.00 |
| CoreCRTManaged | 242.85 ns | 1.8053 ns | 1.6004 ns | 3.32 | 0.03 |

Div 96by96

| Method | Mean | Error | StdDev | Scaled | ScaledSD |
| --- | --- | --- | --- | --- | --- |
| NetFramework | 182.3 ns | 0.4134 ns | 0.3665 ns | 1.82 | 0.01 |
| Native | 139.4 ns | 1.4004 ns | 1.3099 ns | 1.39 | 0.02 |
| Ole32 | 100.1 ns | 0.7030 ns | 0.6576 ns | 1.00 | 0.00 |
| CoreCRTManaged | 363.6 ns | 2.5959 ns | 2.4282 ns | 3.63 | 0.03 |

Add 96by96

| Method | Mean | Error | StdDev | Scaled |
| --- | --- | --- | --- | --- |
| NetFramework | 20.111 ns | 0.0618 ns | 0.0578 ns | 0.56 |
| Native | 35.983 ns | 0.0361 ns | 0.0338 ns | 0.99 |
| Ole32 | 36.180 ns | 0.0254 ns | 0.0238 ns | 1.00 |
| CoreCRTManaged | 65.792 ns | 0.1370 ns | 0.1214 ns | 1.82 |

Summary/Conclusions

CoreRT managed implementation

pentp commented 7 years ago

Using only two constants for performance testing isn't sufficient - the CPU branch predictor will skip large parts of the code.

The current CoreRT implementation is a straightforward port from oleaut32 optimised for correctness and can definitely be improved. It allocates memory on every add/mul/div/round operation, uses 32bit math, etc. Like the proposed native variant, it could also benefit from runtime/JIT support for 64/128bit mul/div operations (maybe as private intrinsics).

Daniel-Svensson commented 7 years ago

@pentp thanks for the tip about optimizing the managed implementation. I've updated most of the implementation to remove allocations (but I did not find all of them) and updated the results for .NET 4.7.

As for using only constants: the intention was to see if the performance from the more rigorous C++ benchmarks translates all the way to C# code. It should be fairly straightforward to add new tests which, just as the C++ benchmarks, process many more numbers.

janvorli commented 7 years ago

@Daniel-Svensson by "decimal.cpp from coreRT" you meant "decimal.cs from coreRT", right?

Daniel-Svensson commented 7 years ago

Thanks for spotting that @janvorli, I've edited the comment.

pentp commented 7 years ago

Because I like (fast) managed code I made some basic optimisations to the CoreRT decimal code (no significant changes in algorithm structure) and measured using a weighted random distribution of decimals that might better represent real world use (gist is sorted for better overview). A total of 440^2=~190K combinations were tested (excluding overflows/div-by-0).


BenchmarkDotNet=v0.10.9, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i7-7700K CPU 4.20GHz (Kaby Lake), ProcessorCount=8
.NET Core SDK=2.0.0-preview2-006127
  [Host] : .NET Core ? (Framework 4.6.00001.0), 64bit RyuJIT

Toolchain=InProcessToolchain  
| ADD | Mean | Error | StdDev | Scaled | Gen 0 | Allocated |
| --- | --- | --- | --- | --- | --- | --- |
| Native | 28.41 ns | 0.0029 ns | 0.0027 ns | 1.00 | - | 0 B |
| CoreRT | 49.82 ns | 0.0632 ns | 0.0591 ns | 1.75 | 0.0114 | 48 B |
| CoreRT2 | 31.33 ns | 0.0121 ns | 0.0107 ns | 1.10 | - | 0 B |

| MUL | Mean | Error | StdDev | Scaled | Gen 0 | Allocated |
| --- | --- | --- | --- | --- | --- | --- |
| Native | 29.97 ns | 0.0053 ns | 0.0050 ns | 1.00 | - | 0 B |
| CoreRT | 31.71 ns | 0.0456 ns | 0.0427 ns | 1.06 | 0.0114 | 48 B |
| CoreRT2 | 23.19 ns | 0.0157 ns | 0.0139 ns | 0.77 | - | 0 B |

| DIV | Mean | Error | StdDev | Scaled | Gen 0 | Allocated |
| --- | --- | --- | --- | --- | --- | --- |
| Native | 116.17 ns | 0.0433 ns | 0.0405 ns | 1.00 | - | 0 B |
| CoreRT | 182.84 ns | 0.2681 ns | 0.2508 ns | 1.57 | 0.0397 | 167 B |
| CoreRT2 | 84.65 ns | 0.0400 ns | 0.0355 ns | 0.73 | - | 0 B |

Obstacles to further optimisation are dotnet/runtime#4155 / dotnet/runtime#5213 and lack of 128bit DIV/MUL intrinsics. I suspect both of these issues depend on JIT supporting multiple output registers in IR, although 128bit DIV could partially work with only a single output.

@jkotas Should I start integrating these optimisations into the CoreRT repo? When it's faster than the current native implementation in CoreCLR then Decimal can be moved to the shared partition and the native part removed from CoreCLR like you suggested.

And should I open a new issue for 128bit DIV/MUL support in Math class (together with unsigned Math.DivRem and Math.BigMul)? It looks like a better fit in Math than the new platform specific intrinsics. Could be useful for BigInteger also.

Also found a probable bug in VarDecDiv: it looks like the shift with carry is done here using > instead of >=, but it is very unlikely to be hit (rgulRem[0] must be exactly 0x80000000). Same line in the original C++ code.

jkotas commented 7 years ago

@jkotas Should I start integrating these optimisations into the CoreRT repo?

Yes, we would love to have these optimizations in CoreRT repo. Thank you for doing the work!

jkotas commented 7 years ago

And should I open a new issue for 128bit DIV/MUL support

https://github.com/dotnet/corefx/issues/17147 is about 128bit integer support in general. Not sure whether we need a separate issue for just DIV/MUL support at this point.

pentp commented 7 years ago

The Int128 proposal is interesting. However I feel like it should be built on top of some new intrinsics in Math (somewhat similar to the proposal of moving Vector<T> implementation to use new platform specific intrinsics underneath). Using Int128 directly for Decimal (or BigInteger) math requires working against all the extra abstractions to get to the basic operations (similar problems can be observed in the current decimal code, e.g. functions like DivMod64by32 that return quotient in Lo32, remainder in Hi32, while a caller wants to use them separately).
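The DivMod64by32 packing described above can be sketched as follows. This is a behavioural illustration of the convention pentp describes (quotient in the low 32 bits, remainder in the high 32 bits of one 64-bit value), not the actual helper's source:

```cpp
#include <cstdint>

// Returns quotient in Lo32 and remainder in Hi32 of a single uint64_t,
// matching the convention described for the decimal helper. Callers that
// want the two values separately have to shift/mask them back out.
// Assumes the quotient fits in 32 bits (i.e. num / den < 2^32).
static uint64_t DivMod64by32(uint64_t num, uint32_t den)
{
    uint32_t q = (uint32_t)(num / den);
    uint32_t r = (uint32_t)(num % den);
    return ((uint64_t)r << 32) | q;
}
```

The awkwardness is visible at the call site: a caller wanting just the remainder pays for the packing and then immediately undoes it with a shift, which is the kind of abstraction friction the comment is pointing at.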

Daniel-Svensson commented 4 years ago

Closing, since x64 perf is at least a little better now, and most of the relevant performance improvements cannot be ported to C# without the help of intrinsics for basic x64 instructions, which do not seem to be coming in a reasonable time frame.