dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.16k stars 4.72k forks source link

Can "Math.DivRem" do division operation only once? #4155

Closed ygc369 closed 4 months ago

ygc369 commented 9 years ago

Currently it does division twice, but most CPUs support to get the two values by doing division only once. Besides,this method should support uint and ulong too.

category:cq theme:div-mod-rem skill-level:expert cost:large impact:medium

jkotas commented 9 years ago

Math.DivRem is not in the .NET Core public surface, but I agree that it would be nice for the code generators to recognize the pattern and perform the division just one.

cc @CarolEidt @cmckinsey

stephentoub commented 9 years ago

cc: @KrzysztofCwalina, as IIRC, Krzysztof bumped up against this as a measurable cost when doing some recent fomatting/parsing experimentation.

CarolEidt commented 9 years ago

This would not be a very difficult pattern to recognize: a div and a rem with incoming value numbers that match, though it doesn't "fit" with any existing (or other common) optimization patterns. The thing that would be challenging would be to deal with the resulting IR node that would produce two values in registers. Although that would not be all that difficult, it would require more pervasive changes.

tannergooding commented 9 years ago

:+1:

In short term, couldn't Math.DivRem be treated as a special-case (intrinsic)? I wouldn't think that this optimization wouldn't really require many changes.

Long term though, adding an optimization pattern to recognize / and % with matching value members would be great.

CarolEidt commented 9 years ago

@tannergooding - you would think that it would be simpler, but actually it requires the messiest part of the work (handling the node in the IR that produces two values), while only relieving the JIT of what is probably the more straightforward part of the implementation. It might actually be easier on x86 (32 bit) where we already have to deal with longs that occupy multiple registers.

tannergooding commented 7 years ago

@CarolEidt, I'm actually wondering why this would require more pervasive changes?

I would think that instructions like idiv should already document (via some metadata) that they end up writing two registers, correct? As without this being carried around by the code generator, it would potentially end up overwriting the second argument of the parameter list whenever it generates the idiv instruction.

I've not looked at the code, but I would guess that various instructions do indicate what registers end up being input/output by the instruction and that the changes here would be to allow the node to actually carry the second value (which could probably be done via inheritance and type checking). I actually think this is in a similar vein to having an intrinsic that takes 3 arguments where the GenTreeOp currently only allows 2 arguments.

ygc369 commented 4 years ago

@CarolEidt Is there any progress about this?

ygc369 commented 4 years ago

@tannergooding @CarolEidt After more than five years, has this issue been solved?

tannergooding commented 4 years ago

No, but .NET 5 brought about some register allocator changes that should (hopefully) make it simpler to support now.

caihongxu commented 4 years ago

Before this is addressed in .NET 5, any remedies we could do ourselves. e.g. is it possible to roll our own DivRem method to achieve doing division only once?

ygc369 commented 3 years ago

@tannergooding From this blog Announcing .NET 6 Preview 1, I see that there will be a new overload of DivRem for .Net 6, which returns a tuple. Will this new overload be able to do division only once?

tannergooding commented 3 years ago

The new API surface is setup in a way that would enable that optimization on x86/x64. I don't believe there is any instruction to compute quotient and remainder simultaneously on ARM32/ARM64.

Whether or not it gets enabled likely depends on prioritization and availability from the dev's familiar enough with that area of the codebase.

Bip901 commented 2 years ago

@tannergooding @CarolEidt After more than five years, has this issue been solved?

Seems like it's doing a single division now. Still not using the x86 optimization though.

ygc369 commented 2 years ago

I wonder why this issue so difficult to fix? .Net 7 is coming soon, when will this issue be fixed? @tannergooding

tannergooding commented 2 years ago

The issue is complex because you have to take an instruction with very explicit register usage and then map that back to a higher level concept.

In particular, div and idiv are really two instructions in one. They compute the quotient and the remainder "simultaneously". They also don't do simple 32 / 32 = 32 or 64 / 64 = 64, they do 128 / 64 = 64 and 64 / 32 = 32. They have a requirement that the input be in RDX:RAX (or EDX:EAX, or DX:AX, or AH:AL) and then return the result in left (Quotient, so RAX, EAX, AX, or AH) and right (Remainder, so RDX, EDX, DX, or AL).

This then needs to be mapped, at a higher level, back to T DivRem(T left, T right, out T remainder) or (T Quotient, T Remainder) DivRem(T left, T right).

In the case of the first, the Quotient is simple because the "natural" return register is RAX/EAX. The output is slightly problematic because what you have is really an address (so a GC tracked T*).

In the case of the latter, what you have is a struct containing two fields. On some platforms this will be "one register" (often RAX/EAX), on other platforms it will be "two registers" (often RAX:RDX/EAX:EDX), and on others still it will actually be an "out buffer" (so the signature is really ref ValueTuple<T, T> DivRem(out ValueTuple<T, T> result, T left, T right)).

In each of these cases, the JIT has some complexities it has to work around and deal with such as eliding the pointers/indirections and ensuring the data can be consumed directly.

There is also the note that this is an x86/x64 specific problem. Other platforms just have "div" and if you also need the remainder, you compute it exactly as x86/x64 is already in DivRem (remainder = left - (quotient * right)). On modern CPUs this isn't going to be terrible either because modern x86/x64 can optimize the case where you don't consume the remainder. Likewise, if right (the divisor) is a constant, you can often skip using div/idiv entirely and use an alternative sequence that is faster instead (at which point computing the remainder manually is often strictly better). -- One of the main optimizations an x86/x64 compiler can do is avoiding actually emitting div/idiv however it can because it is that expensive (which is something we do already do, so the light up opportunities for actually using a single div/idiv are rarer in the first place).


Now, the support was added to the JIT to better support instructions that "return" multiple registers. However, it is still quite a complex area due to all the scenarios, edge cases, and considerations listed above. We'll get it eventually, but its not the highest priority consideration since its likely not the bottleneck in most applications and when we do get to it, we have to ensure that codegen is good and not worse than the "fallback" we're generating today.

colejohnson66 commented 5 months ago

For an update, in 8.0, DivRem only does one division, then computes the remainder from that.

https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIGYACMhgYQYG8aHunGAKAK4BLAHYYGARQEQMQ7GLQNhYhgCUY+bKIAmMKAEoGAESEA3dfkGjx2RcvHBDnaj1dMA7AwCy2DAAsAOhNzDT5bBkcAbi4eAF8aWKA=

tannergooding commented 5 months ago

DivRem has been implemented to do a division and then compute the remainder using multiplication for many releases now.

On non-xarch platforms, this is the only way to compute the remainder as well, notably. There is no single instruction that computes div+rem simultaneously.

colejohnson66 commented 5 months ago

Yes. So the main complaint that it performs two divisions is no longer valid.

tannergooding commented 5 months ago

And the overall history of the issue, which needs to be taken into account, covers that such a change went in. This issue then still exists to track the underlying spirit of the bug which is that its doing more work than necessary on some platforms.

ygc369 commented 4 months ago

@tannergooding Using multiplication is also extra work, although maybe faster than an extra division. Isn't it possible to get div+rem at the same time just by doing division once and no other operation? Maybe this cannot be done for all platforms as some platforms do not support it by hardware. But there must be platforms that support it, so, can it be done just on supported platforms?

bazzilic commented 4 months ago

multiplication is also extra work, although maybe faster than an extra division

it is indeed significantly faster

can it be done just on supported platforms

some comments above answer this question in great detail, e.g.: https://github.com/dotnet/runtime/issues/4155#issuecomment-1194360439

colejohnson66 commented 4 months ago

On x86, a division can take 15-30 clock cycles (sometimes over 40) and can only start one ever three or four. Multiplication, on the other hand, takes about eight and can start a new one every cycle.

tannergooding commented 4 months ago

Isn't it possible to get div+rem at the same time just by doing division once and no other operation

For x86/x64, yes. It isn't possible on other platforms as they don't provide such support. However, even for x86/x64 the benefit there is really using 1 instruction, it isn't actually much faster (than doing div, mul, sub explicitly). In many cases it is actually more desirable to break it apart to avoid a dependency chain or use an alternative sequence (especially if one of the inputs is known constant).

Modern CPUs (almost any in the past 12-14 years) account for these things and often have special div/idiv support that breaks it apart internally to what is functionally div, mul, sub already so that the quotient can be available prior to the remainder being available. They may even skip the internal microcode for computing the remainder in some cases if they can see its unused. The time is also often dependent on the number of significant bits in the dividend, although even newer processors have changed implementations and now it is often based on the number of significant bits in the quotient instead.

On x86, a division can take 15-30 clock cycles (sometimes over 40) and can only start one ever three or four. Multiplication, on the other hand, takes about eight and can start a new one every cycle.

Notably this very much depends on microarchitecture, whether you're doing signed or unsigned, and it has massively improved in even more recent years

Your multiplication info is also quite a bit off, both 32-bit and 64-bit have been 3 cycles with 1 reciprocal throughput since the early 2000s. If you're doing a big multiplication (64x64=128) then it used to take 7-9 or 3-10 cycles for pre 2011 chips, but that's not typical.