We need a way to specify FP denormal behavior on a per-instruction basis

Quuxplusone commented 7 years ago


Bugzilla Link	PR31285
Status	NEW
Importance	P normal
Reported by	Andy Kaylor (andrew.kaylor@intel.com)
Reported on	2016-12-05 18:47:21 -0800
Last modified on	2020-05-17 22:45:07 -0700
Version	unspecified
Hardware	PC All
CC	andrew.kaylor@intel.com, david.l.kreitzer@intel.com, hfinkel@anl.gov, llvm-bugs@lists.llvm.org, Matthew.Arsenault@amd.com, paul_robinson@playstation.sony.com, spatel+llvm@rotateright.com, yuanfang.chen@sony.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

As per the discussion in the comments for https://reviews.llvm.org/D27028, we need a way to control flush-to-zero behavior of floating point operations on a per instructions basis.

Currently there is a TargetMachine option and a set of function attributes for controlling denormal behavior. It isn't clear to me whether this approach is sufficient for the needs of all architectures.

Quuxplusone commented 7 years ago

I think we should have .ftz variants of the various rounding modes, and then a function attribute for the default mode the standard operations will use.

The ftz is needed to decide how to correctly lower some operations on AMDGPU. Disabling denormals in these cases is not an optimization, although some other optimizations are allowed when denormals are disabled. Implementing some of the standard operations requires emitting intermediate code which does modify the FP environment which for now we are using custom target nodes for the side effecting FP instructions.

Quuxplusone commented 7 years ago

Matt, it sounds like you are requesting a mechanism for enforcing flush-to-zero
behavior as opposed to merely permitting it. Is that accurate?

If we want to provide the same level of generality as

  namespace FPDenormal {
    enum DenormalMode {
      IEEE,           // IEEE 754 denormal numbers
      PreserveSign,   // the sign of a flushed-to-zero number is preserved in
                      // the sign of 0
      PositiveZero    // denormals are flushed to positive zero
    };
  }

then folding flush-to-zero behavior into the rounding mode argument of the
constrained intrinsics seems impractical. Maybe we want to add a third metadata
argument for it? Also, do we need to make a distinction between output
denormals and input denormals? For x86, the MXCSR has separate FTZ and DAZ
controls for output and input denormals, respectively. I don't know what other
architectures provide.

Quuxplusone commented 7 years ago

(In reply to comment #2)
> Matt, it sounds like you are requesting a mechanism for enforcing
> flush-to-zero behavior as opposed to merely permitting it. Is that accurate?
>
> If we want to provide the same level of generality as
>
>   namespace FPDenormal {
>     enum DenormalMode {
>       IEEE,           // IEEE 754 denormal numbers
>       PreserveSign,   // the sign of a flushed-to-zero number is preserved in
>                       // the sign of 0
>       PositiveZero    // denormals are flushed to positive zero
>     };
>   }
>
> then folding flush-to-zero behavior into the rounding mode argument of the
> constrained intrinsics seems impractical. Maybe we want to add a third
> metadata argument for it?

This was my recommendation; and I think we should mirror exactly the options we
have in the backend. If we need any extensions to the set, then we should add
them both to the set the backend understands and also the set the intrinsics
accept.

> Also, do we need to make a distinction between
> output denormals and input denormals? For x86, the MXCSR has separate FTZ
> and DAZ controls for output and input denormals, respectively. I don't know
> what other architectures provide.

Quuxplusone commented 7 years ago

(In reply to comment #2)
> Matt, it sounds like you are requesting a mechanism for enforcing
> flush-to-zero behavior as opposed to merely permitting it. Is that accurate?

Yes. The lowerings for some operations (fdiv in particular) will change
depending on whether denormals are enabled or not since they are expansions and
not a single instruction like for example on x86. Additionally some of the
OpenCL builtin library math function implementations also need to ensure that
denormals are enabled or disabled for some subsequence of the function.

>
> If we want to provide the same level of generality as
>
>   namespace FPDenormal {
>     enum DenormalMode {
>       IEEE,           // IEEE 754 denormal numbers
>       PreserveSign,   // the sign of a flushed-to-zero number is preserved in
>                       // the sign of 0
>       PositiveZero    // denormals are flushed to positive zero
>     };
>   }
>
> then folding flush-to-zero behavior into the rounding mode argument of the
> constrained intrinsics seems impractical. Maybe we want to add a third
> metadata argument for it? Also, do we need to make a distinction between
> output denormals and input denormals? For x86, the MXCSR has separate FTZ
> and DAZ controls for output and input denormals, respectively. I don't know
> what other architectures provide.

AMDGPU's FP mode can be set to flush input or output denormals (although I am
pretty sure we only ever set input/output to the same value). The subtarget
feature we've been using for the default mode doesn't distinguish these.

Quuxplusone commented 7 years ago

(In reply to comment #4)
> Yes. The lowerings for some operations (fdiv in particular) will change
> depending on whether denormals are enabled or not since they are expansions
> and not a single instruction like for example on x86. Additionally some of
> the OpenCL builtin library math function implementations also need to ensure
> that denormals are enabled or disabled for some subsequence of the function.

If I may drift toward the front end a bit here, how do you envision the
denormal argument being set?  Is there some language specific syntax that you
expect to trigger it?

In the case of the rounding mode argument as I've proposed it, I'm imaging the
front end will always set it to dynamic unless a pragma is used to declare the
rounding mode for a given scope and then hopefully I'll be able to write a pass
that looks for calls that specifically set the rounding mode and replaces the
argument for instructions that we can prove will have a specific value.

I'm just trying to understand how ftz would work as an argument to the
intrinsics.

Quuxplusone commented 7 years ago

(In reply to comment #5)
> (In reply to comment #4)
> > Yes. The lowerings for some operations (fdiv in particular) will change
> > depending on whether denormals are enabled or not since they are expansions
> > and not a single instruction like for example on x86. Additionally some of
> > the OpenCL builtin library math function implementations also need to ensure
> > that denormals are enabled or disabled for some subsequence of the function.
>
> If I may drift toward the front end a bit here, how do you envision the
> denormal argument being set?  Is there some language specific syntax that
> you expect to trigger it?
>
> In the case of the rounding mode argument as I've proposed it, I'm imaging
> the front end will always set it to dynamic unless a pragma is used to
> declare the rounding mode for a given scope and then hopefully I'll be able
> to write a pass that looks for calls that specifically set the rounding mode
> and replaces the argument for instructions that we can prove will have a
> specific value.
>
> I'm just trying to understand how ftz would work as an argument to the
> intrinsics.

A pragma like the rounding mode I suppose. For the main use cases we have
writing the handful of wrapper functions necessary in IR would probably be good
enough.

Quuxplusone commented 7 years ago

(In reply to comment #4)
> Yes. The lowerings for some operations (fdiv in particular) will change
> depending on whether denormals are enabled or not since they are expansions
> and not a single instruction like for example on x86. Additionally some of
> the OpenCL builtin library math function implementations also need to ensure
> that denormals are enabled or disabled for some subsequence of the function.

If I understand correctly you want the semantics of the ftz argument to be such
that the lowering somehow enforces or guarantees the ftz mode.  This is
different from what I have proposed for the rounding mode argument.

What I intended with my current proposal for constrained FP intrinsics is that
the rounding mode argument will simply tell the optimizer what assumptions it
can make about the rounding mode at any given time and if additional
instructions are needed to set the rounding mode those will have been inserted
separately (by the front end).

I'm not familiar with the AMD GPU architecture, but for X86 targets having the
intrinsics guarantee either ftz mode or rounding mode could require, in a worst
case scenario, that an extra LDMXCSR instruction (or the x87 equivalent) be
inserted before every floating point operation.  We could, theoretically,
eliminate this instruction in cases where we could prove that the mode was
already what we wanted it to be, but my preference would be to have the front
end insert calls to set the FP environment where needed and simply allow
CodeGen to rely on the fact that the mode was already set.

Does this present a conflict with the needs of the AMD GPU architecture?

Quuxplusone commented 7 years ago

(In reply to comment #7)
> (In reply to comment #4)
> > Yes. The lowerings for some operations (fdiv in particular) will change
> > depending on whether denormals are enabled or not since they are expansions
> > and not a single instruction like for example on x86. Additionally some of
> > the OpenCL builtin library math function implementations also need to ensure
> > that denormals are enabled or disabled for some subsequence of the function.
>
> If I understand correctly you want the semantics of the ftz argument to be
> such that the lowering somehow enforces or guarantees the ftz mode.  This is
> different from what I have proposed for the rounding mode argument.
>
> What I intended with my current proposal for constrained FP intrinsics is
> that the rounding mode argument will simply tell the optimizer what
> assumptions it can make about the rounding mode at any given time and if
> additional instructions are needed to set the rounding mode those will have
> been inserted separately (by the front end).
>
> I'm not familiar with the AMD GPU architecture, but for X86 targets having
> the intrinsics guarantee either ftz mode or rounding mode could require, in
> a worst case scenario, that an extra LDMXCSR instruction (or the x87
> equivalent) be inserted before every floating point operation.  We could,
> theoretically, eliminate this instruction in cases where we could prove that
> the mode was already what we wanted it to be, but my preference would be to
> have the front end insert calls to set the FP environment where needed and
> simply allow CodeGen to rely on the fact that the mode was already set.
>
> Does this present a conflict with the needs of the AMD GPU architecture?

This sounds like the same thing to me? The mode register could be reset for
every instruction in the worst case. I thought the proposal for the rounding
mode was to allow specifying the specific rounding mode that operation would
use, as well as whether it should use the dynamic rounding mode.

Are you saying that the lowering for an llvm.contrained.fadd for example with a
known constant round.downward argument would not be responsible for inserting
the mode set instruction?

Quuxplusone commented 7 years ago

(In reply to comment #8)
> Are you saying that the lowering for an llvm.contrained.fadd for example
> with a known constant round.downward argument would not be responsible for
> inserting the mode set instruction?

Yes, that is what I was suggesting.

I have to admit that my thinking on this is very much coupled with how I
imagine it being implemented by a C/C++ front end, but I think something
analogous should be practical with other front ends as well.

The way I imagine it, in C, there are basically two ways that the rounding mode
could be set:

1) A pragma like "STDC FENV_ACCESS ON" is used to enable fenv access and an
explicit call to fesetround() is used to set the rounding mode.

2) A pragma like "STDC FENV_ROUND FE_UPWARD" is used to specify the rounding
mode for a specific scope.

In case 1, I intend that the front end would translate all FP operations using
the constrained intrinsics with the rounding argument set to "round.dynamic"
and (eventually) we'd have a pass that would look for calls to fesetround() or
equivalents thereof and change the rounding mode argument in instructions for
which it could prove the rounding mode was known and constant.

In case 2, it is my understanding that the pragma itself is intended to have
the effect of setting the rounding mode within the scope for which it applies,
but (at least in the standard draft I've been using as reference) it suggests
that a compiler could implement this by inserting explicit calls to
fegetround()/fesetround() in appropriate locations to bracket FP operations
affected by the rounding mode (changing back to the previous mode for calls out
of the scope).  This matches what I had in mind.  I intend that in this case
the front end would insert calls to set the rounding mode and use whatever
constant rounding mode is specified for FP operations within the given scope.

In the general (front end agnostic) scenario, other means of specifying a
rounding mode (such as a command line option) are possible of course, but I
think they could be mapped to one of the two cases above -- either (1) the
rounding mode is unknown to the front end and must be set using an explicit
instruction/call or (2) the rounding mode is known to the front end and the
front end can insert calls/instructions to set it if necessary.

In my mind (with an admitted X86 bias), flush-to-zero can be implemented in the
same way.  Of course, this assumes that there is an instruction of some sort
that is independent of the general FP instructions and can be executed to set
the rounding mode and/or flush-to-zero mode (such as LDMXCSR in the SSE
instruction set).  I suppose an architecture that needed to encode these modes
directly into an instruction sequence in some way could still do that during
ISel even if we did not require such lowering in our language specification for
the intrinsic.

By the way, here's the standard draft I mentioned above: http://www.open-
std.org/JTC1/sc22/wg14/www/docs/n1778.pdf

Quuxplusone commented 7 years ago

(In reply to comment #9)
>

This all matches how I think ftz should also be handled

Quuxplusone commented 7 years ago

(In reply to comment #10)
> (In reply to comment #9)
> >
> This all matches how I think ftz should also be handled

Excellent.

The biggest potential problem I see remaining is how to handle the transition
through ISel.  The implementation in my current patch doesn't preserve the
rounding mode and exception behavior arguments beyond DAG formation.  It could
easily do so, but since my implementation of MutateStrictFPToFP() was going to
ignore them anyway, I just left them out.

My reasoning is that once the target registers that hold the actual FP
environment (e.g. MXCSR) are modeled in the machine instructions there would be
no need for the extra arguments.  This assumes no constant folding is attempted
beyond ISel, or at least that suppressing it when the FP env register is
modeled is acceptable.

However, if you think you will need to know the FTZ mode for instruction
selection, then we'll need to keep at least that argument, and we'll need a
target specific way to hook in to the MutateStrictFPToFP() handling.

Can you please comment on the code review if you think something additional is
needed for your case?

Quuxplusone / LLVMBugzillaTest

We need a way to specify FP denormal behavior on a per-instruction basis #30258