chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

support overflow/underflow flag checking and clearing? #11969

Open mppf opened 5 years ago

mppf commented 5 years ago

As a Chapel user, I'd like to be able to check if my computation has encountered a floating point exception (e.g. divide by zero) and respond to it in my program. I'd like to do this at certain points with a minimum of performance impact.

IEEE 754 provides for floating point status flags that can indicate division by zero, overflow, and underflow. These are exposed to C programs in the fenv.h header. This issue is asking if Chapel can and should support analogues to feclearexcept, fegetexceptflag, feraiseexcept, fesetexceptflag, fetestexcept.

(This issue is not concerned with rounding modes).

This support could combine with the compiler assumptions about NaN/Inf propagation described in #11986.

gbtitus commented 5 years ago

Possibly related: would it be worthwhile for Chapel programmers to be able to initialize real variables with IEEE 754 signalling NaN values? The Cray Fortran compiler has long had this capability as a debugging and testing aid, via the -e i option. However, that sets all uninitialized real variables to signalling NaNs, which I think is overkill because Chapel already guarantees real variables won't have trash in them if they're never assigned to. But it might be worthwhile to be able to initialize a specific real variable (or array) with a signalling NaN, so as to guarantee that the program will halt if it references that specific variable without first assigning to it.

dmk42 commented 5 years ago

Note that due to current LLVM implementation issues (which should go away long term), although checking the flags can be made safe by wrapping the check in an opaque function, clearing the flags may be problematic. LLVM may hoist arithmetic up above the clear operation. For that reason, if we implement this near-term, users will have to be very careful about where in the program they clear the flags.

damianmoz commented 5 years ago

Setting sNaN is fine by me and I would prefer that Chapel guarantee this rather than its current behaviour and preferably inform me where I have been stupid enough to forget to initialize a variable. I thought it was good programming practice to never assume any language predefines anything. I do not think it is overkill. That said, just because a variable contains a signalling NaN does not mean the program will halt if it is used. You can reference/copy one of those safely. But if you use one of them inside a floating point operation, it triggers an IEEE exception, which in turn can trigger a SIGFPE if you have so indicated in advance, and @mppf wants to add something where an IEEE 754 exception also triggers a SIGFPE. Just my 2c.

damianmoz commented 5 years ago

What happens if I do a calculation and assignment purely to trigger underflow or overflow which is never subsequently referenced? The optimizer is likely to throw it away. Does Chapel support the concept of a volatile variable? A quick search of the language specification says there is nothing like it but maybe I was looking for the wrong thing.

mppf commented 5 years ago

What happens if I do a calculation and assignment purely to trigger underflow or overflow which is never subsequently referenced? The optimizer is likely to throw it away. Does Chapel support the concept of a volatile variable? A quick search of the language specification says there is nothing like it but maybe I was looking for the wrong thing.

I don't think we want volatile types/variables in Chapel. Instead, I'd just hope to have a function in a module that is documented to do this thing and that the optimizer knows can have side effects (or this particular side effect).

In particular I'd like to know why you'd want to trigger underflow/overflow with a computation rather than calling a function to directly set the underflow/overflow flag. Is there a problem you see with having a function to trigger the behavior?

damianmoz commented 5 years ago

I can live without the volatile type/variable. But I explicitly do not want a function in a module to do this because it would have jump/return overhead and would cause the optimizer to avoid certain things because it is worried about unspecified side effects. The only side effect is from the computation itself. A simple inline proc like forceEvaluation or even

inline proc guarantee(x : ?T) return x;

that is never optimized out of existence would be OK.

I have never raised a flag with an explicit call in my life and I certainly do not want to start now. I have never seen it done in quality code, certainly not in routines/modules for use by others. It is extremely poor programming practice. I always do it in a totally portable fashion that involves zero function calls to confuse the optimizer. As I said, I could do it with something like

r += (r - r) * (expression that MUST be evaluated)

where the variable r is an active variable in the scheme of things but this is messy, this concept is not always feasible/possible, and I have to rely on the compiler not optimizing it away because it thinks it evaluates to zero.

Just to see from where you are coming, how would you test that a variable 'x' is insignificant to machine precision relative to another variable 'y'? The cleanest way is

if y + x == y then // x is tiny relative to y

I assume the optimizer will handle this?
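
A C sketch of that comparison follows. The function name `is_negligible` is made up for illustration. The `volatile` temporary is an assumption about how to stop a C compiler from simplifying `y + x == y` into `x == 0` under relaxed floating-point options, or from keeping the sum in extended precision.

```c
/* Returns 1 when adding x to y leaves y unchanged in double precision,
 * i.e. x is insignificant relative to y. */
int is_negligible(double x, double y)
{
    /* The volatile store forces the sum to be computed and rounded
     * to double before the comparison. */
    volatile double sum = y + x;
    return sum == y;
}
```

For example, `is_negligible(1e-20, 1.0)` holds because 1e-20 is far below half an ulp of 1.0, while `is_negligible(1e-10, 1.0)` does not.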

damianmoz commented 5 years ago

My previous suggestion of

strictly
{
   var x = expression that must be evaluated;
}

should solve this problem. Here that keyword tells the optimizer to do what I say and not to mess with any expression ordering. It might result in an extra store into memory on the stack but I can live with that overhead.

damianmoz commented 5 years ago

Actually just because something is defined as a var does not mean it gets copied to memory. Also, you could even say const in the above.

mppf commented 5 years ago

But I explicitly do not want a function in a module to do this because it would have jump/return overhead and would cause the optimizer to avoid certain things because it is worried about unspecified side effects.

I have never raised a flag with an explicit call in my life and I certainly do not want to start now. I have never seen it done in quality code, certainly not in routines/modules for use by others. It is extremely poor programming practice. I always do it in a totally portable fashion that involves zero function calls to confuse the optimizer.

Is the issue here just one of performance? We could have a function in the module code that explicitly raises a flag and that is inlined after certain other optimizations occur. This is sort of like your guarantee idea: the point would be that the language doesn't need new keywords and new constructs, but the compiler could still know not to reorder or remove certain expressions. The guarantee idea is just more general, but I would probably prefer to see code that intentionally raises an overflow flag call a function named something obvious like raiseOverflow rather than doing an otherwise unreasonable floating point operation.

r += (r - r) * (expression that MUST be evaluated) I have to rely on the compiler not optimizing it away because it thinks it evaluates to zero.

if y + x == y then // x is tiny relative to y

I assume the optimizer will handle this?

I don't know offhand in either case what existing optimizers (in particular LLVM optimizations) will do here. It's certainly worth studying at some point but at the moment I have to focus on some other things.

mppf commented 5 years ago

On a quick look at LLVM optimization specifically, it looks like:

https://llvm.org/docs/LangRef.html#floating-point-environment

The default LLVM floating-point environment assumes that floating-point instructions do not have side effects. Results assume the round-to-nearest rounding mode. No floating-point exception state is maintained in this environment. Therefore, there is no attempt to create or preserve invalid operation (SNaN) or division-by-zero exceptions.

The benefit of this exception-free assumption is that floating-point operations may be speculated freely without any other fast-math relaxations to the floating-point model.

Code that requires different behavior than this should use the Constrained Floating-Point Intrinsics.

Thus the LLVM manual is suggesting using llvm.experimental.constrained.fadd vs the fadd instruction in some cases. Even the regular fadd instruction has some options we'd presumably want to control: https://llvm.org/docs/LangRef.html#fast-math-flags .

The result here is that we'd need the Chapel compiler to identify which expressions should pay particular attention to the issue and which should not, since it will need to emit different LLVM IR in these cases. The strictly { } idea is one proposal, another is something like the guarantee function to wrap an expression, and a third idea would be to somehow mark functions that need careful attention. Of course we can use compiler flags (like we do now) but these probably aren't great for a real application as some parts of the application will need more careful attention than others.

damianmoz commented 4 years ago

Has anybody had any more thoughts on this long-in-the-tooth issue? Have LLVM enhancements made this any more feasible? How does LLVM handle the C construct

volatile double x = a + b;

where even though x may never appear within that scope ever again, the evaluation itself will be guaranteed to be done. This is more than preserving the exception behaviour of the expression. It is preserving the expression itself when x may never get used ever again.

My underlying requirement was to actually trigger an IEEE 754 floating point exception. Forcing it portably by some arithmetic expression is accepted practice and clean. However, if an expression to force that exception cannot be made to affect the return value in some way, it is likely to get optimized out of the picture. Achieving that effect on the result is often possible, but sometimes it will have an unacceptable overhead.

An alternative is to attack the register itself, which is obviously going to be chipset dependent. This would be invokable by something like

ieee754raise(ieee754except.inexact);

But the routine itself would have to be inlined assembler so that the optimizer did not have to worry about handling variables across a subroutine call. Inline assembler raises even more issues.

damianmoz commented 4 years ago

Not urgent nor critical. Just did not want to lose it.

damianmoz commented 4 years ago

Hopefully, my brain has recovered from the New Year.

That strictly block not only would guarantee that a computation is done but it should also enforce evaluation ordering within an assignment. I am not sure whether or not to demand that it enforce ordering between statements as that would be impractical until such time as Chapel generates its own assembler directly. That said, statements that occur after a strictly block should be forced to happen after. Not sure how to enforce this in the C code that Chapel creates.

That said, as @dmk42 says, clearing of the flags may not be guaranteed to be done in place, i.e. arithmetic may be hoisted to occur before the clearing of the flags, so my concept may still be difficult to implement.

I would like to address the question by @mppf .... in particular I'd like to know why you'd want to trigger underflow/overflow with a computation rather than calling a function to directly set the underflow/overflow flag. Is there a problem you see with having a function to trigger the behavior?

Consider the case where I have suggested

strictly
{
   const inexact = x + x.ypsilon; // ypsilon is 2^(p-1) & p is 52 for real(64)
}

Assuming that a Chapel compiler can be made as smart as a C compiler, this will generate some X86-64 code (with GCC) like

addsd   .LC0(%rip), %xmm0    // add a param and a variable - this triggers the exception
movsd   %xmm0, -8(%rsp)        // stick it on the stack somewhere - it has to go somewhere!

That is the status quo as C would do it.

The clear and concise alternative as suggested by @mppf is to do

ieee754raise(ieee754.inexact);

That code snippet is certainly very clear so I really do like the idea. If Chapel supported this, it would put it ahead of the pack in terms of handling IEEE 754 operations.

But, and it is a big but, I can accept this if and only if I knew this would be done inline and would not compromise the optimizer. For an X86-64, it needs the following code (or equivalent) inline, and must ensure that it was ineffectual on the optimizer, and guaranteed to remain in place:

  1. copy ieee754.inexact into a 32-bit unsigned register
  2. AND said register with 0x3f for safety
  3. store FROM the mxcsr register onto the stack (memory location)
  4. OR said register with the stack (memory location)
  5. load the mxcsr register from the stack (memory location)

The call needs 5 assembler instructions compared to the original case which was 2 instructions. Note that I cheated a bit as the code to force an underflow would be about 4 or 5 instructions although, all in registers, not like the above which has 5 accesses to memory. Yuk!
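
As a rough sketch of that read-modify-write sequence, the SSE intrinsics `_mm_getcsr` and `_mm_setcsr` from xmmintrin.h can stand in for hand-written assembler. This only compiles for x86 targets with SSE. The `mxcsr_*` names and the enum are hypothetical, chosen here to mirror the raise/clear/test trio discussed in this thread.

```c
#include <xmmintrin.h>   /* _mm_getcsr / _mm_setcsr; x86 with SSE only */

/* MXCSR exception flag bits (the low six bits of the register). */
enum {
    MXCSR_INVALID   = 0x01,
    MXCSR_DENORMAL  = 0x02,
    MXCSR_DIVBYZERO = 0x04,
    MXCSR_OVERFLOW  = 0x08,
    MXCSR_UNDERFLOW = 0x10,
    MXCSR_INEXACT   = 0x20,
    MXCSR_ALL_FLAGS = 0x3f
};

/* Set the requested sticky flags: read MXCSR, OR in the bits, write back. */
static inline void mxcsr_raise(unsigned int flags)
{
    _mm_setcsr(_mm_getcsr() | (flags & MXCSR_ALL_FLAGS));
}

/* Clear the requested sticky flags. */
static inline void mxcsr_clear(unsigned int flags)
{
    _mm_setcsr(_mm_getcsr() & ~(flags & MXCSR_ALL_FLAGS));
}

/* Return the subset of the requested flags that is currently set. */
static inline unsigned int mxcsr_test(unsigned int flags)
{
    return _mm_getcsr() & flags & MXCSR_ALL_FLAGS;
}
```

Calling `mxcsr_raise(MXCSR_INEXACT)` followed by `mxcsr_test(MXCSR_INEXACT)` returns a nonzero value. This only covers SSE state and says nothing about whether the optimizer keeps surrounding arithmetic on the intended side of the call.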

Note that we also need the ability to do other things like

ieee754clear(ieee754.inexact);

if ieee754test(ieee754.inexact) == ieee754.exact then 

although whether they have the same restrictions is open for debate.

I think that your idea is a lot more complex than mine because it needs a way to avoid not only a real function call but also to avoid compromising the optimizer's actions. And it is not cheap in terms of cycles/latency. I would love to be proved wrong. You may have ways around this that are far beyond my level of knowledge.

mppf commented 4 years ago

I don't have much to add at the moment. I don't know to what extent the LLVM issues are resolved in the LLVM project. Regarding the ieee754raise, that seems more appealing to me, and the strictly block is actually harder to get right in the compiler before we get to LLVM or C code. Once we are in LLVM or C code, I think the strictly block and the ieee754raise are equally hard. Presumably you would want your strictly block to also be OK for optimization. Getting consistent inlining is not particularly challenging.

damianmoz commented 4 years ago

Can you enlighten me a bit please, so I can better answer your comments? It certainly looks like my strictly concept is too hard. I think I want to understand how to mess with standard/Math.chpl and runtime/include/chplmath.h.

Does declaring something within Chapel as

extern proc thing(x: int(32));

and invoking it as

thing(x);

simply get translated into a call to

thing(x);

in the C output stream?

And if so, if there was a header pulled into the C output stream that includes the definition

inline void thing(int x)
{
    // body of thing
    asm(....);
    // more asm();
}

does the body of thing just work as it is supposed to when thing is invoked?

If so, your earlier idea of a

    guarantee(<floating-point-expression>)

would be easy to implement. And almost ditto for things like

raiseinvalid();
raiseinexact();
raisedivbyzero();
raiseunderflow();
raiseoverflow();

Each of those would be, on an X86-64, 3 assembly instructions, all with memory accesses.

There are issues with a generic

raise754exception(param x : int(32))
raise754exception(const x : int(32))

because the value of x is at the very least a function of the architecture. I do not know how to do that from within Chapel, at least so that it will work for the param case. Then again, I can look at how the INFINITY constant is handled for some guidance.

damianmoz commented 4 years ago

It was forcefully put to me in one of our internal discussions that the implementation of the raising of an exception by the method we are proposing is highly non-portable, even if it is very clear. And messing with assembler is not my forte. And anything out-of-line kills optimization so that is out of the question.

The achievement of the same effect by using an inline version of guarantee, i.e. two routines called guarantee64 and guarantee32 somewhere inside the file runtime/include/chplmath.h is totally portable and uses long accepted, if not so well documented, techniques within C. That does raise questions for when the Chapel compiler creates assembler natively, but that is a long way off.

Food for thought.

mppf commented 4 years ago

simply get translated into a call to thing(x); in the C output stream

Yes, and that call will even be inlined by the C compiler or by LLVM optimizations (depending on which backend is in use).

the implementation of the raising of an exception by the method we are proposing is highly non-portable

I don't understand why that would be, if you already know a way to write it portably in C using an expression along the lines of const inexact = x + x.ypsilon; // ypsilon is 2^(p-1) & p is 52 for real(64)

The achievement of the same effect by using an inline version of guarantee, i.e. two routines called guarantee64 and guarantee32 somewhere inside the file runtime/include/chplmath.h is totally portable and uses long accepted, if not so well documented, techniques within C

If you know how to write these inline functions in C might I suggest that you offer up implementations? It is still unclear to me what exactly they would do.

mppf commented 4 years ago

Also I wouldn't rule out the strictly { } idea just yet; we might discover that the other elements here do not do what we need them to. The main difference between guarantee and strictly seems to me that strictly applies to many statements but guarantee only applies to one expression. If you needed the compiler to not optimize the order of evaluation of certain floating point operations between statements - maybe we'd need to have strictly. Today, we don't limit such optimizations on statement boundaries.

damianmoz commented 4 years ago

Thanks for the feedback on strictly { }. I will leave that in your capable hands.

Dealing with your comment one prior to last, I was obviously unclear. The concept of forcing the expression with guarantee or some other mechanism is perfectly portable. I will send you an implementation later this week and you can tell me where I have gone wrong.

The concept of raising exception flags in an IEEE 754 environment is different for all architectures, and hence not portable. Every implementation I have seen is either written completely in assembler or uses some asm statements embedded in C code. The raise operations on many common chips use logic that is read the status register, modify one or more bits, and then update the status register. But all have subtle differences, some less subtle than others. Some are more succinct than others. My knowledge on the hardware-level floating point status and control idiosyncrasies is limited. And writing raw assembler or embedded assembler in a C program is not my strength. I will send you some ideas on this too although that is more a collation of work by others far more skilled in this area than I.

damianmoz commented 4 years ago

Can we assume that on X86-64, there will never be a need to support non-SSE FP, i.e. 80-bit floating point instructions? Also, I assume that Chapel will never run on an i386.

mppf commented 4 years ago

Dealing with your comment one prior to last, I was obviously unclear. The concept of forcing the expression with guarantee or some other mechanism is perfectly portable.

Ok. But what does this have to do with overflow/underflow flags (the topic of this issue)? Is the idea that one would call guarantee on some expression that raises a floating point exception? If so, I don't think that will necessarily work, with LLVM at least. My understanding of the LLVM documentation is that the "normal" floating point operations don't raise exceptions at all and one has to opt in to instructions that do. Hence something like strictly.

The concept of raising exception flags in an IEEE 754 environment is different for all architectures, and hence not portable.

I'm having trouble understanding this. In the issue description, I mentioned feraiseexcept. Is this standard C function not portable for some reason? Or does using it require some non-portable assumptions?

Can we assume that on X86-64, there will never be a need to support non-SSE FP, i.e. 80-bit floating point instructions? Also, I assume that Chapel will never run on an i386.

I think we can assume these things, yes.

dmk42 commented 4 years ago

Hi. Damian asked me some questions offline but then gave me permission to answer in this issue.

You mentioned wrapping flag testing in opaque functions to stop the optimizer being too aggressive? What did you mean by this?

I meant a precompiled library function whose source code is not available to the back-end compiler at the time the Chapel code is being compiled. For example, in the code below, the function testoverflowflag() would have to be in binary object form to make sure the compiler never hoisted the flag test above the computation on which it depends.

if testoverflowflag(a*b) {
    // handle overflow
}

What I failed to take into account is that the overflow flag might still be contaminated with other computations that are hoisted above a*b. Also, this trick won't work forever anyway, as link-time optimizers get smarter and smarter about examining the object code.

Also you mentioned the optimizer might move arithmetic defined after clearing an IEEE 754 exception to occur before it? Do you care to elaborate?

I was referring to LLVM bug 6050, "Floating-point operations have side effects." You can see more details about the consequences by looking at that bug's listed duplicates.

For the C back end, if the back-end compiler might be Clang, then Chapel cannot guarantee that code won't be inappropriately moved past a setting/clearing of the floating-point flags.

For the LLVM back end, as Michael mentioned earlier, LLVM 5 introduced the constrained floating-point intrinsics, and the Chapel compiler could be made to generate those as a way to enforce the appropriate guarantees.

Will a C-optimizer also ensure that

statement-A;
{
statement-B;
}
statement-C;

will execute as statement-A then statement-B and then statement-C?

In general, no. It will preserve that order if it knows there are dependencies among the statements that require the order to be preserved, but LLVM bug 6050 prevents Clang from knowing that floating-point state counts as a dependency.

damianmoz commented 4 years ago

In answer to what @mppf said, it was a question which popped up when I was considering how best to test for the floating point exception to be triggered by the expression sent to my soon-to-be-working guarantee routine. Apologies for not putting it in the context of the issue. I realized the vagueness of the question as soon as I posted it, which is why I sent a subsequent question to @dmk42 in the background so he could figure out how to better put this in context.

Primarily I was also using some in-line assembler buried in an inline C function in my testing of the C equivalent of what my Chapel is doing. They are drop-in inline replacements for feraiseexcept and fetestexcept, just in case it is deemed worthy of pursuing the concept of explicitly raising an exception.

Note that I am using an INEXACT exception that occurs when you truncate an integer for my testing because I can write a really really simple test case for that. Overflow/Underflow is more complex to generate and I want to focus on the problem.

Between the constrained floating point intrinsics (around which I am trying to wrap my head) and LLVM bug 6050, this is not the simplest of problems to solve. My strictly suggestion was made without realizing how truly difficult it would be to implement within the compiler chain while still letting the optimizer do what it is designed to do.

damianmoz commented 4 years ago

I have guarantee working. Thank you @mppf. Your explanation of how to get it into the runtime was very clear and easy to follow.

With a gcc backend, I have some test cases working without '--fast'. These test whether floating point exceptions are generated by the expressions that are fed to guarantee and which are crafted that they will cause an exception. They work but it is very erratic. I used the LIBC routines which need subroutine calls to both test the exception state or clear the exception state. And I know of no way to check the assembler to see what is really happening.

I also wrote inline replacements for fetestexcept, feclearexcept and feraiseexcept. They use embedded assembler so will run quicker than a precompiled library routine. Also erratic and in some cases worse.

And as soon as I use '--fast' with chpl, it is pretty well a waste of time even testing whether exceptions have occurred. Again, I really need to look at the assembler to see what is going on.

At the very least, I think I have proven that Chapel will raise the exceptions correctly when it should by feeding the guarantee routine specifically crafted arithmetic expressions that I know trigger an exception.

But I think that Chapel's correct testing, clearing, and explicitly raising of IEEE 754 exceptions goes into the to-do list. It is not urgent but it is something that really needs to be done if Chapel is going to play in the big league. I will write some really simple native C test code that I know works and the Chapel equivalent and you can use that in your testing. But not this week.

mppf commented 4 years ago

My strictly suggestion was made without realizing how truly difficult it would be to implement within the compiler chain while still letting the optimizer do what it is designed to do.

I think it's important to consider what language design is desired without too much consideration for how hard it is to implement.

And as soon as I use '--fast' with chpl, it is pretty well a waste of time even testing whether exceptions have occurred.

You would presumably also want to pass --ieee-float for this sort of thing.

At the very least, I think I have proven that Chapel will raise the exceptions correctly when it should by feeding the guarantee routine specifically crafted arithmetic expressions that I know trigger an exception.

At least with GCC.... I think the C compiler used probably has more to do with this than the Chapel compiler.

But I think that Chapel's correct testing, clearing, and explicitly raising of IEEE 754 exceptions goes into the to-do list.

Agreed. We have a lot of work to do in order to get #11335 completed and it hasn't yet reached the top of the work list.

I will write some really simple native C test code that I know works and the Chapel equivalent and you can use that in your testing.

That would be very helpful, thanks.

damianmoz commented 4 years ago

I moved the testing of each individual case out of proc main and into a separate proc test. It appears to remove the unreliable results.

Summarizing

It would appear that the simple expediency of guarantee works. No change to the compiler is needed. Just a little addition to a run-time include file. Nice. Simple. I will keep testing.

The guarantee routine is fed an expression that is crafted to trigger an exception. That expression will never be used again but guarantee must ensure that it is evaluated. This is the only truly portable way of triggering an exception when later parts of the code are not dependent on this particular expression. The alternative is tweaking a floating point status register, which is machine-dependent between X86-64, Sparc64, PowerPC64, ARM, RISC-V, and others, let alone any new players. That said, I have tested this tweaking alternative for X86-64 - see #1 below.

Note that the inline C routine that is called by guarantee relies on using a volatile variable in its one line C implementation.

The tests have been run with and without --fast. The procedure from which guarantee is called has been tested both as inline and as not inline. I have run the tests against the precompiled routines fetestexcept and feclearexcept from LIBC and my own inline replacements which contain embedded asm statements to tweak bits in the Xeon's SSE mxcsr register.

I will post the source code once I improve the formatting of my results. My example is not exhaustive but it is simple. All I am doing is what the trunc routine does, making it really easy to check the results. When truncating a real(w) which has no integral representation, e.g. 5.7, the trunc routine was recommended, but not mandated, to raise an INEXACT exception - see #2 below.

1:

Tweaking the floating point status register needs its own discussion relating to the fact that the underlying floating-point state is a dependency, such a side-effect having all sorts of compiler optimization issues. Additionally, most architectures seem to have a 40%-400% overhead when tweaking the register directly. I will do a discussion memo on this topic in the future but it will be along the lines of what is needed in an optimizing compiler rather than the numerics. And #11335 is not at the top of the pile yet so I do not have to do it tomorrow.

2:

I used INEXACT instead of OVERFLOW and UNDERFLOW for two reasons. Firstly, it is easier to create a simple example. Secondly, while there are cases when an approach like guarantee is the only solution, I have found (with some painstaking research I might add) that for my original test case I can generally 'bury' the arithmetic I need to trigger an OVERFLOW or UNDERFLOW into the result so I do not need guarantee. Some purist library writers detest even a small overhead for OVERFLOW and UNDERFLOW exception creation and having an overhead anywhere will always have its detractors among users. For the super fussy, the IEEE 754 standard changed the scenario for trunc so it has, for the last 2 months since the new standard was released, said not to raise an INEXACT exception for trunc but that is irrelevant to me doing so in my test case.

damianmoz commented 4 years ago

@mppf, files sent as a zip file directly to you. My browser only lets me upload .pdf and .txt files in the attachment window and they are a bit long to include in the body.

Files are a documented almost self-contained test program, the runtime/include/chplmath.h extensions as a diff, a nicely formatted version of that test program as a PDF, and the latest ieee754.chpl module.

I am sure this is not the end of the story as things may break under more aggressive optimization. And there may be style and nomenclature issues at the very least. But it is a starting point.

mppf commented 4 years ago

@damianmoz - thanks for your example test program. I changed the diff into a simple header file mymath.h (removing the non-C bits) and then compiled it with e.g. chpl trunc.chpl mymath.h. I verified that the test program produced the same output you sent using the current master branch in Chapel.

I ran into two issues:

  1. I needed to add use IO; to the top of the test program due to recent changes
  2. I saw warnings-as-errors like this from the LLVM backend configurations, which I disabled with --ccflags -Wno-unused-variable:
    error: unused variable 'tsave' [-Werror,-Wunused-variable]
        volatile float tsave = t;

mppf commented 4 years ago

A syntactical alternative to strict/strictly would be to attach floating-point reordering controls to a generalized attribute syntax (#14141).