chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Compiler Built-in Mathematical routines - Long Term - Not Urgent #21043

Open damianmoz opened 1 year ago

damianmoz commented 1 year ago

Introduced not to make more work in the short term but to raise the issues involved so that they are considered in decisions made going forward.

Modern C compilers try to treat several fundamental mathematical routines as 'built-in'. These are the routines with the following functionality (and 64-bit draft C23 name):

    FMA - fused multiply and add (fma)
    ABS - absolute value of a floating point number (fabs)

    square root (sqrt)

    truncate towards zero (trunc)
    round to nearest with ties away from zero (round)
    round to nearest with ties to even (roundeven)
    round towards positive infinity (ceil)
    round towards negative infinity (floor)
    round according to the current rounding direction (rint)

    minimum of two floating point numbers of one or more flavours (fmin)
    maximum of two floating point numbers of one or more flavours (fmax)

    transfer the sign of one floating point number to another (copysign)
    get the negative bit of a floating point number (signbit)

These compilers implement such routines with a subroutine call using either a special primitive as would likely be the case with ABS and FMA, or the far simpler expedient of using a header file containing an inline C routine with (hopefully minimalist) embedded assembler, something really feasible only with more recent versions of the C language standard.

There are other routines that arguably could also be in that list:

    split a floating point number into an exponent and a signed factor

    ramp function (fdim) or some other sort of Heaviside function

    scale a floating point number by the radix raised to an integral power

        round to nearest with ties to odd

    inverse square root (rsqrt)

    other flavours of the minimum of two floating point numbers
    other flavours of the maximum of two floating point numbers

A flavour of the first of these is the C routine frexp, a routine that in the opinion of some does not fit modern needs, not least because it reflects floating point numbers of the 1970s!! The functionality of the last three is recommended by the latest IEEE 754 standard and appears in drafts of the next C standard.

That supplementary list is not exhaustive and deliberately does not include the routines that work with floating point exceptions and other aspects of the floating point environment. They are a whole new ball game, especially in the context of LLVM.

Long term, does Chapel simply try to leverage what the C standard provides, which is dictated by what is standardized in C17 or C23, or does it exploit its own more powerful (and arguably simpler) features and handle builtins itself? Sometimes the quality of the routine that you get in a C library is sub-optimal and it would be good to avoid this. For example, the glibc version of the scaling noted above is arguably nearly 3 times slower than it needs to be.

There is at least one bigger issue here. Chapel is yet to address fused multiply/additions, something that in my humble opinion only the Rust language has done rigorously, consistently and thoroughly. So that needs to be considered here. Some ideas on this are discussed in #11335.

Food for thought!! And discussion. Not sure if this needs multiple issues. Its content will overlap (to some extent) other issues but the focus here is how to provide the aforementioned functionality such that any subroutine call overhead is avoided and optimal performance is achieved (at the expense of code).

bradcray commented 1 year ago

@damianmoz : Making sure I'm digesting this appropriately, I think what you're asking for is not (necessarily) a change in the interfaces of routines in your list above (when they exist and don't have other issues), but just an implementation strategy that will avoid function call overhead? Or, put another way, the user would generally be unaware of the change in this issue other than from the perspective of getting good performance? (again, ignoring cases we don't really have a story for currently like FMA)

damianmoz commented 1 year ago

Correct. No change in interface.

Yes to the implementation strategy question and yes to users would be unaware of the change other than from the perspective of getting better performance. And yes also to your comments about FMA. I am assuming ABS is handled properly already by Chapel and avoids function call overheads.

... At least from the perspective of LLVM which has an extern block. The strategy I have tested would not work with the GCC back end because (I do not believe) there is a way to potentially pull in an inline C function inside an inline Chapel proc in something like Math.chpl which is then used inside a Chapel program. So there is a problem to be solved unless Chapel moves to LLVM totally but LLVM has other issues related to floating point things that still seem a long way off fixing.

I cannot figure out exactly how things are done currently but I am certain Chapel has the hooks already to get it right. If C had possessed inline routines and supported generic programming 40 years ago, it would already be doing it in a way which was optimal for the routines it was already handling and was trivially extensible for the routines it has to handle into the future.

damianmoz commented 1 year ago

This is the structure which works with the LLVM backend.

extern
{
        #include        <math.h>
        #include        <ieee754.h>

        inline C_small_routine (....)
        {
                ....
        }
}

inline proc Chapel_routine(....)
{
        ....
        var z = C_small_routine(....);
        ....
        return .....;
}

....
        var x = Chapel_routine(....);

I have used this to implement the inline equivalent of the C23 routine roundeven and provide an inline version of the existing round routine which rounds to nearest with ties away from zero. For both real(64) and real(32).

I know of no way of achieving this with the C compiler backend.

bradcray commented 1 year ago

I would expect that approach to work with the C backend as well, as long as the compiler was built with LLVM enabled (so, CHPL_LLVM=system or bundled but CHPL_TARGET_COMPILER set to something other than llvm like gnu or intel). Are you finding otherwise?

To support a non-LLVM-enabled compiler, I think the approach would be to move the code in the extern block to a .h file, to require that .h file from the Chapel source (or name it on the chpl command-line), and to provide an external declaration of it, like extern proc C_small_routine(...): .... That should cause the .h to be compiled in with the generated code and produce the same inlined behavior.

damianmoz commented 1 year ago

Silly me. I will get my act together and try it. Thanks for the words of wisdom.