Open damianmoz opened 1 year ago
@damianmoz : Making sure I'm digesting this appropriately, I think what you're asking for is not (necessarily) a change in the interfaces of routines in your list above (when they exist and don't have other issues), but just an implementation strategy that will avoid function call overhead? Or, put another way, the user would generally be unaware of the change in this issue other than from the perspective of getting good performance? (again, ignoring cases we don't really have a story for currently like FMA)
Correct. No change in interface.
Yes to the implementation strategy question and yes to users would be unaware of the change other than from the perspective of getting better performance. And yes also to your comments about FMA. I am assuming ABS is handled properly already by Chapel and avoids function call overheads.
... At least from the perspective of LLVM, which has an extern block. The strategy I have tested would not work with the GCC back end because (I believe) there is no way to pull in an inline C function inside an inline Chapel proc in something like Math.chpl which is then used inside a Chapel program. So there is a problem to be solved unless Chapel moves to LLVM totally, but LLVM has other issues related to floating point that still seem a long way from being fixed.
I cannot figure out exactly how things are done currently but I am certain Chapel has the hooks already to get it right. If C had possessed inline routines and supported generic programming 40 years ago, it would already be doing it in a way which was optimal for the routines it was already handling and was trivially extensible for the routines it has to handle into the future.
This is the structure which works with the LLVM backend.
extern
{
  #include <math.h>
  #include <ieee754.h>
  inline C_small_routine (....)
  {
    ....
  }
}
inline proc Chapel_routine(....)
{
  ....
  var z = C_small_routine(....);
  ....
  return .....;
}
....
var x = Chapel_routine(....);
I have used this to implement the inline equivalent of the C23 routine roundeven and to provide an inline version of the existing round routine, which rounds to nearest with ties away from zero, for both real(64) and real(32).
I know of no way of achieving this with the C compiler backend.
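For readers following along, here is a minimal sketch of the kind of small inline C routine that could sit in such an extern block. It is not the routine described above: the name sketch_roundeven is hypothetical, and the code assumes the default round-to-nearest mode, double arithmetic evaluated in double precision, and no fast-math style reassociation.

```c
#include <math.h>

/* Hypothetical name; an illustration, not the routine from the thread. */
static inline double sketch_roundeven(double x)
{
    const double two52 = 0x1p52;   /* 2^52: doubles at or above this are integers */
    double ax = fabs(x);

    /* Already integral, infinite, or NaN: return the argument unchanged. */
    if (!(ax < two52))
        return x;

    /* Adding 2^52 leaves no room for fraction bits, so the addition rounds
       to an integer, with ties going to even in the default rounding mode. */
    double r = (ax + two52) - two52;

    /* Restore the sign, so that for example -0.25 gives -0.0. */
    return copysign(r, x);
}
```

The only library names used are fabs and copysign, which GCC and Clang typically expand inline, so the whole routine can usually be inlined without a call.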
I would expect that approach to work with the C backend as well, as long as the compiler was built with LLVM enabled (so, CHPL_LLVM=system or bundled, but CHPL_TARGET_COMPILER set to something other than llvm, like gnu or intel). Are you finding otherwise?
To support a non-LLVM-enabled compiler, I think the approach would be to move the code in the extern block to a .h file, to require that .h file from the Chapel source (or name it on the chpl command-line), and to provide an external declaration of it, like extern proc C_small_routine(...): ... . That should cause the .h to be compiled in with the generated code and produce the same inlined behavior.
Silly me. I will get my act together and try it. Thanks for the words of wisdom.
Introduced not to make more work in the short term but to raise the issues involved so that they are considered in decisions made going forward.
Modern C compilers (try and) treat several fundamental mathematical routines as 'built-in'. These are those with the functionality (and a 64-bit draft C23 name) as follows:
These compilers implement such routines without a subroutine call, using either a special primitive, as would likely be the case with ABS and FMA, or the far simpler expedient of a header file containing an inline C routine with (hopefully minimalist) embedded assembler, something really feasible only with more recent versions of the C language standard.
There are other routines that arguably could also be in that list:
A flavour of the first of these is the C routine frexp, a routine that in the opinion of some does not fit modern needs, not least because it reflects floating point numbers of the 1970s!! The functionality of the last three is recommended by the latest IEEE 754 standard and appears in drafts of the next C standard.

That supplementary list is not exhaustive and deliberately does not include the routines that work with floating point exceptions and other aspects of the floating point environment. They are a whole new ball game, especially in the context of LLVM.
Long term, does Chapel simply try to leverage what the C standard provides, as dictated by what is standardized in C17 or C23, or does it exploit its own more powerful (and arguably simpler) features and handle builtins itself? Sometimes the quality of the routine that you get in a C library is sub-optimal, and it would be good to avoid this. For example, the glibc version of the scaling noted above is arguably nearly 3 times slower than it needs to be.
There is at least one bigger issue here. Chapel is yet to address fused multiply/additions, something that in my humble opinion only the Rust language has done rigorously and consistently and thoroughly. So that needs to be considered here. Some ideas on this are discussed in #11335.
Food for thought!! And discussion. Not sure if this needs multiple issues. Its content will overlap (to some extent) other issues but the focus here is how to provide the aforementioned functionality such that any subroutine call overhead is avoided and optimal performance is achieved (at the expense of code).