chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Chapel floating point and IEEE 754 #11335

Open mppf opened 5 years ago

mppf commented 5 years ago

IEEE 754 specifies floating point behavior for programs. However it's up to a language how much of IEEE 754 to follow, and how to map that document's rules to program statements.

To what extent does Chapel need to be strict about floating point optimizations?

See also PR #1593 for previous discussion, and PRs #11322 and #11332, which adjust LLVM optimization levels for floating point.

Summary of discussion:

dmk42 commented 5 years ago

My opinion is that full IEEE compliance should absolutely be available. It could be the default but need not be as long as it is available. Beyond that, I'm agnostic about where to draw the line.

mppf commented 5 years ago

@dmk42 - I might be misunderstanding, but I don't think IEEE 754 defines how Chapel statements correspond to floating point operations. Wouldn't we have to define that in order for "strict floating point" to have any meaning?

dmk42 commented 5 years ago

How to map arithmetic operations to Chapel operators seems pretty obvious. The only other thing I can think of that you might be referring to would be whether or not to honor parentheses within a statement when in IEEE mode. I think we ought to have that as a long-term goal, though it needn't be high priority.

Was there something else you had in mind?

mppf commented 5 years ago

There are lots of questions. If the questions and answers are obvious to you, maybe you can write them down in more detail?

Anyway here are a few to get started. Suppose a, b, c, d, ... are floating point numbers.

If I have

var x = a*b + c;

I'd imagine we could do a fused multiply-add. What if it were

var x = a*b;
x += c;

can we do a fused multiply-add now?

Or what if we have

var Array:[1..n] real;
for x in Array {
  x = a*b*c + x;
}

Can we pull a*b*c out of the loop? Is the answer the same if it is this:

var Array:[1..n] real;
for x in Array {
  x = x + a*b*c;
}

?

What if it's addition instead of multiplication?

var Array:[1..n] real;
for x in Array {
  x = x + a + b + c;
}

Put another way, even if we choose to support an "IEEE" mode, I don't think it's obvious how Chapel statements turn into CPU operations / instructions. After all, that's why the compiler is non-trivial, isn't it?

Beyond all that, I don't know what goals the authors of IEEE 754 had in mind, but I'd hope for deterministic program behavior on different architectures. There are problems achieving that, though, because whether the compiler keeps something in memory vs. a floating point register, or whether it decides to vectorize, can impact the precision of the floating point operations. (Intel x87 floating point operations work at 80 bits, right?) I think it would be pretty much insane if our IEEE mode rules dictated when a variable could be stored in a register. For this reason I tend to view the goal as "do the best we can with the hardware we're running on" rather than matching a standard.

dmk42 commented 5 years ago

OK, thanks. That helps. I can see that I was making assumptions in a couple of areas that maybe you weren't.

I was assuming quiet floating-point exceptions (no traps). With traps, a strict IEEE mode might have to be extra careful about when it can do loop-invariant code motion (LICM). When FP exceptions just quietly set a flag, LICM is a normal thing to do.

I was also assuming that we should not have IEEE requirements that are incompatible with C's IEEE mode (Annex F of the C standard), because C is one of our back ends. We probably don't need to support FENV_ACCESS (accessing the floating-point environment to read exception status, for example) until such time as we have a user request for it, which may never come. That means we can do algebra within a single statement, even in IEEE mode, except for a few corner cases that would not handle NaNs properly if reorganized. Just to pick one from subclause F.9.2 of the C standard:

The transformation 0 * x -> 0.0 would give the wrong answer if performed within strict IEEE mode because the two expressions are not equivalent when x is a NaN, infinite, or negative zero.
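
For illustration, a minimal Chapel sketch of why those two expressions differ (the variables are just for the example; dividing by a variable rather than a literal keeps the compiler from folding the expressions):

    var zero = 0.0;
    const inf = 1.0 / zero;    // +inf
    const nan = zero / zero;   // nan
    writeln(0.0 * inf);        // nan: 0 * inf is an invalid operation
    writeln(0.0 * nan);        // nan: NaN propagates
    writeln(0.0 * (-1.0));     // -0.0: the sign of zero is preserved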

I mentioned parentheses earlier. That's a Fortran thing. It seemed natural to me for a Chapel strict-IEEE mode. To remain compatible with the C back end if we take that approach, we would need to break each parenthesized subexpression into its own C statement.

It makes sense to allow fused multiply-add within the same statement. Most target C compilers will also allow it across statements even in IEEE mode, so to maintain compatibility with one of our back ends, we probably should allow it across statements too.

mppf commented 5 years ago

@dmk42 - it sounds like you think the fused multiply-add and the LICM-of-the-multiplication cases should be allowed, but what about this one:

What if it's addition instead of multiplication?

Isn't there a concern that LICMing a + b + c would change the order of operations, i.e. reassociate?

Also can you provide an example of what you mean with the parentheses?

You mentioned no traps. What about changing rounding modes? Is that something we'll support, and if so what does it look like?

Lastly, I don't have IEEE 754 in front of me, but Wikipedia summarizes its impact on expression evaluation in this way:

The standard recommends how language standards should specify the semantics of sequences of operations, and points out the subtleties of literal meanings and optimizations that change the value of a result.

I.e. there is still something we'd have to do in our language specification if we want to claim IEEE 754 support, isn't there?

dmk42 commented 5 years ago

Order of operations is different from order of computation (associativity).

var Array:[1..n] real;
for x in Array {
  x = x + a + b + c;
}

All the + operators have the same precedence. We can reorder those at will. That's why LICM works for this. The Fortran parenthesis rule is that if we say

  x = (x + a) + b + c;

then we are putting limits on the reordering that we can do. Another way to say this is the following.

  x = x + a;
  x = x + b + c;  // separating into two statements only makes a difference in IEEE mode
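
To make that concrete, here is a small Chapel sketch showing that reassociating + changes the result; 2.0**53 is where a 64-bit real's ulp becomes 2.0, so adding 1.0 twice is lost while adding 2.0 once is exact:

    var a = 2.0**53;                              // a + 1.0 rounds back to a
    writeln((a + 1.0) + 1.0 == a + (1.0 + 1.0));  // false: a vs. a + 2.0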

If we decide to support changing rounding modes, then each change creates a barrier beyond which floating-point code cannot be moved. The reason is that most floating-point units serialize on the mode change. I've heard that there are machines that encode the rounding mode into the instruction instead of setting some state, but I haven't actually seen one. In C, the rounding mode is manipulated by calling fegetround() and fesetround(), which require the following pragma to be set.

#pragma STDC FENV_ACCESS ON

This is part of what I mentioned earlier. Chapel support for FENV_ACCESS is probably best deferred until we get a user request for it. It is a lot of work for something that might not be part of what a Chapel user wants to manipulate. If we have a solid use case for it, though, we should go ahead and implement it.
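
If we ever did support it, rounding-mode access through Chapel's C interop might look roughly like the following sketch (the extern declarations are illustrative rather than an existing Chapel API; fesetround and the FE_ constants are the standard C99 ones):

    use CTypes;                  // for c_int (SysCTypes in older Chapel versions)
    require "fenv.h";
    extern const FE_UPWARD: c_int;
    extern const FE_TONEAREST: c_int;
    extern proc fesetround(mode: c_int): c_int;

    var one = 1.0, three = 3.0;  // runtime values, so the divides are not folded
    fesetround(FE_UPWARD);
    const up = one / three;
    fesetround(FE_TONEAREST);
    const near = one / three;
    writeln(up > near);          // expected true, but without FENV_ACCESS the
                                 // optimizer may CSE the two divides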

Yes, it would be a good idea to document the changes in behavior that occur in IEEE mode, similar to what C has done with its Annex F.

mppf commented 5 years ago

Does Fortran allow reordering across statements unless parentheses are used? Would you want both the parentheses rule and a statement rule?

dmk42 commented 5 years ago

The parenthesis rule doesn't provide more functionality than a statement rule, so we don't have to have it. It seemed natural to me for the Chapel community where it isn't natural for C, but we could do without it. Now that you mention it, I'm not sure how far Fortran goes beyond the parenthesis rule, or whether any such requirement is in place outside of an IEEE-compliant mode.

dmk42 commented 5 years ago

By the way, another good reason to defer FENV_ACCESS support until later is that LLVM has a bug where it will move floating-point operations across points in the code that change the floating-point environment. So, for example, setting the rounding mode sometimes works and sometimes doesn't in LLVM, because the operation you're trying to affect might move inappropriately.

mppf commented 5 years ago

@damianmoz - I understand you're interested in this discussion.

damianmoz commented 5 years ago

OK, I'm starting to understand this.

My general statement is that if Chapel does not honor as much of IEEE754 as do Fortran or C/C++, then this seriously affects its acceptance as a viable alternative.

My attitude to ordering is that I never assume anything. I have two arbitrary-length FFT programs doing complex arithmetic, my own written in C++ and the other in Fortran, with exactly the same ordering of statements. I get different results because complex arithmetic is done differently in each. The error in any element of the FFT'd vector is always under N*epsilon, where N is the length of the vector being FFT'd, so I know my C++ version of the Fortran is correct. My attitude is that I have to live with that level of difference. They are both IEEE754-compliant codes, but that still does not guarantee they get the same result.

I like to access my FE environment occasionally, so having FENV_ACCESS work properly at some stage would be good ..... please. But within the code, my usage is more to reset/change exceptions, not to chop and change rounding modes. Anybody who switches rounding modes within a single execution run should know what to expect and realise that it will serialize at those points, so such serialization is acceptable. Or have I missed the point? Can Chapel not detect the need to serialize at that point?

I would like to assume that if I have two sequential statements that are not told to run in parallel, their order is respected. I mess with statement ordering myself and look at the assembler to see which statement ordering results in the best code. But when it comes to how you order the operations within a given statement, as long as you respect my brackets, the rest is the compiler's option. I have never thought of the parentheses rule as being IEEE754-related; I thought it predated the standard.

Are brackets in a single statement currently respected by Chapel?

As far as FMA goes, can you make the use of it, or otherwise, a compiler option?

I hope no optimizer ever finds that I personally have left a loop-invariant expression inside a loop, so LICM should have no work to do on my code (famous last words). Or does LICM start to assume a lot more importance across multiple locales, or when there is any other form of parallelism involved? Or is this a question about LLVM (about which I know zip)?

I hope the above comments were considered and constructive.

mppf commented 5 years ago

Just trying to summarize my understanding of the comments by @damianmoz and @dmk42:

@damianmoz - regarding this:

But within the code, my usage is more to reset/change exceptions, not to chop and change rounding modes.

Can you say more about what you do with floating point exceptions? In particular, when the FPU runs into an invalid operation (e.g. divide by zero), my understanding is that in C one can choose among roughly the following programming models:

  1. Ignore the exception: the operation quietly sets a status flag and produces a NaN or Inf that propagates through the computation.
  2. Test and clear the status flags explicitly (fetestexcept / feclearexcept, under FENV_ACCESS).
  3. Enable trapping, so that the faulting operation raises SIGFPE.

Which of these is interesting to you? (Some of these are harder to make reasonable in Chapel than others, because of many tasks/threads).

damianmoz commented 5 years ago

MPPF - Can you have a look at the following and maybe change your first statement?

I would expect that the results of a trivial expression done on any two IEEE754 compliant systems in any language (or languages) of say

    var x = y * z; // where y and z are valid IEEE754 numbers

is identical, bit for bit, for the same precision of x, y and z. DMK42 might like to correct me but I would expect this.

I would also expect that the results of a complex chunk of code done on any two IEEE754-compliant systems in any language (or languages) would still fall within the bounds of an error analysis, i.e. if the error in some variable f after computations at some point in the code is, say,

    35 * epsilon

then the values of the IEEE754 floating point computation of f at the same point in the code in any language(s) on any system(s) should satisfy

    | f(done on system A) - f(done on system B) | <= 35 * epsilon

More to come

damianmoz commented 5 years ago

I assume that something like

    var n = (x - x) / (x - x); // x is a valid IEEE754 number
    var i = 1.0 / (x - x); // 1.0 / 0.0 should be rejected

will yield the variable n as NaN and the variable i as Inf. I do not want it to be anything else.

I always assume that the FP exceptions are stored in an FP status word and can be queried / cleared by fetestexcept & feclearexcept or whatever Chapel wants to call them.

I have not used SIGFPE traps for ages so I am unqualified to answer. That is not IEEE754, is it? Maybe @dmk42 has a better answer. (Sorry @mppf, I cannot seem to get GitHub people links to work).

damianmoz commented 5 years ago

Looks like the links work once I hit comment, but they do not show up in either the Write or Preview window.

damianmoz commented 5 years ago

My comment on traps is wrong. The IEEE 754 standard recommends optional exception handling in various forms, including the presubstitution of user-defined default values, traps (exceptions that change the flow of control in some way), and other exception-handling models that interrupt the flow, such as try/catch. The traps and other exception mechanisms are optional. So @mppf, you are correct that an option to allow an FP exception to trigger a trap is IEEE754-compliant, although it is not mandated. I should have checked more thoroughly before I wrote my earlier comment.

mppf commented 5 years ago

@damianmoz - No worries. I'm more interested in your perspective as a user of these features than I am in what is or isn't in IEEE754. I think IEEE754 actually allows quite a bit of flexibility (but I may be wrong about that) for implementations to be "compliant". Either way, it's important for us to understand what are the most important elements as we go towards making language and implementation choices.

I have not used SIGFPE traps for ages so I am unqualified to answer.

My question is - would you want to use SIGFPE traps with Chapel? Or can we consider this element unnecessary for the time being?

(By the way, you can edit your previous comments on GitHub: click the ... in the upper right of your comment.)

mppf commented 5 years ago

@damianmoz - having the computation's results fall within certain error bounds is certainly a reasonable goal, but I'm not so sure the "trivial expressions are identical" idea can work.

I would expect that the results of a trivial expression done on any two IEEE754 compliant systems in any language (or languages) of say

   var x = y * z; // where y and z are valid IEEE754 numbers

is identical, bit for bit, for the same precision of x, y and z. DMK42 might like to correct me but I would expect this.

First, it's unclear to me whether you meant that the result should be identical on different hardware. But in the particular case of x86 and the related floating point unit x87 the hardware presents a fundamental challenge to such trivial computations. My understanding of the situation is this (sources described below in [1] and [2]):

  1. Floating point numbers are stored in 80 bit registers but truncated to 64 bits when stored in memory
  2. These 80-bit registers are used by default, even for C double which is 64-bits in memory. The programmer does not "opt in" to the higher precision.
  3. That means that compiler optimization impacts the floating point result, because a less optimizing compiler might store everything in memory, while a better one would keep most of the computation in registers.
  4. Vector floating point instructions in x86 have 64-bit rather than 80-bit precision even as registers. This means that a compiler that chooses to vectorize a loop will cause that loop to compute with different precision than the un-vectorized loop.

Due to these factors, I don't expect that different existing languages and systems claiming IEEE 754 compliance will give bit-identical results for trivial computations.

[1]: see https://en.wikipedia.org/wiki/X87

By default, the x87 processors all use 80-bit double-extended precision internally (to allow sustained precision over many calculations, see IEEE 754 design rationale). A given sequence of arithmetic operations may thus behave slightly differently compared to a strict single-precision or double-precision IEEE 754 FPU

[2]: see https://en.wikipedia.org/wiki/SSE2

If codes designed for x87 are ported to the lower precision double precision SSE2 floating point, certain combinations of math operations or input datasets can result in measurable numerical deviation, which can be an issue in reproducible scientific computations, e.g. if the calculation results must be compared against results generated from a different machine architecture. A related issue is that, historically, language standards and compilers had been inconsistent in their handling of the x87 80-bit registers implementing double extended precision variables, compared with the double and single precision formats implemented in SSE2: the rounding of extended precision intermediate values to double precision variables was not fully defined and was dependent on implementation details such as when registers were spilled to memory.

damianmoz commented 5 years ago

On the question of SIGFPE, the answer is YES, I need it. Our own team uses it all the time. They use C++ (and occasionally modify old Fortran), but they will start using Chapel a bit/lot more in 2019/2020.

Can we just call the routines from 'fenv.h' from within Chapel using some external C linkage?
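
In principle, something like the following sketch should work (the declarations are illustrative; fetestexcept and feclearexcept are the standard C99 routines, and the optimizer caveats discussed below still apply):

    use CTypes;
    require "fenv.h";
    extern const FE_DIVBYZERO: c_int;
    extern proc feclearexcept(excepts: c_int): c_int;
    extern proc fetestexcept(excepts: c_int): c_int;

    var zero = 0.0;
    feclearexcept(FE_DIVBYZERO);
    var i = 1.0 / zero;                  // sets the divide-by-zero status flag
    if fetestexcept(FE_DIVBYZERO) != 0 then
      writeln("divide-by-zero flag is set, i = ", i);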

dmk42 commented 5 years ago

@damianmoz - I'm not officially in today, but noticed this question going by and thought I had better save you some headaches. If you call the fenv.h routines, you'll want to make sure your target compiler is gcc. LLVM (and therefore clang) is not yet able to stop its optimizer from using code transformations that move floating-point code past fenv.h calls.

damianmoz commented 5 years ago

Thanks @dmk42. Yes, I use a GCC backend day-to-day for the moment, although that will change shortly when I upgrade that system. I figured LLVM might be doing as you say, which I guess is why @mppf is asking questions. That said, when writing low-level stuff where I really worry about IEEE754, I always read the assembler, and if I do not like it, I rewrite the upper-level code to get what I want. Also, I normally mess with fenv stuff sufficiently far away from where it impacts, and generally in a separate module. Hence I do not expect to have the problem to which you refer, even with LLVM. Even so, I would still check the assembler.

damianmoz commented 5 years ago

@mppf, I am worried that you and I are talking to each other from opposite sides of a little stream in a forest. The following may be total rubbish, but anyway, here goes ...

My understanding of the situation is this (sources described below in [1] and [2]):

  1. Floating point numbers are stored in 80 bit registers but truncated to 64 bits when stored in memory

More like (as per your reference)... Up until the Intel 586, to work on floating point numbers in memory, they would first be copied to an 80-bit register. These 80-bit registers are manipulated by x87 instructions. When data is copied back into memory, these 80-bit quantities are truncated or rounded (as per the rounding mode active at the time) to 64 bits.

What compilers do this by default these days, please, and on what architectures? Can you show me the assembler and the options you used to get this? Are you reading old books/manuals?

For ages, with gcc/g++ or ICC, you need to explicitly ask for that behaviour on an X86. In a C/C++ program, if you deliberately use any 'long double' (96-bit on X86) floating point numbers, it will use 80-bit instructions to work with those but that is understandable because nothing else will.

  2. These 80-bit registers are used by default, even for C double which is 64-bits in memory. The programmer does not "opt in" to the higher precision.

Long since not the case, except on the i586 and below. A program has been able to opt out for nearly two decades in every compiler I have used. In fact, just now, the only way I could 'opt in' to that behaviour in gcc was to use

-mfpmath=387

I am old enough to have had to come to grips with the change from the default behaviour you describe to what is the case today. Around 2000, I saw my own programs get different results from what they had previously produced on the same hardware.

I have no idea what LLVM does.

  3. That means that compiler optimization impacts the floating point result, because a less optimizing compiler might store everything in memory, while a better one would keep most of the computation in registers.

A user would have to explicitly ask at invocation of 'chpl' to use extended precision temporaries, which these days are only available on x86-64-type architectures. I think Freescale threw them out of their Motorola 68k descendants ages ago. And I know of no other hardware which has them.

I wrote

proc multiple(x : real, y : real) return x * y;

in the vain hope of looking at Chapel's assembler, and had a heart attack at the C code, let alone trying to compile it to assembler.

  4. Vector floating point instructions in x86 have 64-bit rather than 80-bit precision even as registers. This means that a compiler that chooses to vectorize a loop will cause that loop to compute with different precision than the un-vectorized loop.

In my C++ code, I always get the same results from a vectorized or unvectorized loop because I never ask for the old X87 floating point instructions. Sure, ask for the old instruction set and I agree it will be different.

Throw the old 387 instructions out of your brain. Think SSE2 or above.

If a compiler generates the instructions to multiply two 64-bit double precision numbers together in double precision registers, which for an Intel is

mulsd   %xmm1, %xmm0

and do the equivalent for, say, a SPARC or an ARM, and those two registers start with the same data, then all three are guaranteed to generate the same answer as long as you have the same rounding mode active.

Similarly, if you start with 2 numbers in (80-bit) extended precision registers, say for a Motorola 68040 and a Xeon using x87 floating point instructions, you will get the same result in an extended precision multiply.

But in the particular case of x86 and the related floating point unit x87 the hardware presents a fundamental challenge to such trivial computations.

I think you need to be able to say categorically that Chapel will honor the results of an error analysis, at least as long as certain compiler options are active. Note that using 80-bit registers in X87 instructions should still achieve that. But why use/generate such ancient instructions?

My earlier statements said 'same precision'. I do not consider an 80-bit register capable machine as having the same precision hardware as a machine with 64-bit-only registers.

mppf commented 5 years ago

I am worried that you and I are talking to each other from opposite sides of a little stream in a forest.

It's not so bad, some things here are just obvious to you but unknown to me. I'm just somebody who ended up talking to you about it - that doesn't mean I have the details right...

My earlier statements said 'same precision'. I do not consider an 80-bit register capable machine as having the same precision hardware as a machine with 64-bit-only registers.

This alone would address my concern. I was assuming that in the context of Chapel, real(32) and real(64) etc. specify the precision, which isn't quite what happens with this 80-bit business.

Up until the Intel 586, to work on floating point numbers in memory, they would first be copied to an 80-bit register. ... For ages, with gcc/g++ or ICC, you need to explicitly ask for that behaviour on an X86.

Thanks for pointing out this is no longer common practice. It also helps me understand something that I didn't before. I've seen non-vectorized X86-64 floating point using XMM registers and wondered why they didn't use the "regular" floating point registers and instructions. I see now that it's the way to specify 64-bit floating point operations rather than 80-bit operations.

It's certainly possible that somebody might run a Chapel program on such older hardware, but it's not likely to be common, and any difference in the result can be explained by considering the 80-bit register to be using a different precision.

mppf commented 5 years ago

Regarding SIGFPE and LLVM optimizations:

That said, when writing low-level stuff where I really worry about IEEE754, I always read the assembler, and if I do not like it, I rewrite the upper-level code to get what I want. Also, I normally mess with fenv stuff sufficiently far away from where it impacts, and generally in a separate module.

Would requiring this approach be sufficient in the medium and/or long-term?

mppf commented 5 years ago

Also

On the question of SIGFPE, the answer is YES, I need it. Our own team uses it all the time. They use C++ (and occasionally modify old Fortran), but they will start using Chapel a bit/lot more in 2019/2020.

Can we just call the routines from 'fenv.h' from within Chapel using some external C linkage?

Can you say more about what your programs do when they get a SIGFPE? (I.e. what does the signal handler do? Or is there no custom signal handler, so that SIGFPE simply quits the program?).

The reason I'm asking this is that I really doubt that it will make sense to translate C signal handlers into Chapel native constructs. At the very least it would be quite a bit of work to specify what can / can't happen in a signal handler and how these handlers interact with the user-level tasks.

damianmoz commented 5 years ago

@mppf, looking at the three New Year replies, let me know if bundling my replies together is a problem.

I think we are on the same side of the river now with precision. I think Chapel can ignore the 80-bit stuff. For me, if I need anything in Chapel beyond 64 bits, it will be simple, low-level stuff, and I can live with writing it in C and linking it to Chapel. @dmk42 has a job for a long time if he wants it. I would love a Chapel intrinsic that does the reciprocal of a w-bit quantity as the sum of two w-bit precision numbers where the second is much smaller than the first. But that is about it as far as extended precision goes for me.

I am not sure where the community at large is going as far as binary precision beyond 64 bits. For 20 years the FP community have been talking about 128-bit in hardware. I do not think anybody does IEEE754 128-bit in hardware yet, although Wikipedia says RISC9 does; but I read the assembler manual and I cannot find the instructions. I think the IBM Z series has it, but I do not have access to anything like that. I would have thought that Cray's links with Intel might be useful on that topic. I have a few words on the topic that Professor Kahan sent me, but you probably need to chat about this with a numerical analyst far more knowledgeable than I. Once 128-bit becomes a reality, all these issues about 80-bit floating point will disappear.

On reviewing the assembler generated from a Chapel compilation: the other day I had my first look in years at the generated C code, had a heart attack, and uttered some expletives under my breath. I dread the assembler; I have yet to tread near that one again. I probably need a few cans of beer and a comfortable chair before I start to look at that problem anew. I will send you some other stuff in that issue I raised about optimizing arithmetic operations using complex numbers, and we can discuss it there if it makes sense.

On SIGFPE, I never catch it. I notice it when SIGFPE quits the program and tells me I screwed up, so I can go and fix my stupidity. I really see no need to translate C signal handlers into Chapel native constructs. That said, you might look at ALGLIB, a C++ scientific library. It messes with signal handlers but I have never ventured to find out why. It seems over the top to me.

mppf commented 5 years ago

@damianmoz - thanks for your responses!

I would love a Chapel intrinsic that does the reciprocal of a w-bit quantity as the sum of two w-bit precision numbers where the second is much smaller than the first.

Is there an issue about this? This sounds like a fairly specific feature request.

For 20 years the FP community have been talking about 128-bit in hardware.

I think I read somewhere that double-double is not so bad (at least as a way of writing the extended precision in software). Certainly one would expect hardware support to be faster.

https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format#Double-double_arithmetic

E.g. a library such as this: http://crd.lbl.gov/~dhbailey/mpdist/qd-2.3.22.tar.gz which is also a Debian package.
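
For reference, the error-free "two-sum" at the heart of double-double arithmetic is short; here is a minimal Chapel sketch (Knuth's TwoSum, which assumes the compiler does not reassociate the floating point operations):

    // Returns (s, e) where s = fl(a + b) and a + b = s + e exactly.
    proc twoSum(a: real, b: real): (real, real) {
      const s = a + b;
      const bv = s - a;                   // the part of b that made it into s
      const e = (a - (s - bv)) + (b - bv);
      return (s, e);
    }

    var (s, e) = twoSum(2.0**53, 1.0);
    writeln(e);                           // 1.0: the bit a plain + would lose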

On reviewing the assembler generated from a Chapel compilation

I usually just run the program with a sufficiently long-running test kernel in gdb and then hit control-c to break it, go to the thread actually doing the computation, and then dump the instructions actually being run in the inner loop. I don't actually look at the C code when trying to answer questions about which instructions were used. Sometimes I try to save the assembly during compilation but it can be hard to find the relevant loop.

On SIGFPE, I never catch it.

OK, I'll note this in my comment above summarizing.

mppf commented 5 years ago

That said, you might look at ALGLIB, a C++ scientific library. It messes with signal handlers but I have never ventured to find out why. It seems over the top to me.

I tried looking at ALGLIB 3.14.0 (the C++ GPL version) and didn't see any use of signal handlers. There are variables called signal for FFTs.

mppf commented 5 years ago

I've created a series of more specific issues to manage the future work from this issue: #11967 #11968 #11969 #11970.

I've changed this issue into an Epic and it's linked to the other issues.

damianmoz commented 1 year ago

I got asked today how the implementation of fused multiply-add in Chapel is going: either explicitly, with an inline proc fma(x, y, z) that maps straight to a hardware FMA instruction, or better still implicitly for x * y + z (and implicitly definitely not for (x * y) + z).
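
In the meantime, one possible workaround is to call C99's fma through Chapel's C interop; a sketch (it assumes Chapel's real maps to C double, and most C compilers lower fma to the hardware instruction when one exists):

    require "math.h";
    extern proc fma(x: real, y: real, z: real): real;

    var a = 1.0 + 2.0**(-27), b = 1.0 - 2.0**(-27), c = -1.0;
    writeln(fma(a, b, c));   // -2.0**(-54) exactly: a single rounding
    writeln(a*b + c);        // 0.0 in two roundings, unless the backend
                             // already contracted it to an FMA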

bradcray commented 12 months ago

@damianmoz : To my knowledge, we haven't put any specific effort into this recently apart from upgrading to newer versions of LLVM which hopefully makes the back-end more and more capable of automatically optimizing such cases.

bradcray commented 12 months ago

(also, if this is something you'd consider high priority, it might be worth spawning it off into its own issue. This one is so broad and long-lived that I think it would take a lot of work to get from the OP to a specific request w.r.t. FMA. Doing a quick search, https://github.com/chapel-lang/chapel/issues/21043 looks slightly more focused, though it also still talks about general classes of routines rather than a specific request w.r.t. FMA).