SDL-Hercules-390 / hyperion

The SDL Hercules 4.x Hyperion version of the System/370, ESA/390, and z/Architecture Emulator
Other
243 stars 91 forks source link

WHY static inline functions in header files? #392

Closed Fish-Git closed 3 years ago

Fish-Git commented 3 years ago

NOTE:  issues #396 "The most recent commit fails to build on Ubuntu" and #399 "warning: ‘fetch_dw_noswap’ is static but used in inline function ‘z900_vfetch8’ which is not static" are both closely related to this issue.


I'm wondering what the advantage of defining functions as static inline in header files is? Especially when the functions in question are more than one or two or maybe three lines long. It seems to me doing so has more disadvantages than advantages, with perhaps the biggest disadvantage being unnecessary consumption of instruction cache, thereby making the program less efficient than otherwise:

Is there some actual advantage to having functions declared/defined as static inline in header files as opposed to just doing things the normal way of simply declaring the functions in header files with the actual body of the function defined in a separate .c source file?

Why is so much or Hercules code written this way? (e.g. dat.h, vstore.h, inline.h hinlines.h, etc)  It certainly can't be for performance! So why are we doing it? What is the advantage? What am I missing?

ivan-w commented 3 years ago

Because a lot of optimization occurs by doing so. When a function is called, is static and is marked as eligible to be inlined and the call contain some constants (or can be tracked back to involve constants) a lot of generated code is simplified to the extreme (sometimes even generating NO code at all, or rescheduling instruction flow in an optimal way for the target). For example, SSA (Single Symbolic Assignment) Tree optimization is highly efficient and relies a lot on having a maximum of information from the program source. If the function is external (in another file), no additional information is known and the function is pretty much forced to perform the complete process - whereas when it is static/inline is does just and ONLY what is needed.

Basically, a static/inline function and a macro yield the same performance (with the additional advantage of being able to share whatever is needed across unrelated functions in the same file). For example:

File foo.c

extern int bar(int);
int foo()
{
   return bar(10);
}

file bar.c

int bar(int x)
{
    int i,c;
    for(i=0,c=0;i<x;i++)
    {
        c+=i;
    }
    return c;
}

Compiling this (for x64 using gcc -O3) : foo : (basically call bar)

    .file   "foo.c"
    .text
    .p2align 4,,15
    .globl  foo
    .type   foo, @function
foo:
.LFB0:
    .cfi_startproc
    movl    $10, %edi
    jmp bar@PLT
    .cfi_endproc
.LFE0:
    .size   foo, .-foo
    .ident  "GCC: (Debian 8.3.0-6) 8.3.0"
    .section    .note.GNU-stack,"",@progbits

bar : (does the actual calculation...)

    .file   "bar.c"
    .text
    .p2align 4,,15
    .globl  bar
    .type   bar, @function
bar:
.LFB0:
    .cfi_startproc
    testl   %edi, %edi
    jle .L8
    leal    -1(%rdi), %eax
    cmpl    $17, %eax
    jbe .L9
    movl    %edi, %edx
    movdqa  .LC0(%rip), %xmm1
    xorl    %eax, %eax
    pxor    %xmm0, %xmm0
    movdqa  .LC1(%rip), %xmm2
    shrl    $2, %edx
    .p2align 4,,10
    .p2align 3
.L4:
    addl    $1, %eax
    paddd   %xmm1, %xmm0
    paddd   %xmm2, %xmm1
    cmpl    %edx, %eax
    jne .L4
    movdqa  %xmm0, %xmm1
    movl    %edi, %edx
    psrldq  $8, %xmm1
    andl    $-4, %edx
    paddd   %xmm1, %xmm0
    movdqa  %xmm0, %xmm1
    psrldq  $4, %xmm1
    paddd   %xmm1, %xmm0
    movd    %xmm0, %eax
    cmpl    %edx, %edi
    je  .L13
    .p2align 4,,10
    .p2align 3
.L7:
    addl    %edx, %eax
    addl    $1, %edx
    cmpl    %edx, %edi
    jg  .L7
    ret
    .p2align 4,,10
    .p2align 3
.L13:
    ret
    .p2align 4,,10
    .p2align 3
.L8:
    xorl    %eax, %eax
    ret
.L9:
    xorl    %eax, %eax
    xorl    %edx, %edx
    jmp .L7
    .cfi_endproc
.LFE0:
    .size   bar, .-bar
    .section    .rodata.cst16,"aM",@progbits,16
    .align 16
.LC0:
    .long   0
    .long   1
    .long   2
    .long   3
    .align 16
.LC1:
    .long   4
    .long   4
    .long   4
    .long   4
    .ident  "GCC: (Debian 8.3.0-6) 8.3.0"
    .section    .note.GNU-stack,"",@progbits

Now let's bring the 2 together :

foobar.c:

static __inline__ int bar(int x)
{
    int i,c;
    for(i=0,c=0;i<x;i++)
    {
        c+=i;
    }
    return c;
}
int foo()
{
    return bar(10);
}

And compile it :

    .file   "foobar.c"
    .text
    .p2align 4,,15
    .globl  foo
    .type   foo, @function
foo:
.LFB1:
    .cfi_startproc
    movl    $45, %eax
    ret
    .cfi_endproc
.LFE1:
    .size   foo, .-foo
    .ident  "GCC: (Debian 8.3.0-6) 8.3.0"
    .section    .note.GNU-stack,"",@progbits
~

(Yeah basically, gcc figured I was being stupid so fixed it for me.. sum of 0 to n-1 is (n*(n-1))/2

The conclusion is : always give a maximum of information to the compiler and let it use it !

Fish-Git commented 3 years ago

Because a lot of optimization occurs by doing so.

But is this "optimization" that is being done actually beneficial when you take into consideration the increased instruction cache that is being consumed by doing so? If a given static inline function is used in 100+ different places throughout your program, you end up consuming 100+ times the amount of instruction cache than otherwise. Instruction cache that could have been better served by keeping other more frequently used code in cache instead.

Doesn't it make better sense to let the optimizer decide for itself whether it is faster to inline a given function or to simply call it instead? That's what compiler optimizers excel at: cost benefit analysis.

If the function is external (in another file), no additional information is known and the function is pretty much forced to perform the complete process...

That's not true! Why do you say that? What information doesn't a compiler have that it wouldn't otherwise have? A compiler doesn't use source code to perform optimizations with. It does its optimization to the actual object code (machine / assembler / intermediate) itself. So what difference does it make whether the function it's trying to inline is "local" to the source file being compiled or global to the entire program? It knows it's declared "inline" so it can examine everywhere it is being called throughout the entire program (all 100+ places) and choose or not to choose to inline it and each of those places or to let some of them make a function call to a common function instead because it determined doing so would be faster.

Why do so many programmers think they can optimize better than the compiler? Just code the dang function and let the optimizer do its job! That's what it's for! YOUR job is to just write functionally correct easy to maintain and efficient code!

For example :

Extremely poor example which does NOT properly illustrate the issue at hand. The static inline functions in dat.h, vstore.h, inline.h hinlines.h, etc. are hardly one or two or even three line functions. They're quite a bit longer.

Further, as explained further above, if the function in question is used in hundreds of places all throughout your program, then by declaring it static inline (instead of just inline), you're consuming more instruction cache than otherwise necessary and introducing code bloat as well (making the program larger than necessary).

Me? I prefer smaller and likely faster rather than bigger and likely slower.

But ultimately it's up to the compiler to decide which would be best, not the arrogant developer who thinks they're smarter than it is.

ivan-w commented 3 years ago

I'd wager that if the compiler thinks it's not worthwhile to be inlined then... it won't! The only thing thatinlinedoes is make the function eligible for inlining (and therefore, you can't get the function address, but we don't need that). If a function is bound to be called by multiple callers within a file, then it's good practice IMHO to static inline it.

Furthermore, with the functions such aslogical_to_main()and the like, this usually resolves to 3 or 4 instructions (a test, a conditional branch, a load and a xor). If the entry is not in the TLB then a more complete version is going to kick in and this will eventually lead to a call to an external function.

Fish-Git commented 3 years ago

I'd wager that if the compiler thinks it's not worthwhile to be inlined then... it won't!

The only thing thatinlinedoes is make the function eligible for inlining...

Correct.

It is only a request to the compiler to please consider inlining it. But it is not a demand. Just because a function is declared inline doesn't mean it will be. It just means it might be, if the compiler determines that doing so would result in faster code and it is able to do so.

We all already know that, Ivan!

But that is not the issue I'm arguing here!

Furthermore, with the functions such aslogical_to_main()and the like, this usually resolves to 3 or 4 instructions (a test, a conditional branch, a load and a xor).

Where are you seeing this? I'm not disagreeing with you! I'm just unable to find the code that's doing this. Unless you're maybe talking about themaddr_lfunction indat.hwhere it's doing a TLB hit test and then doing aMAINADDRon a TLB hit? (which isn't a function by the way, but rather is a macro that does a simple xor of the mainstor address from the TLB entry with the logical address being translated)

But again, I think you're missing the point!

If a given function that is declared in the header file as static cannot be inlined into a given caller, that caller ends up making a function call to ........ who? What "function" does it end up calling, Ivan? Yep! That's right! That "private" static function!

And if you have dozens or even hundreds of source modules, each one of which contains code that calls that very same function, each of them (if the call cannot be inlined) will end up calling their own private "static" copy of that function!

Static functions are not globally visible across compilation units. "extern" functions are, but not static functions. As a result, you'll end up with your program calling the exact same logical code but in dozens of different physical locations! (thereby consuming dozens of times more host process instruction cache than otherwise had the function instead been declared "extern inline" rather than "static inline" like we're currently doing).

Very common (and potentially "hot" functions) should not IMHO be declared as static inline functions!

They should instead be declared as extern inline functions!

This allows the compiler to decide whether to inline it or not just like always, but prevents the problem I just described. The function's actual code will physically exist in only ONE physical place, and everywhere across the entire product where that function is called (because it couldn't be inlined) will all end up calling that ONE function in that ONE physical location.

With your way each compilation unit ends up calling their own private copy of that very same function, when IMHO it would be much more efficient to eliminate such physical code redundancy by having them all call the same common globally visible function instead, thereby consuming the absolute minimum host processor instruction cache.

Or am I crazy and missing something obvious? Because as I said in this issue's opening remarks, I personally see no advantage -- neither coding-wise nor especially not code-execution performance-wise -- of defining functions as static inline as opposed to simply declaring them as extern inline instead.

Fish-Git commented 3 years ago

Here is a simple test program to illustrate why using "static inline" is bad and why using "extern inline" is preferred:

(I used Visual Studio on Windows to build and run the test so if you want to try it on your own non-Windows system you will have to manually construct your own makefile for it. I doubt doing so would be that hard.)

Here's the output it produces (with some blank lines thrown in):

003510BE = a_extern_inline_func, called by a
0035117C = b_extern_inline_func, called by a
00351096 = c_extern_inline_func, called by a

003514F0 = a_static_inline_func, called by a
00351560 = b_static_inline_func, called by a
003515D0 = c_static_inline_func, called by a

003510BE = a_extern_inline_func, called by b
0035117C = b_extern_inline_func, called by b
00351096 = c_extern_inline_func, called by b

00351750 = a_static_inline_func, called by b
003517C0 = b_static_inline_func, called by b
00351830 = c_static_inline_func, called by b

003510BE = a_extern_inline_func, called by c
0035117C = b_extern_inline_func, called by c
00351096 = c_extern_inline_func, called by c

003519B0 = a_static_inline_func, called by c
00351A20 = b_static_inline_func, called by c
00351A90 = c_static_inline_func, called by c

Here's the exact same output manually re-sequenced to better illustrate my point:

003510BE = a_extern_inline_func, called by a
003510BE = a_extern_inline_func, called by b
003510BE = a_extern_inline_func, called by c

0035117C = b_extern_inline_func, called by a
0035117C = b_extern_inline_func, called by b
0035117C = b_extern_inline_func, called by c

00351096 = c_extern_inline_func, called by a
00351096 = c_extern_inline_func, called by b
00351096 = c_extern_inline_func, called by c

003514F0 = a_static_inline_func, called by a
00351750 = a_static_inline_func, called by b
003519B0 = a_static_inline_func, called by c

00351560 = b_static_inline_func, called by a
003517C0 = b_static_inline_func, called by b
00351A20 = b_static_inline_func, called by c

003515D0 = c_static_inline_func, called by a
00351830 = c_static_inline_func, called by b
00351A90 = c_static_inline_func, called by c

As you can see, each of the "static inline" functions are at a different physical address, thereby unnecessarily consuming valuable host processor instruction cache in order to execute the exact same code!

But if you declare the functions as "extern inline" instead, each function exists in only ONE location in memory, thereby making efficient use of valuable host processor instruction cache.

Note that this is true ONLY for "inline" functions which the compiler, for whatever reason, determines that it cannot actually inline the function. For instances where the compiler decides to actually inline the function, there would of course be a small increase in host processor instruction cache use, but the compiler optimizer has already taken that into consideration when it performed its cost-benefit analysis to decide whether by inlining the function whether the resulting code would be faster or slower.

We need to change all of our static inlines to extern inlines instead (with the actual function body defined in some .c file somewhere) as I am certain doing so will provide a non-insignificant improvement in Hercules's overall performance (not to mention make it easier to read and maintain too!).

I am going to mark this issue as "Waiting to Close" to give each of you (especially Ivan! @ivan-w) a chance to provide feedback if any of you feel so inclined to do so.

Thank you for listening.

Fish-Git commented 3 years ago

FYI #1: commit 49e2723ea19d556473be372f3a14354f848bd18a implements the changes.

FYI #2: Linux builds are failing dramatically with a lot of warnings similar to: "warning: inline function 's370_vfetch4' declared but never defined" and I don't know why!   :(

It builds and runs perfectly fine on Windows using Visual Studio! I have no fricking clue why it's failing on Linux, and worse, I have no ficking clue how to fix it!   :(

HELP!   :(

wrljet commented 3 years ago

As inline was not originally in C, and got added by various compilers, and only later added to the standards, it has varied implementations.

The author explains some of the ISO standard and gcc rules.

There's a large handful of switches that affect the behavior of gcc inlining. (because of course there are)

A key point is:

      "For a function defined with extern inline, stand-alone object       code is never emitted. You can have multiple such definitions       and your program will still work. However, you should add       a non-inline definition somewhere too, in case the function       is not inlined everywhere."

At the bottom of the page he describes a few portable strategies.

Fish-Git commented 3 years ago

So what needs to be done to fix the problem? Anyone?

Fish-Git commented 3 years ago

Eureka!

I found the answer to my problem on stackoverflow, and it's actually amazingly simple!

Here's the new version of my test program that now compiles and runs identically on both Windows and_Linux:

To build it, no makefile is needed.  Just use:

gcc  a.c  b.c  c.c   inline.c  -o inline  -I.

Tested on my VMware KDE Neon 5.20 virtual machine with gcc 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04).

I'm going to try and make the needed changes to Hercules to fix our current build problem and commit the fix when I'm done. I'll be sure to test it on my Neon system before committing it though! I would have done that before making my previous commit but I honestly thought my change was rather non-controversial. I never expected gcc to barf the way it did. Silly me.

And for those interested (who don't want to bother building my little test program for themselves), here's the results I get on my KDE Neon 5.20 system (reformatted to illustrate the problem and how my fix actually corrects the problem):

0x55ef1e28f1f7 = a_extern_inline_func, called by a
0x55ef1e28f1f7 = a_extern_inline_func, called by b
0x55ef1e28f1f7 = a_extern_inline_func, called by c

0x55ef1e28f332 = b_extern_inline_func, called by a
0x55ef1e28f332 = b_extern_inline_func, called by b
0x55ef1e28f332 = b_extern_inline_func, called by c

0x55ef1e28f46d = c_extern_inline_func, called by a
0x55ef1e28f46d = c_extern_inline_func, called by b
0x55ef1e28f46d = c_extern_inline_func, called by c

0x55ef1e28f149 = a_static_inline_func, called by a
0x55ef1e28f284 = a_static_inline_func, called by b
0x55ef1e28f3bf = a_static_inline_func, called by c

0x55ef1e28f183 = b_static_inline_func, called by a
0x55ef1e28f2be = b_static_inline_func, called by b
0x55ef1e28f3f9 = b_static_inline_func, called by c

0x55ef1e28f1bd = c_static_inline_func, called by a
0x55ef1e28f2f8 = c_static_inline_func, called by b
0x55ef1e28f433 = c_static_inline_func, called by c

As you can see it looks identical to the Windows example I posted previously.

Bottom line:   "static inline" is bad.   "extern inline" is good.

wrljet commented 3 years ago

Fish,

I guess you didn't bother to read the article I posted in the morning. At the bottom, as mentioned, was the solution.

Bill

Fish-Git commented 3 years ago

I guess you didn't bother to read the article I posted in the morning. At the bottom, as mentioned, was the solution.

I did read it, and is the technique I thought I had implemented! (albeit without the #if __GNUC__ && !__GNUC_STDC_INLINE__ check however).

I declared/defined the inline functions in the .h files with just inline, and used the extern inline declaration in just the .c file, as documented as being "the conforming way" in that stackoverflow post I referenced, as I didn't think the technique that your reference suggested was actually needed).

I guess I was wrong? Maybe it is actually needed?  Dunno yet.  <shrug>

I'm going to first try fixing all the fetch_dw_noswap et al. functions from static inline to just inline (with the corresponding extern inline declaration in a separate .c file) technique, just like I had to do with all the other inline functions I originally had change in order to fix the original issue #396 problem (i.e. my commit 328266663492b8ce7d7473000f6ca441750421d4 technique).

If doing that still doesn't fix things however, then I'll try the #if __GNUC__ && !__GNUC_STDC_INLINE__ technique.

After first slitting my throat.

wrljet commented 3 years ago

After first slitting my throat.

Yeah, I hear you!

Fish-Git commented 3 years ago

I'm going to first try fixing all the fetch_dw_noswap et al. functions from static inline to just inline (with the corresponding extern inline declaration in a separate .c file) technique, just like I had to do with all the other inline functions I originally had change in order to fix the original issue #396 problem (i.e. my commit 3282666 technique).

If doing that still doesn't fix things however, then I'll try the #if __GNUC__ && !__GNUC_STDC_INLINE__ technique.

After first slitting my throat.

I guess I get to live.   :)

Commit 173bb145832e1e7cfb87cdfb354c9d8c70439cda (and 38d6835b7460737a83aca078c259c11167a6ef63) has (have) fixed the problem. Hercules now successfully builds -- AND RUNS! -- with most of our original static inline functions converted to inline + extern inline.

I haven't redone any of my performance comparisons yet, but I will, and will post the results here. I suspect performance will improve even more than before now that the machdep and hbyteswp functions have been converted too. I guess we'll see shortly, eh?

Fish-Git commented 3 years ago

I haven't redone any of my performance comparisons yet, but I will, and will post the results here. I suspect performance will improve even more than before now that the machdep and hbyteswp functions have been converted too. I guess we'll see shortly, eh?

I lied.

For the past month I've done many dozens of performance runs and the results have always been very inconsistent. Sometimes I get improved performance, sometimes I get worse performance.

I tried averaging performance runs (and by "performance runs" I mean 'maxrates' reported MIPS rate) over 5 to 10 separate runs and they're still inconsistent. I tried tossing out the lowest and highest and averaging the remaining, using just the best rate of all runs, etc, and still the rates are inconsistent.

So screw it. Until we can somehow design a consistent and reliable means of measuring increases or decreases in performance, I'm not going to overly concern myself with performance. Only if the performance consistently decreases in a noticeable dramatic fashion will I then consider it to be something to worry about.

Of course that's coming from someone whose initial decision to change static inlines to normal inlines in the first damn place was based on the premise that the latter was more efficient and thus should theoretically increase overall performance too, so take what I said with a grain of salt.

ANYWAY... I'm tired of working on this issue so am closing it at this time. Feel free to re-open it again if you feel the need to. My apologies for perhaps wasting(?) everyone's time? :(