SDL-Hercules-390 / hyperion

The SDL Hercules 4.x Hyperion version of the System/370, ESA/390, and z/Architecture Emulator
Other
240 stars 90 forks source link

Diagnose F09 instcount issues? #584

Closed AndrewLowe65 closed 1 year ago

AndrewLowe65 commented 1 year ago

Hello,

I am trying to use diagnose F09 to measure individual CP instcount, but the results I get don't make sense to me: Issuing it twice with some instructions in between I get a delta of one.

Below a simple test case:

*
* DIAG xF09 test
*
sysclear
archmode esame
*
r 1a0=00000000800000000000000000000200 #       z/Arch restart PSW
r 1d0=0002000080000000000000000000DEAD #       z/Arch pgm new PSW
*
r 200=41D00300       # LA R13,300                Load R13 = 300
*
r 204=58ED0000       # L 14,0(,13)               R14 = DIAGOPTS
r 208=83FE0F09       # DIAG X'F09'               Issue diagnose request
r 20C=E3F0D0040024   # STG R15,4(,13)            store instruction count
*
r 212=58ED0000       # L 14,0(,13)               Just few instructions..
r 216=58ED0000       # L 14,0(,13)               ...
r 21A=58ED0000       # L 14,0(,13)               ...
r 21E=58ED0000       # L 14,0(,13)               ...
r 222=58ED0000       # L 14,0(,13)               ...
r 226=58ED0000       # L 14,0(,13)               ...
r 22A=58ED0000       # L 14,0(,13)               ...
r 22E=58ED0000       # L 14,0(,13)               ...
r 232=58ED0000       # L 14,0(,13)               Just few instructions..
*
r 236=58ED0000       # L 14,0(,13)               R14 = DIAGOPTS
r 23A=83FE0F09       # DIAG X'F09'               Issue diagnose request
r 23E=E3F0D00C0024   # STG R15,C(,13)            store instruction count
*
r 244=0000           # DC H'0'                   Force disabled wait
*
r 300=00010000       # DIAGOPTS DC X'00010000'   diagxF09Options
*
ostailor null
*
restart

Diagnose results before and after are stored at address 304 and 30C. They show 00000001 and 00000002

HHC00809I Processor CP00: disabled wait state 0002000080000000 000000000000DEAD
HHC01603I r 300
HHC02290I A:0000000000000000  K:06
HHC02290I R:0000000000000300  00010000 00000000 00000001 00000000  ................
HHC02290I R:0000000000000310  00000002 00000000 00000000 00000000  ................
HHC02290I R:0000000000000320  00000000 00000000 00000000 00000000  ................
HHC02290I R:0000000000000330  00000000 00000000 00000000 00000000  ................

Interestingly, when ran in instruction step mode, the program behaves as I would expect.

HHC00809I Processor CP00: disabled wait state 0002000080000000 000000000000DEAD
HHC01603I r 300
HHC02290I A:0000000000000000  K:06
HHC02290I R:0000000000000300  00010000 00000000 00000002 00000000  ................
HHC02290I R:0000000000000310  0000000D 00000000 00000000 00000000  ................
HHC02290I R:0000000000000320  00000000 00000000 00000000 00000000  ................
HHC02290I R:0000000000000330  00000000 00000000 00000000 00000000  ................

Thank you

AndrewLowe65 commented 1 year ago

Forgot to mention :

Hercules version 4.5.0.10830-SDL-g58578601

Thank you.

Fish-Git commented 1 year ago

By design, Hercules's instruction counter is not an absolutely accurate value, It is only an approximation.

By design, it is not updated after every instruction. Instead, it is only updated approximately once every "few" instructions (where "few" is currently "256").

This is due to the overhead involved with checking for interrupts during instruction execution. If Hercules checked for an interrupt after each instruction was executed, Hercules's performance would suffer. Hercules's instruction execution speed would slow to a crawl. Thus we only check for an interrupt only after every "few" (256) instructions or so (approximately).

The reason it works as expected when instruction stepping is on is because when instruction stepping is on, Hercules only executes one instruction at a time.

Since I'm guessing your learned about Hercules's Diagnose F09 by studying Hercules's code, here is the pertinent section of the run_cpu function's instruction execution loop that I am referring to that maintains Hercules's instruction counter:

https://github.com/SDL-Hercules-390/hyperion/blob/65c97fd6c4dcecd9ad64b89e65432e1bda9f7515/cpu.c#L2069-L2092

As you can see, Hercules does not check for interrupts after each instruction. It does so only approximately once every MAX_CPU_LOOPS instructions (currently #defined to 256).

And it also does not update its instruction counter after every executed instruction either. Instead, it updates its instruction counter only approximately once every MAX_CPU_LOOPS instructions too (currently #defined to 256).

This is by design for performance reasons.

AndrewLowe65 commented 1 year ago

Thank you very much for your kind and prompt explanation. I thought to a similar reason, which makes perfect sense to me.

Thank you also for the reference to the relevant section of run_cpu. I didn't know it.

While I consider the approximated value currently provided by diag F09 fair enough for common use, do you think that reducing the MAX_CPU_LOOPS value in a custom-built version to something like 32 would be something worth a try? Anything else that should be changed to make it work?

Of course I am willing to accept the associated performance degradation.

Thanks! Andrew

Fish-Git commented 1 year ago

... do you think that reducing the MAX_CPU_LOOPS value in a custom-built version to something like 32 would be something worth a try?

Me? No.

You? I don't know. I'm not you. Only you can answer that.

AndrewLowe65 commented 1 year ago

OK, I thought that reducing MAX_CPU_LOOPS by itself may not be enough or may have other negative side effects.

Yes, for me it is worth a try and I will do it.

Thanks!

Fish-Git commented 1 year ago

... I thought that reducing MAX_CPU_LOOPS ... may have other negative side effects.

It will most definitely have a negative side effect: decreased performance.

The question is, how much of a negative side effect will it have? Only a little? Or a lot?

The answer to that question is one I cannot provide. I don't know how much of a negative side effect it will have, because I have never tried it myself, nor have I tried to measure it. Again, only YOU will be able to answer that question.

Yes, for me it is worth a try and I will do it.

Good luck!

Please do let us know whether it works okay for you.

AndrewLowe65 commented 1 year ago

Following up my original request, I would like to share my experience fiddling with MAX_CPU_LOOPS.

I've recompiled Hercules setting MAX_CPU_LOOPS to 32. This was the only change I made. The modified version works with no apparent issues.

I ran some tests under MVS 3.8 and I measured a 30% performance hit. This is significant, but it is something you may be willing to pay to get a more accurate count from Diagnose F09.

All in all I am happy with the results, and I thought it may be helpful to share them.

Thanks!

Fish-Git commented 1 year ago

I've recompiled Hercules setting MAX_CPU_LOOPS to 32.

I ran some tests under MVS 3.8 and I measured a 30% performance hit.

Which is precisely why MAX_CPU_LOOPS currently defaults to 256.

This is significant, but it is something you may be willing to pay to get a more accurate count from Diagnose F09.

All in all I am happy with the results, and I thought it may be helpful to share them.

If you're happy, then we're happy. But as explained, Hercules is working as designed and we will not be changing MAX_CPU_LOOPS to a smaller value simply so Diagnose F09 returns approximately more accurate results for frequent queries. Sorry!

If you don't query the instruction count so frequently (i.e. if you don't issue Diagnose F09 so frequently), the results it provides should be accurate enough (IMHO) even when MAX_CPU_LOOPS remains set to its default value of 256. The way I would do it would be by time: only issue Diagnose F09 once every second or so.

I was thinking that what we might consider doing is changing the current hard-coded MAX_CPU_LOOPS constant into a variable value instead, along with a command that would allow you to dynamically modify it at will. This would allow you to temporarily change its value to something much smaller in order to measure the number of instructions executed for a short guest instruction path, and then reset it back to its default value afterwards. That would allow you to obtain a more accurate Diagnose F09 value on a per-need basis, while minimizing the impact to Hercules's performance during normal operation when such measurements are not needed.

However, even doing that (i.e. even introducing that one change of making the CPU loop value into a variable) would likely introduce its own measurable performance impact as well, so I would personally be be very reluctant to do it. Nevertheless, it is something to consider.

For now, I am keeping this issue closed as "WON'T FIX: The requested change was rejected or the described behavior is by design."

AndrewLowe65 commented 1 year ago

I would love to see your idea of dynamically adjustable MAX_CPU_LOOPS implemented, it would also allow to increase the value and see if it brings additional efficiency.

I don't know hercules' code, but I know some C, and looking at how other commands are implemented I may give it a try at coding the new one. What do you think?

Having the MAX_CPU_LOOPS constant replaced by a variable stored in an "Hercules architected" place would be a good start.

Thanks!

Fish-Git commented 1 year ago

I don't know hercules' code, but I know some C, and looking at how other commands are implemented I may give it a try at coding the new one. What do you think?

Go for it!

AndrewLowe65 commented 1 year ago

OK, you asked for it..

Attached a patch against the current master branch, which introduces the maxcpuloops command (the name sounds ugly to me; Feel free to use, change or ignore it):

All the new logic is subject to the definition of the VARIABLE_MAX_CPU_LOOPS macro, otherwise the existing logic is used.

Oh, wherever the MAX_CPU_LOOPS symbol is used, the code actually divides it by 2. For example in cpu.c:

for (i=0; i < MAX_CPU_LOOPS/2; i++)

The new code allows to perform the calculation once, when the sysblk.maxcpuloops value is set. I didn't do it yet to closely mimic the current behavior, but I think it would be good.

I hope it helps.

Fish-Git commented 1 year ago
  • MAXCPULOOPS.patch.txt

Thanks!

I'll take a look at it and give is a try just as soon as I can.

(Currently working on (looking into) several things at once (as usual!), so I can't say how soon I will be able to give it a try, but rest assured I will eventually give it a try!)

Fish-Git commented 1 year ago

What's the purpose of cfg_maxcpuloops? It's defined and set, but never used anywhere.

I'm guessing you were maybe thinking your new maxcpuloops command would also support "reset" as an argument to allow the user to reset if back to its original value. Yes? If so, then you forgot to code that logic in your maxcpuloops_cmd function.   :)

AndrewLowe65 commented 1 year ago

Hi Fish, thank you for reviewing the patch.

Yes, the purpose of cfg_maxcpuloops is to support the reset function, and yes, I forgot to add it. I will do it in the next iteration, if any..

Regarding the performance impact of the change, I had a look at the generated assembly code for the current version, and for a future version where the maxcpuloops variable is initialized to MAX_CPU_LOOPS/2 to avoid to perform the divide into the execution loop. These are the cases generated by gcc using the -O2 optimization option.

Current code:

    for (int i=0; i < MAX_CPU_LOOPS/2; i++)
    {
        call ...
    }

Generated assembly code

.L2:
    movl    %ebx, %edx
    movq    %rbp, %rsi
    movl    $1, %edi
    xorl    %eax, %eax
    call    ...
    addl    $1, %ebx
    cmpl    $128, %ebx
    jne .L2

Proposed code:

    for (int i=0; i < maxcpuloops; i++)
    {
        call ...
    }

Generated assembly code

.L2:
    movl    %ebx, %edx
    movq    %rbp, %rsi
    movl    $1, %edi
    xorl    %eax, %eax
    call    ...
    addl    $1, %ebx
    cmpl    %ebx, 4(%rsp)
    jg  .L2

The only noticeable difference is the cmpl instruction that is currently executed using the $128 constant value (256/2) and in the proposed code would use the variable pointed by 4(%rsp). Not a big deal I think, especially if the code can be disabled using the VARIABLE_MAX_CPU_LOOPS macro.

Thanks! Andrew