Closed AndrewLowe65 closed 1 year ago
Forgot to mention :
Hercules version 4.5.0.10830-SDL-g58578601
Thank you.
By design, Hercules's instruction counter is not an absolutely accurate value, It is only an approximation.
By design, it is not updated after every instruction. Instead, it is only updated approximately once every "few" instructions (where "few" is currently "256").
This is due to the overhead involved with checking for interrupts during instruction execution. If Hercules checked for an interrupt after each instruction was executed, Hercules's performance would suffer. Hercules's instruction execution speed would slow to a crawl. Thus we only check for an interrupt only after every "few" (256) instructions or so (approximately).
The reason it works as expected when instruction stepping is on is because when instruction stepping is on, Hercules only executes one instruction at a time.
Since I'm guessing your learned about Hercules's Diagnose F09 by studying Hercules's code, here is the pertinent section of the run_cpu
function's instruction execution loop that I am referring to that maintains Hercules's instruction counter:
As you can see, Hercules does not check for interrupts after each instruction. It does so only approximately once every MAX_CPU_LOOPS
instructions (currently #defined to 256).
And it also does not update its instruction counter after every executed instruction either. Instead, it updates its instruction counter only approximately once every MAX_CPU_LOOPS
instructions too (currently #defined to 256).
This is by design for performance reasons.
Thank you very much for your kind and prompt explanation. I thought to a similar reason, which makes perfect sense to me.
Thank you also for the reference to the relevant section of run_cpu. I didn't know it.
While I consider the approximated value currently provided by diag F09 fair enough for common use, do you think that reducing the MAX_CPU_LOOPS value in a custom-built version to something like 32 would be something worth a try? Anything else that should be changed to make it work?
Of course I am willing to accept the associated performance degradation.
Thanks! Andrew
... do you think that reducing the MAX_CPU_LOOPS value in a custom-built version to something like 32 would be something worth a try?
Me? No.
You? I don't know. I'm not you. Only you can answer that.
OK, I thought that reducing MAX_CPU_LOOPS by itself may not be enough or may have other negative side effects.
Yes, for me it is worth a try and I will do it.
Thanks!
... I thought that reducing MAX_CPU_LOOPS ... may have other negative side effects.
It will most definitely have a negative side effect: decreased performance.
The question is, how much of a negative side effect will it have? Only a little? Or a lot?
The answer to that question is one I cannot provide. I don't know how much of a negative side effect it will have, because I have never tried it myself, nor have I tried to measure it. Again, only YOU will be able to answer that question.
Yes, for me it is worth a try and I will do it.
Good luck!
Please do let us know whether it works okay for you.
Following up my original request, I would like to share my experience fiddling with MAX_CPU_LOOPS
.
I've recompiled Hercules setting MAX_CPU_LOOPS
to 32. This was the only change I made. The modified version works with no apparent issues.
I ran some tests under MVS 3.8 and I measured a 30% performance hit. This is significant, but it is something you may be willing to pay to get a more accurate count from Diagnose F09.
All in all I am happy with the results, and I thought it may be helpful to share them.
Thanks!
I've recompiled Hercules setting
MAX_CPU_LOOPS
to 32.I ran some tests under MVS 3.8 and I measured a 30% performance hit.
Which is precisely why MAX_CPU_LOOPS
currently defaults to 256.
This is significant, but it is something you may be willing to pay to get a more accurate count from Diagnose F09.
All in all I am happy with the results, and I thought it may be helpful to share them.
If you're happy, then we're happy. But as explained, Hercules is working as designed and we will not be changing MAX_CPU_LOOPS
to a smaller value simply so Diagnose F09 returns approximately more accurate results for frequent queries. Sorry!
If you don't query the instruction count so frequently (i.e. if you don't issue Diagnose F09 so frequently), the results it provides should be accurate enough (IMHO) even when MAX_CPU_LOOPS
remains set to its default value of 256. The way I would do it would be by time: only issue Diagnose F09 once every second or so.
I was thinking that what we might consider doing is changing the current hard-coded MAX_CPU_LOOPS
constant into a variable value instead, along with a command that would allow you to dynamically modify it at will. This would allow you to temporarily change its value to something much smaller in order to measure the number of instructions executed for a short guest instruction path, and then reset it back to its default value afterwards. That would allow you to obtain a more accurate Diagnose F09 value on a per-need basis, while minimizing the impact to Hercules's performance during normal operation when such measurements are not needed.
However, even doing that (i.e. even introducing that one change of making the CPU loop value into a variable) would likely introduce its own measurable performance impact as well, so I would personally be be very reluctant to do it. Nevertheless, it is something to consider.
For now, I am keeping this issue closed as "WON'T FIX: The requested change was rejected or the described behavior is by design."
I would love to see your idea of dynamically adjustable MAX_CPU_LOOPS implemented, it would also allow to increase the value and see if it brings additional efficiency.
I don't know hercules' code, but I know some C, and looking at how other commands are implemented I may give it a try at coding the new one. What do you think?
Having the MAX_CPU_LOOPS constant replaced by a variable stored in an "Hercules architected" place would be a good start.
Thanks!
I don't know hercules' code, but I know some C, and looking at how other commands are implemented I may give it a try at coding the new one. What do you think?
Go for it!
OK, you asked for it..
Attached a patch against the current master branch, which introduces the maxcpuloops
command (the name sounds ugly to me; Feel free to use, change or ignore it):
All the new logic is subject to the definition of the VARIABLE_MAX_CPU_LOOPS
macro, otherwise the existing logic is used.
Oh, wherever the MAX_CPU_LOOPS
symbol is used, the code actually divides it by 2. For example in cpu.c:
for (i=0; i < MAX_CPU_LOOPS/2; i++)
The new code allows to perform the calculation once, when the sysblk.maxcpuloops
value is set. I didn't do it yet to closely mimic the current behavior, but I think it would be good.
I hope it helps.
- MAXCPULOOPS.patch.txt
Thanks!
I'll take a look at it and give is a try just as soon as I can.
(Currently working on (looking into) several things at once (as usual!), so I can't say how soon I will be able to give it a try, but rest assured I will eventually give it a try!)
What's the purpose of cfg_maxcpuloops
? It's defined and set, but never used anywhere.
I'm guessing you were maybe thinking your new maxcpuloops
command would also support "reset" as an argument to allow the user to reset if back to its original value. Yes? If so, then you forgot to code that logic in your maxcpuloops_cmd
function. :)
Hi Fish, thank you for reviewing the patch.
Yes, the purpose of cfg_maxcpuloops
is to support the reset function, and yes, I forgot to add it. I will do it in the next iteration, if any..
Regarding the performance impact of the change, I had a look at the generated assembly code for the current version, and for a future version where the maxcpuloops variable is initialized to MAX_CPU_LOOPS/2 to avoid to perform the divide into the execution loop. These are the cases generated by gcc using the -O2 optimization option.
Current code:
for (int i=0; i < MAX_CPU_LOOPS/2; i++)
{
call ...
}
Generated assembly code
.L2:
movl %ebx, %edx
movq %rbp, %rsi
movl $1, %edi
xorl %eax, %eax
call ...
addl $1, %ebx
cmpl $128, %ebx
jne .L2
Proposed code:
for (int i=0; i < maxcpuloops; i++)
{
call ...
}
Generated assembly code
.L2:
movl %ebx, %edx
movq %rbp, %rsi
movl $1, %edi
xorl %eax, %eax
call ...
addl $1, %ebx
cmpl %ebx, 4(%rsp)
jg .L2
The only noticeable difference is the cmpl instruction that is currently executed using the $128 constant value (256/2) and in the proposed code would use the variable pointed by 4(%rsp). Not a big deal I think, especially if the code can be disabled using the VARIABLE_MAX_CPU_LOOPS macro.
Thanks! Andrew
Hello,
I am trying to use diagnose F09 to measure individual CP instcount, but the results I get don't make sense to me: Issuing it twice with some instructions in between I get a delta of one.
Below a simple test case:
Diagnose results before and after are stored at address 304 and 30C. They show 00000001 and 00000002
Interestingly, when ran in instruction step mode, the program behaves as I would expect.
Thank you