SDL-Hercules-390 / hyperion

The SDL Hercules 4.x Hyperion version of the System/370, ESA/390, and z/Architecture Emulator
Other
240 stars 90 forks source link

CUSE performance improvement #519

Closed JamesWekel closed 1 year ago

JamesWekel commented 1 year ago

Fish,

Here is a proposed performance improvement for the CUSE instruction.

Before the change, the performance test reported:

         1,000,000 iterations of CUSE  took  10,295,109 microseconds
         1,000,000 iterations of CUSE  took  10,727,350 microseconds
         1,000,000 iterations of CUSE  took  11,088,763 microseconds
         1,000,000 iterations of CUSE  took  10,898,278 microseconds
         1,000,000 iterations of CUSE  took  10,432,426 microseconds
         1,000,000 iterations of CUSE  took  85,796,364 microseconds

and after the change:

         1,000,000 iterations of CUSE  took     415,661 microseconds
         1,000,000 iterations of CUSE  took     542,513 microseconds
         1,000,000 iterations of CUSE  took     296,749 microseconds
         1,000,000 iterations of CUSE  took     139,859 microseconds
         1,000,000 iterations of CUSE  took     505,759 microseconds
         1,000,000 iterations of CUSE  took     557,832 microseconds

The CUSE instruction performance is very dependent upon operand strings and substring length. CUSE-02-performance test shows performance increases between 95 - 99%. This increase is probably not representative of real-world instruction usage.

Thank you for your review and updates. I'm sure that there is something that I've missed.

Jim

Fish-Git commented 1 year ago

Thanks, James! I'll take a look at it and get back to you.

JamesWekel commented 1 year ago

Fish,

Performance has improved, but I'm not exactly happy with the implementation. I optimized special cases which leads to a long implementation. The solution needs a second pair of eyes!

Jim

Fish-Git commented 1 year ago

James (@JamesWekel),

Additionally, one more extremely minor required change (in addition to the previously mentioned):

For the CUSE-02-performance.tst script, I would suggest using "300" (the maximum allowed) for your runtest timeout value. Currently you have 500 coded, which is invalid, resulting (prior to my recent fix) in a default value of 30 seconds being used instead (which isn't long enough for the the current version of Hercules that doesn't have your fix).

Interestingly, even 300 seconds (5 minutes) isn't long enough for my system! But since I use a helper script which uses a runtest timeout factor of 2.0, I can manage to squeak by. When I tested current Hercules on my system (without your changes), here are the timings I got:

(current unfixed Hyperion):

         1,000,000 iterations of CUSE  took  23,531,501 microseconds
         1,000,000 iterations of CUSE  took  25,053,053 microseconds
         1,000,000 iterations of CUSE  took  25,340,175 microseconds
         1,000,000 iterations of CUSE  took  25,488,301 microseconds
         1,000,000 iterations of CUSE  took  23,953,078 microseconds
         1,000,000 iterations of CUSE  took 193,951,645 microseconds

So apparently your system is much faster than mine!   :)

Anyway...  I hope I didn't scare you off this project with my previous critique of your code.  :(

You just need to make a few changes and you should be good to go. I finished reviewing the rest of your code (all of your helper functions as well as the primary CUSE instruction function itself too) and everything else looks good!

So just make those suggested fixes I mentioned (about detecting when you've crossed over to a new page) and I'll be happy to accept your pull request.

Thanks!

JamesWekel commented 1 year ago

Fish,

Sorry for the late reply; busy with other life items.

Thank you! Thank you! Thank you! How the (beep) (beep) did I code PAGEFRAME_PAGEMASK when I thought I had coded PAGEFRAME_BYTEMASK!! I definitely needed to have a second pair of eyes as I wasn't seeing what was on the page.

The current CUSE implementation includes the following interrupt coding:

        /* update GPRs if we just crossed half page - could get rupt */
        if ((addr1 & 0x7FF) == 0 || (addr2 & 0x7FF) == 0)
        {
            /* Update R1 and R2 to point to next bytes to compare */
            SET_GR_A( r1, regs, addr1 );
            SET_GR_A( r2, regs, addr2 );

            /* Set R1+1 and R2+1 to remaining operand lengths */
            SET_GR_A( r1+1, regs, len1 );
            SET_GR_A( r2+1, regs, len2 );
        }

and yes, I "read in the manual that CUSE was an interruptible instruction". So, it must be necessary but, with the optimizations, I would never know when a half page was crossed, therefore I just changed when the registers were saved. I now have some more reading on Hercules interrupts!

Thank you for the CUSE patch. I do like the simpler look for page crossing checks. It is much more explicit in the code; not relying on a comment and how page offset bits are going to change when crossing a page. To make the code even more explicit, I've added the following macro

#undef  MAINSTOR_PAGEBASE
#define MAINSTOR_PAGEBASE( _ma )  ((BYTE*) ((uintptr_t) ( _ma ) & PAGEFRAME_PAGEMASK))

to further simplify the look of crossed page checks. I'll do another review this afternoon. You should have a commit tonight.

I've changed CUSE-02-performance.tst to have a timeout of 300. My performance tests were done using a 11th Gen Intel Core i5-1135G7 @ 2.40GHz.

Anyway... I hope I didn't scare you off this project with my previous critique of your code.

Not at all. :-) I'm working on the TRTR optimization and should have a pull request to you on the weekend.

Thanks again for the review and the second pair of eyes,

Jim

Fish-Git commented 1 year ago

Thank you! Thank you! Thank you! How the (beep) (beep) did I code PAGEFRAME_PAGEMASK when I thought I had coded PAGEFRAME_BYTEMASK!! I definitely needed to have a second pair of eyes as I wasn't seeing what was on the page.

It happens.  :)   Which is why it's always a good idea to have someone else review one's code. Hell, I'm constantly re-reviewing (and re-re-reviewing) my own code all the time! It also helps doing so at a much later point in time too (e.g. several weeks or months later). You'd be surprised how often during re-reviewing months later that you see things you never noticed before when you first wrote it.

The current CUSE implementation includes the following interrupt coding: [...] So, it must be necessary ...

Yeah, I saw that myself too. But you need to keep in mind that Hercules's current implementation of a given instruction/function isn't necessarily "gospel" (sacrosanct). A good programmer should always question existing code by asking oneself "Is this right?", "Is this necessary?", "Is this the best way to do this?". As it turns out in this particular case (as I explained in my earlier comment) it's not actually needed. It's a complete waste of time. The only time an interrupt can occur is if/when the instruction purposely exits (ends), which is obviously not occurring in this case.

I now have some more reading on Hercules interrupts!

A basic understanding certainly wouldn't hurt, but it's not critical. I wouldn't recommend spending a lot of time on it unless you were interested in examining that part of Hercules's functionality. For most instruction coding however, it's not that critical. I only tried explaining it so you could see (understand) how silly and unnecessary such code was.

Anyway... I hope I didn't scare you off this project with my previous critique of your code.

Not at all. :-)

Good! I was slightly worried. Glad we're still cool.  :)

I'm working on the TRTR optimization and should have a pull request to you on the weekend.

COOL! I look forward to seeing it! You're doing great work, James! MUCH appreciated!

Thanks again for the review and the second pair of eyes,

Not a problem.

Fish-Git commented 1 year ago

My performance tests were done using a 11th Gen Intel Core i5-1135G7 @ 2.40GHz.

My system has 3.0 GHz Xeon X5570s. I have no idea which "generation" they are. (I don't concern myself with such things.)

I think maybe the reason my numbers might be slower than yours is because:

  1. My system's processors are older than yours.
  2. I build Hercules with Visual Studio 2008.

I'm guessing you're using a much newer version of Visual Studio than me (2022? 2019?) and thus are using a version of Microsoft's compiler that is better able to leverage the newer instructions/features that your more modern processor has (whereas mine, having older processors and an older compiler, must resort to slightly less efficient instructions common with older processors).

But that's just a guess.

ONE of these days I'll upgrade my system to newer hardware and then maybe at that time I'll upgrade my Visual Studio as well. But right now I'm still poor and money is tight, so I'll probably be sticking with what I have for a few more years yet. (It's a nice system! I really like it! It's a Dell Precision T7500)

wrljet commented 1 year ago

You can compare almost any combination of CPUs on this site:

https://cpu.userbenchmark.com/Compare/Intel-Xeon-X5570-vs-Intel-Core-i5-1135G7/m12230vsm1286124

Bill

JamesWekel commented 1 year ago

Fish,

Let's try this again with a cross page fixes!

For my current environment, I'm using a 'small' Intel NUC with Debian as a server with "gcc (Debian 10.2.1-6) 10.2.1" for Hercules development. I access the environment using Visual Studio Code with remote explorer from a really old Intel NUC with a slow dual core i3-4010U CPU running Windows 10. I have Visual Studio 2022 on the I3 NUC just to check that I haven't broken the VS project files. The server NUC is newer (faster CPU and memory) so that's probably where I'm getting better results.

With the fixed cross page checks, the performance test now reports:

     1,000,000 iterations of CUSE  took     487,620 microseconds
     1,000,000 iterations of CUSE  took     544,157 microseconds
     1,000,000 iterations of CUSE  took     327,191 microseconds
     1,000,000 iterations of CUSE  took     176,237 microseconds
     1,000,000 iterations of CUSE  took     547,471 microseconds
     1,000,000 iterations of CUSE  took     608,836 microseconds

Hope your final review doesn't find any more major errors!

Jim

Fish-Git commented 1 year ago

Hope your final review doesn't find any more major errors!

Nope! Looks good to me! Pull Request accepted and merged!

Thanks again, James!