SDL-Hercules-390 / hyperion

The SDL Hercules 4.x Hyperion version of the System/370, ESA/390, and z/Architecture Emulator

TRE instruction performance #498

Closed JamesWekel closed 1 year ago

JamesWekel commented 2 years ago

Fish,

I've reworked the TRE instruction to improve performance. Included are two tests:

  1. TRE-01-basic: runs basic functional tests.
  2. TRE-02-performance: runs only the TRE performance tests. I've included the following comment in the test to highlight that the performance tests need to be enabled manually.
#  ---------------------------------------------------------------------------------- 
#  This ONLY tests the performance of the TRE instruction.
#
#  The default is to NOT run the performance tests. To enable
#  the performance tests, uncomment the "#r           21fd=ff "
#  line below.
#
#        Tests:
#              1. TRE of 512 bytes
#              2. TRE of 512 bytes that crosses a page boundary,
#                 which results in CC=3, and a branch back
#                 to complete the TRE instruction. So, 2 TREs
#                 are executed, compared to 1 in test 1.
#              3. TRE of 2048 bytes
#              4. TRE of 2048 bytes that crosses a page boundary, 
#                 which results in CC=3, and a branch back
#                 to complete the TRE instruction
#        Output: 
#               For each test, a console line will be generated with timing results,
#               as follows:
#               /         1,000,000 iterations of TRE   took     258,117 microseconds
#               /         1,000,000 iterations of TRE   took     305,606 microseconds
#               /         1,000,000 iterations of TRE   took   1,016,256 microseconds
#               /         1,000,000 iterations of TRE   took   1,056,531 microseconds 
#  ----------------------------------------------------------------------------------

Before the performance improvement, the results on my system were:

/         1,000,000 iterations of TRE   took   8,088,107 microseconds
/         1,000,000 iterations of TRE   took   8,077,128 microseconds
/         1,000,000 iterations of TRE   took  32,261,101 microseconds
/         1,000,000 iterations of TRE   took  32,724,749 microseconds

and after the improvement:

/         1,000,000 iterations of TRE   took     279,126 microseconds
/         1,000,000 iterations of TRE   took     308,843 microseconds
/         1,000,000 iterations of TRE   took   1,036,633 microseconds
/         1,000,000 iterations of TRE   took   1,055,807 microseconds

So, about a 96% reduction in elapsed time on every test, or roughly 26 to 31 times faster.
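The figure can be checked directly from the before and after timings above. The following is only a small arithmetic sketch; the helper names `percent_reduction` and `speedup` are hypothetical and are not part of Hercules or of the test scripts.

```c
#include <assert.h>

/* Hypothetical helper: percentage reduction in elapsed time,
 * given "before" and "after" timings in microseconds. */
static double percent_reduction(double before_us, double after_us)
{
    assert(before_us > 0.0);
    return 100.0 * (before_us - after_us) / before_us;
}

/* Hypothetical helper: how many times faster the new code is. */
static double speedup(double before_us, double after_us)
{
    assert(after_us > 0.0);
    return before_us / after_us;
}
```

For test 1, `percent_reduction(8088107, 279126)` comes out a little above 96, and `speedup(8088107, 279126)` is just under 29.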

As part of the TRE improvement, I changed when CC=3 occurs: TRE now returns CC=3 when it reaches a page boundary while processing operand 1. On each execution, operand 2 is always copied to a local translate table. So I am really surprised by how small the timing difference is between performance tests 1 and 2, and between tests 3 and 4, since tests 2 and 4 each execute two TRE instructions because operand 1 crosses a page boundary. Modern processor L1 caches are amazing!
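To make the CC=3 behaviour concrete, here is a minimal sketch of the idea in C. It is not the Hercules implementation: the function names `tre_sketch` and `tre_until_done`, the flat host buffer, and the 4K `PAGE_SIZE` are all assumptions made for illustration. It copies the translate table locally on every call, stops at the operand 1 page boundary with CC=3, and lets the caller "branch back" to finish.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u   /* assumed page size for this sketch */

/* Hypothetical stand-in for one TRE execution (not Hercules code).
 * buf        : host bytes of operand 1
 * guest_addr : guest address of buf[0], used only to locate the page end
 * len        : in: bytes remaining, out: bytes still unprocessed
 * done       : out: bytes processed by this call
 * Returns 0 (operand complete), 1 (test byte found), 3 (page boundary hit). */
static int tre_sketch(uint8_t *buf, uint64_t guest_addr, size_t *len,
                      const uint8_t *op2_table, uint8_t test_byte,
                      size_t *done)
{
    uint8_t table[256];
    size_t to_page_end = PAGE_SIZE - (size_t)(guest_addr % PAGE_SIZE);
    size_t chunk = (*len < to_page_end) ? *len : to_page_end;
    size_t i;

    assert(buf != NULL && op2_table != NULL);
    memcpy(table, op2_table, sizeof table);   /* local translate table */

    for (i = 0; i < chunk; i++) {
        if (buf[i] == test_byte) {            /* stop before the test byte */
            *len -= i;
            *done = i;
            return 1;
        }
        buf[i] = table[buf[i]];
    }
    *len -= chunk;
    *done = chunk;
    return (*len == 0) ? 0 : 3;               /* 3: caller must re-execute */
}

/* Re-executes the sketch while CC=3, counting executions in *calls. */
static int tre_until_done(uint8_t *buf, uint64_t guest_addr, size_t len,
                          const uint8_t *op2_table, uint8_t test_byte,
                          unsigned *calls)
{
    int cc;
    *calls = 0;
    do {
        size_t done;
        cc = tre_sketch(buf, guest_addr, &len, op2_table, test_byte, &done);
        buf += done;
        guest_addr += done;
        (*calls)++;
    } while (cc == 3);
    return cc;
}
```

With a 512-byte operand that starts on a page boundary, `tre_until_done` reports one execution. If the operand starts 256 bytes before the end of a page, it reports two, which mirrors the difference between performance tests 1 and 2.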

I know that you are swamped, so please review whenever you can get to it.

Thanks, Jim.