I've reworked the TRE instruction to improve performance. Included are two tests:
TRE-01-basic: basic functional tests.
TRE-02-performance: TRE performance tests only. I've included the following comment in the test to highlight that the performance tests need to be manually enabled:
# ----------------------------------------------------------------------------------
# This ONLY tests the performance of the TRE instruction.
#
# Performance tests are NOT run by default. To enable
# them, uncomment the "#r 21fd=ff "
# line below.
#
# Tests:
# 1. TRE of 512 bytes
# 2. TRE of 512 bytes that crosses a page boundary,
# which results in CC=3, and a branch back
# to complete the TRE instruction. So, 2 TREs
# are executed, compared to 1 in test 1.
# 3. TRE of 2048 bytes
# 4. TRE of 2048 bytes that crosses a page boundary,
# which results in CC=3, and a branch back
# to complete the TRE instruction
# Output:
# For each test, a console line will be generated with timing results,
# as follows:
# / 1,000,000 iterations of TRE took 258,117 microseconds
# / 1,000,000 iterations of TRE took 305,606 microseconds
# / 1,000,000 iterations of TRE took 1,016,256 microseconds
# / 1,000,000 iterations of TRE took 1,056,531 microseconds
# ----------------------------------------------------------------------------------
Before the performance improvement, the results on my system were:
/ 1,000,000 iterations of TRE took 8,088,107 microseconds
/ 1,000,000 iterations of TRE took 8,077,128 microseconds
/ 1,000,000 iterations of TRE took 32,261,101 microseconds
/ 1,000,000 iterations of TRE took 32,724,749 microseconds
and after the improvement:
/ 1,000,000 iterations of TRE took 279,126 microseconds
/ 1,000,000 iterations of TRE took 308,843 microseconds
/ 1,000,000 iterations of TRE took 1,036,633 microseconds
/ 1,000,000 iterations of TRE took 1,055,807 microseconds
So, roughly a 96-97% reduction in elapsed time across all four tests (about 26x to 31x faster).
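For anyone who wants to check the arithmetic, here is a minimal sketch in C that derives the percentage reduction and the speedup factor from a before/after pair of timings. The function names `pct_reduction` and `speedup` are made up for illustration; they are not part of the tests or of the TRE code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: percentage reduction in elapsed time,
   given before/after timings in microseconds. */
double pct_reduction(uint64_t before_us, uint64_t after_us)
{
    assert(before_us > 0);
    return 100.0 * (1.0 - (double)after_us / (double)before_us);
}

/* Hypothetical helper: how many times faster the new code is. */
double speedup(uint64_t before_us, uint64_t after_us)
{
    assert(after_us > 0);
    return (double)before_us / (double)after_us;
}
```

For test 1, `pct_reduction(8088107, 279126)` comes out at about 96.5, and `speedup(8088107, 279126)` at about 29.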
As part of the TRE improvement, I changed when CC=3 occurs: TRE now returns CC=3 when a page boundary is encountered while processing operand 1. On every execution, operand 2 is copied to a local translate table. So I am really surprised at how small the timing difference is between performance tests 1 and 2, and between tests 3 and 4, given that tests 2 and 4 each execute two TRE instructions because operand 1 crosses a page boundary. Modern L1 processor caches are amazing!
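To make the CC=3 behavior concrete, here is a simplified sketch in C of the approach described above. It is not the actual code: it assumes a flat byte array standing in for guest storage, a 4K page size, and a hypothetical function name `tre_sketch`; address translation, access checks, and register handling are all left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TRE_PAGE_SIZE 4096u   /* assumption: 4K pages */

/* Simplified model of one TRE execution (illustrative only).
   mem       : flat array standing in for guest storage
   addr, len : operand 1 address and remaining length (updated on return)
   table2    : operand 2, the 256-byte translate table
   test_byte : byte that stops translation when found in operand 1
   Returns the condition code: 0 = done, 1 = test byte found,
   3 = stopped at a page boundary, re-execute to continue. */
int tre_sketch(uint8_t *mem, size_t *addr, size_t *len,
               const uint8_t *table2, uint8_t test_byte)
{
    uint8_t local_table[256];

    assert(mem != NULL && addr != NULL && len != NULL && table2 != NULL);

    /* Operand 2 is copied to a local translate table on every execution. */
    memcpy(local_table, table2, sizeof local_table);

    /* Process at most up to the end of operand 1's current page. */
    size_t to_boundary = TRE_PAGE_SIZE - (*addr % TRE_PAGE_SIZE);
    size_t chunk = (*len < to_boundary) ? *len : to_boundary;

    for (size_t i = 0; i < chunk; i++) {
        uint8_t b = mem[*addr + i];
        if (b == test_byte) {          /* stop at the test byte: CC=1 */
            *addr += i;
            *len  -= i;
            return 1;
        }
        mem[*addr + i] = local_table[b];
    }

    *addr += chunk;
    *len  -= chunk;
    return (*len == 0) ? 0 : 3;        /* CC=3: page boundary reached */
}
```

A caller would loop while the return value is 3, which mirrors the branch back in tests 2 and 4. With a 512-byte operand starting 256 bytes before a page boundary, the first call returns 3 with 256 bytes remaining and the second call returns 0.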
I know that you are swamped so please review whenever you get to it.
Fish,
Thanks, Jim.