Open paulb-nl opened 2 years ago
Nice test! Would it be possible to color the text green/red to show whether each result is correct or incorrect compared to the hw results?
That would get a bit complicated with the 2 different chip versions and 8 setting combinations. It is also complicated to decide if a small difference is acceptable because some tests like the plot tests need to be accurate to 1/8th of a cycle. The cycle count can change after 8 plot instructions because then it will write the pixel data to ram.
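To make the size of that comparison concrete, here is a minimal Python sketch that enumerates the setting combinations. The three toggles (clock, high speed multiplier, cache) come from the test ROM's button list; the label strings and the `SETTINGS` name are only illustrative.

```python
from itertools import product

# Toggles exposed by the test ROM; the label strings are illustrative only.
CLOCKS = ("10MHz", "21MHz")
MULTIPLIER = ("MS0", "MS1")
CACHE = ("No cache", "Cache on")

# Every combination would need its own reference values per chip version.
SETTINGS = list(product(CLOCKS, MULTIPLIER, CACHE))

for clock, ms, cache in SETTINGS:
    print(f"{clock}, {ms}, {cache}")
```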
Here is a comparison of some differences. These are not all the differences but I think it is enough for now :)
The cycles mentioned below with the 10MHz tests are 10MHz cycles so 1 cycle = 2x 21MHz cycles.
MiSTer vs Stunt Race FX (GSU1):
10MHz, MS0, No cache: Everything is too fast.

- NOP: $72F - $4C8 = $267 = 3 cycles too fast
- ADC # (2 NOPS): $993 - $660 = $333 = 4 cycles too fast
- MiSTer NOP vs 2 NOPS: $660 - $4C8 = $198 = 2 cycles
- GSU NOP vs 2 NOPS: $993 - $72F = $264 = 3 cycles
10MHz, MS0, Cache on

- FMULT: $8C9 - $7F9 = $D0 = 1 cycle too fast
- GETB*: $7FC - $662 = $19A = 2 cycles too fast
- GETB_2: $730 - $595 = $19B = 2 cycles too fast
- LDB: $663 - $595 = $CE = 1 cycle too fast
- LDW: $730 - $661 = $CF = 1 cycle too fast
- LM: $994 - $8C5 = $CF = 1 cycle too fast
- LMS: $8C7 - $7F8 = $CF = 1 cycle too fast
- LMULT: $994 - $8C5 = $CF = 1 cycle too fast
- SBK: $4CB - $3FE = $CD = 1 cycle too fast
- SM: $663 - $595 = $CE = 1 cycle too fast
- SMS: $597 - $4C9 = $CE = 1 cycle too fast
- STW: $4CB - $3FD = $CE = 1 cycle too fast
10MHz, MS1, Cache on

- FMULT: $598 - $4C9 = $CF = 1 cycle too fast
- LMULT: $663 - $595 = $CE = 1 cycle too fast
10MHz PLOT, Cache on

- PLOT 4 color: $267 - $29A = -$33 = 0.25 cycles too slow (2 cycles every 8 plots?)
- PLOT 16 color: $266 - $2FE = -$98 = 0.75 cycles too slow (6 cycles every 8 plots?)
- PLOT 256 color: $280 - $3CA = -$14A = 1.625 cycles too slow (13 cycles every 8 plots?)
The PLOT -> LOOP -> NOP loop takes 3 cycles, so 8 plots take 8 x 3 = 24 cycles. That is enough time to save the secondary pixel cache to RAM for 4 & 16 color data without waiting, so PLOT should only take 1 cycle. For 256 color, PLOT is 0.125 cycles slower ($280 vs $266), so it seems to wait 1 cycle every 8 plots.
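The 24-cycle budget above can be checked with a rough sketch. It assumes one byte per bitplane for a row of 8 pixels (2, 4 and 8 bytes) and that each byte costs exactly one 3-cycle RAM access at 10MHz with no other overhead; neither assumption is confirmed by the measurements, and `flush_budget` is a made-up helper name.

```python
# Rough budget for flushing the pixel cache during a PLOT -> LOOP -> NOP loop.
# Assumptions (not confirmed): one byte per bitplane for 8 pixels, and each
# byte costs exactly one RAM access with no extra overhead.
LOOP_CYCLES = 3          # PLOT + LOOP + NOP, as stated in the thread
PLOTS_PER_FLUSH = 8      # the pixel cache is saved after 8 plots
RAM_ACCESS_CYCLES = 3    # RAM access time at 10MHz

BITPLANES = {"4 color": 2, "16 color": 4, "256 color": 8}

def flush_budget(mode, ram_cycles=RAM_ACCESS_CYCLES):
    """Return (cycles available, cycles needed) for one pixel cache flush."""
    available = PLOTS_PER_FLUSH * LOOP_CYCLES
    needed = BITPLANES[mode] * ram_cycles
    return available, needed

for mode in BITPLANES:
    available, needed = flush_budget(mode)
    print(f"{mode}: {needed} of {available} cycles used")
```

Under these assumptions 4 and 16 color fit comfortably, while 256 color uses the whole window, which would be compatible with the small extra wait seen only in that mode.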
PLOT with color #$FC should be treated as no-plot in 4 color transparent mode since low 2 bits are zero.
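A minimal sketch of that check in Python, covering only the 4-color case described here. The function name is hypothetical and this is not the core's VHDL.

```python
def is_no_plot_4color(color: int) -> bool:
    """In 4-color transparent mode only the low 2 bits decide transparency."""
    return (color & 0b11) == 0

print(is_no_plot_4color(0xFC))  # low 2 bits are 00
print(is_no_plot_4color(0xFD))  # low 2 bits are 01
```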
21MHz, MS0, No cache

- FMULT: $AC5 - $7F8 = $2CD = 7 cycles too fast
- GETB*: $CC4 - $BF4 = $D0 = 2 cycles too fast
- GETB_2: $AC6 - $9F6 = $D0 = 2 cycles too fast
- LDB: $A60 - $C5A = -$1FA = 5 cycles too slow
- LDW: $9F9 - $C5A = -$261 = 6 cycles too slow
- LM: $FF4 - $1253 = -$25F = 6 cycles too slow
- LMS: $DF6 - $1055 = -$25F = 6 cycles too slow
- LMULT: $CC4 - $9F6 = $2CE = 7 cycles too fast
- MULT: $861 - $7F8 = $69 = 1 cycle too fast
- SBK: $BF8 - $C5A = -$62 = 1 cycle too slow
- SM: $FF4 - $1055 = -$61 = 1 cycle too slow
- SMS: $DF6 - $E58 = -$62 = 1 cycle too slow
- STW: $9F9 - $A5C = -$63 = 1 cycle too slow
- UMULT: $861 - $7F8 = $69 = 1 cycle too fast
21MHz, MS1, No cache

- FMULT: $92D - $7F8 = $135 = 3 cycles too fast
- LMULT: $B2B - $9F6 = $135 = 3 cycles too fast
21MHz, MS0, Cache on

- FMULT: $466 - $3FD = $69 = 1 cycle too fast
- GETB*: $4CB - $3FE = $CD = 2 cycles too fast
- GETB_2: $465 - $397 = $CE = 2 cycles too fast
- LDW: $531 - $595 = -$64 = 1 cycle too slow
- LM: $663 - $6C7 = -$64 = 1 cycle too slow
- LMS: $5FD - $661 = -$64 = 1 cycle too slow
- LMULT: $4CB - $463 = $68 = 1 cycle too fast
- SBK: $3FF - $463 = -$64 = 1 cycle too slow
- SM: $4CB - $52F = -$64 = 1 cycle too slow
- SMS: $465 - $4C9 = -$64 = 1 cycle too slow
- STW: $3FF - $463 = -$64 = 1 cycle too slow
21MHz, MS1, Cache on

- FMULT: $2CD - $3FD = -$130 = 3 cycles too slow
- LMULT: $332 - $463 = -$131 = 3 cycles too slow
21MHz PLOT, Cache on

- PLOT 4 color: $134 - $19B = -$67 = 1 cycle too slow (8 cycles every 8 plots?)
- PLOT 16 color: $133 - $218 = -$E5 = 2.25 cycles too slow (18 cycles every 8 plots?)
- PLOT 256 color: $20C - $317 = -$10B = 2.625 cycles too slow (21 cycles every 8 plots?)
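For anyone re-checking these numbers, the conversion from loop counts to cycles can be sketched as follows. It uses the roughly $66 loops per 21MHz cycle quoted in the test description and doubles that for 10MHz. It also assumes, based on the NOP lines above, that the first value in each subtraction is the hardware capture and the second is MiSTer. The function name is illustrative.

```python
LOOPS_PER_21MHZ_CYCLE = 0x66  # about 102 loops per 21MHz cycle

def cycle_difference(hardware: int, mister: int, clock_mhz: int) -> float:
    """Hardware minus MiSTer, in GSU cycles; positive means MiSTer is too fast."""
    # One 10MHz cycle lasts as long as two 21MHz cycles.
    loops_per_cycle = LOOPS_PER_21MHZ_CYCLE * (2 if clock_mhz == 10 else 1)
    return (hardware - mister) / loops_per_cycle

# Values taken from the lists above.
print(round(cycle_difference(0x72F, 0x4C8, 10), 2))  # NOP, 10MHz, no cache
print(round(cycle_difference(0xA60, 0xC5A, 21), 2))  # LDB, 21MHz, no cache
print(round(cycle_difference(0x267, 0x29A, 10), 2))  # PLOT 4 color, 10MHz
```

Fractional results for PLOT can be multiplied by 8 to get the extra cycles per group of 8 plots.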
If I remember right, the GSU code was written as a functional analog, not a cycle-accurate one. So most likely it needs to be reworked for cycle accuracy.
With this list it may seem that not much is accurate but many of the instructions in 21MHz mode (and 10Mhz with cache) are accurate.
Almost all of the instructions that are not accurate are about reading/writing from ROM/RAM and the multiplier instructions.
Fixed some timings. I do not yet understand the logic of the `rpix` and `ljmp` instructions.
Some `ljmp` and `rpix` info for quick reference:
from https://en.wikibooks.org/wiki/Super_NES_Programming/Super_FX_tutorial#Instruction_Set
Instruction | Description | ALT(Hex) | CODE(HEX) | ARG | Length(B) | B | ALT1 | ALT2 | O/V | S | CY | Z | ROM | RAM | Cache | Classification | Note |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LJMP | Long jump | 3D | 0x9 | Rn | 2 | 0 | 0 | 0 | / | / | / | / | 6 | 6 | 2 | Jump, Branch and Loop Instructions | |
RPIX | Read pixel color | 3D | 0x4C | / | 2 | 0 | 0 | 0 | / | * | / | * | 24-80 | 24-78 | 20-74 | Plot/related instructions | |
ROM/RAM/Cache columns are execution time in cycles.
LJMP seems pretty tight. o_O
Thanks @srg320. I have some findings.
`RAM_CYCLES` for 10MHz should be `"010"` instead of `"001"`. Otherwise it will access RAM with only 2 cycles instead of 3.
https://github.com/MiSTer-devel/SNES_MiSTer/blob/a6daf9b51ffc8777b04a161098884a642bc4c516/rtl/chip/GSU/GSU.vhd#L680-L681
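A tiny sketch of why the preset matters. It assumes `RAM_CYCLES` is a countdown preset where the access lasts the preset value plus one cycle. That model is only inferred from the two numbers quoted above, not from reading the VHDL, and `ram_access_cycles` is a hypothetical helper.

```python
def ram_access_cycles(ram_cycles_bits: str) -> int:
    """Assumed model: the counter runs from its preset down to zero, inclusive."""
    return int(ram_cycles_bits, 2) + 1

print(ram_access_cycles("001"))  # current preset
print(ram_access_cycles("010"))  # proposed preset
```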
4-color transparency should only check the lower 2 bits, so this should be added: `if COLR(1 downto 0) /= "00"`
https://github.com/MiSTer-devel/SNES_MiSTer/blob/a6daf9b51ffc8777b04a161098884a642bc4c516/rtl/chip/GSU/GSU.vhd#L1123-L1131
I did some tests to figure out the PLOT pixel cache save logic:
`PLOT` will save the pixel cache to RAM after 8 PLOTs if it is full, not at the 9th PLOT.
If executing from ROM or Cache and the pixel cache is being saved to RAM and it executes an `STB` or `STW` instruction to write to RAM, then the pixel cache save is paused and continues after the RAM write buffer is finished. This is probably the same for the other instructions that use the RAM write buffer, like SM, SMS, SBK.
For example executing the loop STB->PLOT->LOOP->NOP will only take 5 cycles @ 10Mhz because it doesn’t wait for the RAM writes. It must be interrupting the pixel cache save at the end of writing a byte because otherwise both pixel caches would fill up and PLOT would go into wait state.
Here are some test roms. `sfx_stb` will use STB to write to RAM while the pixel cache is writing to RAM and reads the values after the SFX is stopped. The value $FF means the pixel cache write has overwritten the data written by STB. There is a cache instruction before the STB writes, so you can ignore the NO CACHE text in the test rom.
`sfx_speed_test_stb_plot` has removed some tests to add two STB/STW PLOT speed tests. The result of the STB PLOT test at 10MHz with Cache On is $3FE-$400 for 4, 16 & 256 color. This is only 2 cycles more than the PLOT tests and STB is a 2-cycle opcode, so that means it didn't wait.
sfx_stb.zip sfx_speed_test_stb_plot.zip
Reference captures:
> I did some tests to figure out the PLOT pixel cache save logic:
>
> `PLOT` will save the pixel cache to RAM after 8 PLOTs if it is full, not at the 9th PLOT.
That's interesting. Thanks.
> If executing from ROM or Cache and the pixel cache is being saved to RAM and it executes an `STB` or `STW` instruction to write to RAM, then the pixel cache save is paused and continues after the RAM write buffer is finished. This is probably the same for the other instructions that use the RAM write buffer, like SM, SMS, SBK.
>
> For example executing the loop STB->PLOT->LOOP->NOP will only take 5 cycles @ 10MHz because it doesn't wait for the RAM writes. It must be interrupting the pixel cache save at the end of writing a byte because otherwise both pixel caches would fill up and PLOT would go into wait state.
I agree: executing a RAM write instruction does not stall the queue of following instructions until another RAM access appears. This is implemented in the core in the last commit.
I am also interested in the ROM access time when the cache is being loaded. I suspect that this time is shorter than the time to load a byte from ROM.
The tests on the first page at 21Mhz with Cache on seem to be all fixed. The plot tests also look good. 21Mhz without Cache and 10Mhz still need to be fixed.
However, the latest fixes caused everything executing from ROM at 21MHz to be 2 cycles too slow: it went from 5 to 7 cycles per byte. I have attached a test rom that runs the SFX code from ROM. Most results without cache should be the same as in the version that runs from Cart RAM, except for instructions that access RAM/ROM. For example, `PLOT` without cache should be faster executing from ROM than from RAM.
Unfortunately I am unable to make reference captures for the ROM versions because that would need a modified Super FX cartridge.
> I agree: executing a RAM write instruction does not stall the queue of following instructions until another RAM access appears. This is implemented in the core in the last commit.
Ok, but I meant the RAM write buffer will have priority and will pause the pixel cache write. I will give an example from my test:

    ibt R0, #$34
    iwt R3, #$1031
    plots 7
    cache
    plot        ; 8th plot, start pixel cache write (256-color 8 bytes)
    stb (R3)    ; pause pixel cache write, RAM buffer will write $34 to $701031
    inc R0
                ; pixel cache will overwrite $701031 ($34) with $FF
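The ordering described in that example can be sketched as a small replay of RAM writes. This is only a model of the claimed priority rule (STB is served at the next byte boundary, then the pixel cache save resumes). The eight pixel cache addresses are made up, apart from the fact that $1031 is one of them, and `simulate` is a hypothetical helper.

```python
def simulate(pixel_writes, stb_write, stb_after):
    """Replay RAM writes where the STB write pauses the pixel cache save.

    pixel_writes: list of (address, value) the pixel cache save will perform.
    stb_write:    (address, value) written through the RAM write buffer.
    stb_after:    number of pixel cache bytes already written when STB arrives.
    """
    ram = {}
    order = pixel_writes[:stb_after] + [stb_write] + pixel_writes[stb_after:]
    for address, value in order:
        ram[address] = value  # later writes overwrite earlier ones
    return ram

# Hypothetical 256-color flush: 8 bytes of $FF. The real addresses depend on
# the bitplane layout; these are invented, except that $1031 is among them.
OFFSETS = (0x00, 0x01, 0x10, 0x11, 0x20, 0x21, 0x30, 0x31)
pixel_writes = [(0x1000 + offset, 0xFF) for offset in OFFSETS]
stb = (0x1031, 0x34)

ram = simulate(pixel_writes, stb, stb_after=1)
print(hex(ram[0x1031]))
```

With `stb_after=1` the pixel cache byte for $1031 lands after the STB and the test reads back $FF. If the save had already finished (`stb_after=8`), the $34 would survive.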
> I am also interested in the ROM access time when the cache is being loaded. I suspect that this time is shorter than the time to load a byte from ROM.
Which ROM access do you mean? As far as I know ROM access is the same as RAM: 3 cycles at 10MHz and 5 cycles at 21MHz. The `GETB` instructions test ROM reading, so we know what the results should be.
> Ok, but I meant the RAM write buffer will have priority and will pause the pixel cache write. I will give an example from my test:
>
>     ibt R0, #$34
>     iwt R3, #$1031
>     plots 7
>     cache
>     plot        ; 8th plot, start pixel cache write (256-color 8 bytes)
>     stb (R3)    ; pause pixel cache write, RAM buffer will write $34 to $701031
>     inc R0
>                 ; pixel cache will overwrite $701031 ($34) with $FF
Ok. I wonder what the result would be if you add one or two `nop` before `stb (R3)`.
> As far as I know ROM access is the same as RAM: 3 cycles at 10MHz and 5 cycles at 21MHz. The `GETB` instructions test ROM reading, so we know what the results should be.
From this test you can see that in the Load/Store Word to/from RAM commands the second (MSB) access is shorter by 1 cycle. Perhaps when loading the cache (16 bytes sequential access) the access time is less than 5 cycles (some kind of burst mode).
Here is a test that measures how long it takes for an instruction to complete. It counts in a loop until the SFX is stopped, so higher numbers mean it took longer. Small differences don't matter so much. One 21MHz cycle results in a difference of around $66 (102) loops. For example, `nop` is $0134 in 21MHz cache mode, which is 1 cycle. `add #` is 2 cycles and results in $0199 loops.

It can be run on an original Super FX cart by swapping the cartridge while the console is on. The code runs in WRAM on SNES and Cart RAM/Cache on Super FX.
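As a rough aid for reading the loop counts, the sketch below turns a raw count into an instruction cycle estimate. It assumes every test carries the same fixed overhead, which it derives from the `nop` figure quoted above; that overhead works out to about two cycles, plausibly the rest of the test loop, but this is a guess. The names are illustrative.

```python
LOOPS_PER_CYCLE = 0x66   # about 102 loops per 21MHz cycle
NOP_LOOPS = 0x0134       # 21MHz, cache on; NOP is 1 cycle
NOP_CYCLES = 1

# Assumption: every test shares the same fixed overhead in loops.
OVERHEAD_LOOPS = NOP_LOOPS - NOP_CYCLES * LOOPS_PER_CYCLE

def estimate_cycles(loops: int) -> float:
    """Estimate instruction cycles from a raw loop count (21MHz, cache on)."""
    return (loops - OVERHEAD_LOOPS) / LOOPS_PER_CYCLE

print(round(estimate_cycles(0x0199), 2))  # add #
```

Rounding the result for `add #` gives the expected 2 cycles; small deviations of a few loops are measurement noise.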
Here are reference captures of a StarFox cart (Mario Chip), Stunt Race FX (GSU1) and Yoshi's Island (GSU2) https://drive.google.com/drive/folders/15ac9U-x__n0AgOlWa3FGo5eEMShZYl5g?usp=sharing
The Mario Chip (v1) is unstable with reading/writing to Cart RAM. Some tests timeout which doesn't happen with the GSU chips.
Another difference with the Mario Chip is that the cache opcode will work immediately with GSU while it seems the Mario Chip needs 16 bytes to fill first so not all instructions are faster in this test with the StarFox cart.
The `ljmp` instruction is also quite weird. It takes much longer on the GSU chip than on the Mario Chip. Not sure what's going on there.

With cache off, the MiSTer core runs faster in 10MHz than 21MHz, which is strange.
Buttons:

- Left/Right: Switch to different tests
- Select: Toggle 10/21MHz
- Y: Toggle High speed multiplier
- B: Toggle Cache
MiSTer captures: sfx_test_MiSTer_captures.zip https://drive.google.com/drive/folders/1noo2pRPoexCtVPgqSbzaexr61WOvjHNW?usp=sharing
Test rom: SuperFX.sfc.zip
Source: https://github.com/paulb-nl/sfx_speed_test