Super FX speed test - Githubissues

paulb-nl commented 2 years ago

Here is a test that measures how long an instruction takes to complete. It counts in a loop until the SFX is stopped, so higher numbers mean the instruction took longer. Small differences don't matter much. One 21MHz cycle results in a difference of around $66 (102) loops. For example, nop gives $0134 loops in 21MHz cache mode, which is 1 cycle; add # is 2 cycles and gives $0199 loops.
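As a rough sketch of how to read these numbers, a difference in loop counts can be divided by the loops-per-cycle figure quoted above. The function name and constant below are made up for illustration and are not part of the test ROM.

```python
# About 102 loops per 21MHz cycle, as quoted in the post above.
LOOPS_PER_21MHZ_CYCLE = 0x66

def loops_to_cycles(loop_delta, loops_per_cycle=LOOPS_PER_21MHZ_CYCLE):
    """Convert a difference in loop counts into a fractional number of cycles."""
    return loop_delta / loops_per_cycle

nop = 0x0134      # nop, 21MHz, cache on (1 cycle)
add_imm = 0x0199  # add #, 21MHz, cache on (2 cycles)

extra = loops_to_cycles(add_imm - nop)
print(f"add # takes about {extra:.2f} cycles more than nop")
```

The result lands just under 1, which fits the remark that small differences in the loop count are not significant.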

It can be run on an original Super FX cart by swapping the cartridge while the console is on. The code runs in WRAM on SNES and Cart RAM/Cache on Super FX.

Here are reference captures of a StarFox cart (Mario Chip), Stunt Race FX (GSU1) and Yoshi's Island (GSU2) https://drive.google.com/drive/folders/15ac9U-x__n0AgOlWa3FGo5eEMShZYl5g?usp=sharing

The Mario Chip (v1) is unstable when reading/writing Cart RAM. Some tests time out, which doesn't happen with the GSU chips.

Another difference is that the cache opcode takes effect immediately on the GSU, while the Mario Chip seems to need 16 bytes to fill first, so not all instructions are faster in this test with the StarFox cart.

The ljmp instruction is also quite weird. It takes much longer on the GSU chip than on Mario Chip. Not sure what's going on there.

With cache off, the MiSTer core runs faster at 10MHz than at 21MHz, which is strange.

Buttons:
Left/Right: Switch to different tests
Select: Toggle 10/21MHz
Y: Toggle High speed multiplier
B: Toggle Cache

MiSTer captures: sfx_test_MiSTer_captures.zip https://drive.google.com/drive/folders/1noo2pRPoexCtVPgqSbzaexr61WOvjHNW?usp=sharing

Test rom: SuperFX.sfc.zip

Source: https://github.com/paulb-nl/sfx_speed_test

FitzRoyX commented 2 years ago

Nice test! Would it be possible to color the text green/red to reflect correct/incorrect based on the hw results?

paulb-nl commented 2 years ago

That would get a bit complicated with the 2 different chip versions and 8 setting combinations. It is also hard to decide whether a small difference is acceptable, because some tests like the plot tests need to be accurate to 1/8th of a cycle. The cycle count can change after 8 plot instructions because it will then write the pixel data to RAM.

paulb-nl commented 2 years ago

Here is a comparison of some differences. These are not all the differences but I think it is enough for now :)

The cycles mentioned below with the 10MHz tests are 10MHz cycles so 1 cycle = 2x 21MHz cycles.

MiSTer vs Stunt Race FX (GSU1):

10MHz, MS0, No cache: Everything is too fast.
NOP $72F - $4C8 = $267 = 3 cycles too fast
ADC # (2 NOPS) $993 - $660 = $333 = 4 cycles too fast
MiSTer NOP vs 2 NOPS $660 - $4C8 = $198 = 2 cycles
GSU NOP vs 2 NOPS $993 - $72F = $264 = 3 cycles
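The comparison arithmetic used throughout this list can be sketched as follows, assuming (as the NOP figures suggest) that the first value on each line is the hardware capture and the second is the MiSTer capture. The helper name is hypothetical.

```python
# About 102 loops per 21MHz cycle; a 10MHz cycle is twice as long.
LOOPS_PER_21MHZ_CYCLE = 0x66

def cycles_too_fast(hardware, core, clock_mhz):
    """Positive: the core finished sooner than hardware. Negative: too slow."""
    loops_per_cycle = LOOPS_PER_21MHZ_CYCLE * (2 if clock_mhz == 10 else 1)
    return (hardware - core) / loops_per_cycle

# 10MHz, MS0, no cache figures from the list above.
nop = cycles_too_fast(0x72F, 0x4C8, clock_mhz=10)
adc = cycles_too_fast(0x993, 0x660, clock_mhz=10)
print(round(nop), round(adc))
```

Rounding to the nearest whole cycle absorbs the small loop-count jitter mentioned in the first post.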

10MHz, MS0, Cache on
FMULT $8C9 - $7F9 = $D0 = 1 cycle too fast
GETB* $7FC - $662 = $19A = 2 cycles too fast
GETB_2 $730 - $595 = $19B = 2 cycles too fast
LDB $663 - $595 = $CE = 1 cycle too fast
LDW $730 - $661 = $CF = 1 cycle too fast
LM $994 - $8C5 = $CF = 1 cycle too fast
LMS $8C7 - $7F8 = $CF = 1 cycle too fast
LMULT $994 - $8C5 = $CF = 1 cycle too fast
SBK $4CB - $3FE = $CD = 1 cycle too fast
SM $663 - $595 = $CE = 1 cycle too fast
SMS $597 - $4C9 = $CE = 1 cycle too fast
STW $4CB - $3FD = $CE = 1 cycle too fast

10MHz, MS1, Cache on
FMULT $598 - $4C9 = $CF = 1 cycle too fast
LMULT $663 - $595 = $CE = 1 cycle too fast

10MHz PLOT, Cache on
PLOT 4 color: $267 - $29A = -$33 = 0.25 cycles too slow (2 cycles every 8 plots?)
PLOT 16 color: $266 - $2FE = -$98 = 0.75 cycles too slow (6 cycles every 8 plots?)
PLOT 256 color: $280 - $3CA = -$14A = 1.625 cycles too slow (13 cycles every 8 plots?)

The PLOT -> LOOP -> NOP loop takes 3 cycles, so 8 plots take 8 x 3 = 24 cycles. That is enough time to save the secondary pixel cache to RAM for 4 & 16 color data without waiting, so PLOT should only take 1 cycle. For 256 color, PLOT is 0.125 cycles slower ($280 vs $266), so it seems to wait 1 cycle every 8 plots.
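The "cycles every 8 plots" guesses for the 10MHz PLOT results can be reproduced with a short calculation: take the per-plot slowdown in 10MHz cycles and multiply by 8, since the pixel cache is flushed once per 8 plots. This is only a sketch; the function name is invented here, and the value pairs are copied from the 10MHz PLOT results (hardware first, MiSTer second).

```python
# One 10MHz cycle is two 21MHz cycles of about $66 loops each.
LOOPS_PER_10MHZ_CYCLE = 2 * 0x66

def extra_cycles_per_8_plots(hardware, core, loops_per_cycle=LOOPS_PER_10MHZ_CYCLE):
    """Cycles the core loses over 8 plots compared with hardware."""
    per_plot = (core - hardware) / loops_per_cycle
    return per_plot * 8

results = {
    "4 color": (0x267, 0x29A),
    "16 color": (0x266, 0x2FE),
    "256 color": (0x280, 0x3CA),
}

for mode, (hw, core) in results.items():
    print(mode, round(extra_cycles_per_8_plots(hw, core)))
```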

PLOT with color #$FC should be treated as no-plot in 4 color transparent mode since low 2 bits are zero.

21MHz, MS0, No cache
FMULT $AC5 - $7F8 = $2CD = 7 cycles too fast
GETB* $CC4 - $BF4 = $D0 = 2 cycles too fast
GETB_2 $AC6 - $9F6 = $D0 = 2 cycles too fast
LDB $A60 - $C5A = -$1FA = 5 cycles too slow
LDW $9F9 - $C5A = -$261 = 6 cycles too slow
LM $FF4 - $1253 = -$25F = 6 cycles too slow
LMS $DF6 - $1055 = -$25F = 6 cycles too slow
LMULT $CC4 - $9F6 = $2CE = 7 cycles too fast
MULT $861 - $7F8 = $69 = 1 cycle too fast
SBK $BF8 - $C5A = -$62 = 1 cycle too slow
SM $FF4 - $1055 = -$61 = 1 cycle too slow
SMS $DF6 - $E58 = -$62 = 1 cycle too slow
STW $9F9 - $A5C = -$63 = 1 cycle too slow
UMULT $861 - $7F8 = $69 = 1 cycle too fast

21MHz, MS1, No cache
FMULT $92D - $7F8 = $135 = 3 cycles too fast
LMULT $B2B - $9F6 = $135 = 3 cycles too fast

21MHz, MS0, Cache on
FMULT $466 - $3FD = $69 = 1 cycle too fast
GETB* $4CB - $3FE = $CD = 2 cycles too fast
GETB_2 $465 - $397 = $CE = 2 cycles too fast
LDW $531 - $595 = -$64 = 1 cycle too slow
LM $663 - $6C7 = -$64 = 1 cycle too slow
LMS $5FD - $661 = -$64 = 1 cycle too slow
LMULT $4CB - $463 = $68 = 1 cycle too fast
SBK $3FF - $463 = -$64 = 1 cycle too slow
SM $4CB - $52F = -$64 = 1 cycle too slow
SMS $465 - $4C9 = -$64 = 1 cycle too slow
STW $3FF - $463 = -$64 = 1 cycle too slow

21MHz, MS1, Cache on
FMULT $2CD - $3FD = -$130 = 3 cycles too slow
LMULT $332 - $463 = -$131 = 3 cycles too slow

21MHz PLOT, Cache on
PLOT 4 color: $134 - $19B = -$67 = 1 cycle too slow (8 cycles every 8 plots?)
PLOT 16 color: $133 - $218 = -$E5 = 2.25 cycles too slow (18 cycles every 8 plots?)
PLOT 256 color: $20C - $317 = -$10B = 2.625 cycles too slow (21 cycles every 8 plots?)

sorgelig commented 2 years ago

If I remember right, the GSU code was written as a functional analog, not cycle accurate. So most likely it needs rework for cycle accuracy.

paulb-nl commented 2 years ago

With this list it may seem that not much is accurate, but many of the instructions in 21MHz mode (and 10MHz with cache) are accurate.

Almost all of the instructions that are not accurate are about reading/writing from ROM/RAM and the multiplier instructions.

srg320 commented 2 years ago

Fixed some timings. I do not yet understand the logic of the rpix and ljmp instructions.

birdybro commented 2 years ago

Some ljmp and rpix info for quick reference:

from https://en.wikibooks.org/wiki/Super_NES_Programming/Super_FX_tutorial#Instruction_Set

Instruction	Description	ALT(Hex)	CODE(HEX)	ARG	Length(B)	B	ALT1	ALT2	O/V	S	CY	Z	ROM	RAM	Cache	Classification	Note
LJMP	Long jump	3D	0x9	Rn	2	0	0	0	/	/	/	/	6	6	2	"Jump, Branch and Loop Instructions"
RPIX	Read pixel color	3D	0x4C	/	2	0	0	0	/	*	/	*	24-80	24-78	20-74	Plot/related instructions

ROM/RAM/Cache columns are execution time in cycles.

LJMP seems pretty tight. o_O

paulb-nl commented 2 years ago

Thanks @srg320. I have some findings.

RAM_CYCLES for 10MHz should be "010" instead of "001". Otherwise it will access RAM with only 2 cycles instead of 3.
https://github.com/MiSTer-devel/SNES_MiSTer/blob/a6daf9b51ffc8777b04a161098884a642bc4c516/rtl/chip/GSU/GSU.vhd#L680-L681

4-color transparency should only check the lower 2 bits, so this should be added: if COLR(1 downto 0) /= "00"
https://github.com/MiSTer-devel/SNES_MiSTer/blob/a6daf9b51ffc8777b04a161098884a642bc4c516/rtl/chip/GSU/GSU.vhd#L1123-L1131
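For illustration only, the 4-color transparency check described above can be modelled in a few lines. This is a Python sketch of the rule, not the core's VHDL; the function name is hypothetical, and it assumes transparency handling is enabled.

```python
def is_no_plot_4_color(colr):
    """In 4-color mode only the low 2 bits of the color decide transparency.

    Returns True when the pixel should be skipped (treated as no-plot).
    """
    return (colr & 0b11) == 0

# Color #$FC has both low bits clear, so it should not be plotted.
print(is_no_plot_4_color(0xFC))
# Color #$FD has bit 0 set, so it is plotted.
print(is_no_plot_4_color(0xFD))
```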

I did some tests to figure out the PLOT pixel cache save logic: PLOT will save the pixel cache to RAM after the 8th PLOT if it is full, not at the 9th PLOT.

If the code is executing from ROM or Cache while the pixel cache is being saved to RAM, and it executes an STB or STW instruction to write to RAM, then the pixel cache save is paused and continues after the RAM write buffer is finished. This is probably the same for the other instructions that use the RAM write buffer, like SM, SMS and SBK.

For example executing the loop STB->PLOT->LOOP->NOP will only take 5 cycles @ 10Mhz because it doesn’t wait for the RAM writes. It must be interrupting the pixel cache save at the end of writing a byte because otherwise both pixel caches would fill up and PLOT would go into wait state.

Here are some test roms. sfx_stb will use STB to write to RAM while the pixel cache is writing to RAM and reads the values after the SFX is stopped. The value $FF means the pixel cache write has overwritten the data written by STB. There is a cache instruction before the STB writes so you can ignore the NO CACHE text in the test rom.

sfx_speed_test_stb_plot has removed some tests to add two STB/STW PLOT speed tests. The result of the STB PLOT test at 10MHz with Cache On is $3FE-$400 for 4, 16 & 256 color. This is only 2 cycles more than the PLOT tests, and STB is a 2-cycle opcode, so that means it didn't wait.
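A quick check of the "only 2 cycles more" claim, as a sketch. It assumes the low end of the reported range ($3FE) belongs to the 4-color test and pairs it with the 4-color PLOT hardware value ($267) from the earlier 10MHz comparison; the post only gives a range, so that pairing is a guess.

```python
# One 10MHz cycle is two 21MHz cycles of about $66 loops each.
LOOPS_PER_10MHZ_CYCLE = 2 * 0x66

plot_only = 0x267  # PLOT 4 color, 10MHz, cache on (hardware reference)
stb_plot = 0x3FE   # low end of the reported STB PLOT range (assumed 4 color)

extra_cycles = (stb_plot - plot_only) / LOOPS_PER_10MHZ_CYCLE
print(f"STB adds about {extra_cycles:.2f} cycles per loop iteration")
```

An added cost of about 2 cycles matches the length of STB itself, which is why the result suggests no extra wait for the RAM write.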

sfx_stb.zip sfx_speed_test_stb_plot.zip

Reference captures: sfx_speed_test_StuntRaceFx_10MHz_plot_cache_stb sfx_speed_test_StuntRaceFx_21MHz_plot_cache_stb

sfx_stb_StuntRaceFx_10Mhz sfx_stb_StuntRaceFx_21Mhz

srg320 commented 2 years ago

> I did some tests to figure out the PLOT pixel cache save logic: PLOT will save the pixel cache to RAM after 8 PLOTS if it is full. Not at 9th PLOT.

That's interesting. Thanks.

> If executing from ROM or Cache and the pixel cache is being saved to RAM and it executes an STB or STW instruction to write to RAM then the pixel cache save is paused and continues after the RAM write buffer is finished. This is probably the same for the other instructions that use the RAM write buffer like SM, SMS, SBK. For example executing the loop STB->PLOT->LOOP->NOP will only take 5 cycles @ 10Mhz because it doesn’t wait for the RAM writes. It must be interrupting the pixel cache save at the end of writing a byte because otherwise both pixel caches would fill up and PLOT would go into wait state.

I agree: executing any RAM write instruction does not stall the following instructions until another RAM access appears. This is implemented in the core in the last commit.

srg320 commented 2 years ago

I am also interested in the ROM access time when the cache is loaded. I suspect that this time is faster than the time to load byte from ROM.

paulb-nl commented 2 years ago

The tests on the first page at 21MHz with Cache on seem to be all fixed. The plot tests also look good. 21MHz without Cache and 10MHz still need to be fixed.

However, the latest fixes caused everything executing from ROM at 21MHz to be 2 cycles too slow: it went from 5 to 7 cycles per byte. I have attached a test rom that runs the SFX code from ROM. Most results without cache should match the version that runs from Cart RAM, except for instructions that access RAM/ROM. For example, PLOT without cache should be faster executing from ROM than from RAM.

SuperFX_rom.sfc.zip

Unfortunately I am unable to make reference captures for the ROM versions because that would need a modified Super FX cartridge.

> I agree, executing an any RAM write instructions do not stop the queue of next instructions until any RAM access appears. And this is implemented in the core in last commit.

Ok but I meant the RAM write buffer will have priority and will pause the pixel cache write. I will give an example from my test:

    ibt R0, #$34
    iwt R3, #$1031

    plots 7
    cache
    plot ; 8th plot, start pixel cache write (256-color 8 bytes)

    stb (R3) ;  pause pixel cache write, RAM buffer will write $34 to $701031
    inc R0

    ; pixel cache will overwrite $701031 ($34) with $FF

> I am also interested in the ROM access time when the cache is loaded. I suspect that this time is faster than the time to load byte from ROM.

Which ROM access do you mean? As far as I know, ROM access is the same as RAM: 3 cycles at 10MHz and 5 cycles at 21MHz. The GETB instructions test ROM reading, so we know what the results should be.

srg320 commented 2 years ago

> Ok but I meant the RAM write buffer will have priority and will pause the pixel cache write. I will give an example from my test:

    ibt R0, #$34
    iwt R3, #$1031

    plots 7
    cache
    plot ; 8th plot, start pixel cache write (256-color 8 bytes)

    stb (R3) ;  pause pixel cache write, RAM buffer will write $34 to $701031
    inc R0

    ; pixel cache will overwrite $701031 ($34) with $FF

Ok. I wonder what the result would be if you added one or two nops before stb (R3).

> As far as I know ROM access is the same as RAM. 3 cycles at 10Mhz and 5 cycles at 21Mhz. The GETB instructions test ROM reading so we know what the results should be.

From this test you can see that in the Load/Store Word to/from RAM commands the second (MSB) access is shorter by 1 cycle. Perhaps when loading the cache (16 bytes sequential access) the access time is less than 5 cycles (some kind of burst mode).

MiSTer-devel / SNES_MiSTer

Super FX speed test #340