some performance investigation

jnk0le commented 1 year ago

Regarding your struggle with uncompressed instructions, I did some simple tests with 10 straight-line instructions (I'll put the template here once it's ready):

10x c.nop

0ws: 10 1ws: 10 2ws (invalid in RM): 21

10x big nop

0ws: 10 1ws: 20 2ws (invalid in RM): 41

c.nop + 9x nop

0ws: 11 1ws: 20 2ws (invalid in RM): 41

2x c.nop + 8x nop

0ws: 10 1ws: 18 2ws (invalid in RM): 37

3x c.nop + 7x nop

0ws: 11 1ws: 18 2ws (invalid in RM): 37

5x c.nop + 5x nop

0ws: 10 1ws: 16 2ws (invalid in RM): 33

c.nop + 8x nop + c.nop

0ws: 11 1ws: 18 2ws (invalid in RM): 37

repeating 1x nop then 2x c.nop (10 insn total)

0ws: 10 1ws: 14 2ws (invalid in RM): 29

repeating c.nop, nop (10 insn total)

0ws: 10 1ws: 16 2ws (invalid in RM): 33

10x c.lw (or c.sw) from sram

0ws: 20 1ws: 20 2ws (invalid in RM): 21

10x lw (or sw) from sram

0ws: 20 1ws: 20 2ws (invalid in RM): 41

An unaligned lw/sw causes an unaligned load/store exception. A word-unaligned lb/sb doesn't add penalty cycles.

    6x c.nop +
    c.slli a2, 1
    c.or a2, a0
    c.addi s1, -1 // # of bits left.
    andi a4, s1, 31 // mask off so we only look at bottom 5 bits

0ws: 11 1ws: 12 2ws (invalid in RM): 25

    5x c.nop +
    c.slli a2, 1
    c.or a2, a0
    c.addi s1, -1 // # of bits left.
    andi a4, s1, 31 // mask off so we only look at bottom 5 bits
    c.nop

0ws: 10 1ws: 12 2ws (invalid in RM): 25

Looks like flash prefetching works, with 4-byte lines.
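The 0ws and 1ws numbers above are consistent with a simple fetch-bound model: execution takes one cycle per instruction, but never less than (1 + ws) cycles per 4-byte flash line. A minimal sketch of that fit follows. The function name `estimate_cycles` and its parameters are made up for illustration, and the model is only inferred from the measurements in this comment, not from any vendor documentation. It assumes the code starts on a 4-byte boundary, and it does not reproduce the occasional 11 seen at 0ws, the 2-cycle loads/stores, or the (invalid) 2ws column.

```c
#include <stdint.h>

/* Rough fit to the straight-line measurements above (0ws and 1ws only).
 * Assumptions (not from vendor docs): code starts 4-byte aligned, and
 * flash delivers one 4-byte line every (1 + ws) cycles. */
static uint32_t estimate_cycles(uint32_t n16, uint32_t n32, uint32_t ws)
{
    uint32_t insns = n16 + n32;              /* one cycle per instruction    */
    uint32_t bytes = 2u * n16 + 4u * n32;    /* total code size              */
    uint32_t lines = (bytes + 3u) / 4u;      /* 4-byte flash lines touched   */
    uint32_t fetch = lines * (1u + ws);      /* cycles needed just to fetch  */
    return insns > fetch ? insns : fetch;    /* whichever bound is slower    */
}
```

For example, `estimate_cycles(6, 4, 1)` corresponds to the "repeating 1x nop then 2x c.nop" case (4 long and 6 compressed instructions) and `estimate_cycles(2, 8, 1)` to "2x c.nop + 8x nop".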

Note that you are using 48 MHz with the 1ws config. You can put code in SRAM (e.g. .section .data.yourfunc, "x") to get 0ws, but then there is potential contention with DMA.

jnk0le commented 1 year ago

BTW you were complaining about .align dumping too much padding.

https://ftp.gnu.org/old-gnu/Manuals/gas-2.9.1/html_node/as_68.html

For other systems, including the i386 using a.out format, it is the number of low-order zero bits the location counter must have after advancement. For example `.align 3' advances the location counter until it a multiple of 8. If the location counter is already a multiple of 8, no change is needed.

You need to use .balign if you are specifying the alignment in bytes.
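To make the difference concrete, here is a small sketch of how the location counter advances under each directive, assuming a target where `.align` takes the number of low-order zero bits (as in the passage quoted above). The helper names `balign` and `align_pow2` are only illustrative.

```c
#include <stdint.h>

/* .balign n: advance to the next multiple of n bytes (n must be a power of two). */
static uint32_t balign(uint32_t loc, uint32_t n)
{
    return (loc + n - 1u) & ~(n - 1u);
}

/* .align n, on targets where n is the number of low-order zero bits:
 * equivalent to .balign (1 << n). */
static uint32_t align_pow2(uint32_t loc, uint32_t n)
{
    return balign(loc, 1u << n);
}
```

With the location counter at 2, `align_pow2(2, 4)` moves to 16 while `balign(2, 4)` only moves to 4, which is where the unexpected padding comes from.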

jnk0le commented 1 year ago

BTW2

W/O HPE: 444ns W/ HPE: 589ns

"w/ HPE" option could go down by 140 ns (100ns in stream 2 case) as the stacking is not necessary (except s0, s1)

EDIT: note that IRQ code in SRAM might have some penalty for HPE. You can also use table-free interrupts to (probably) bring the HPE-less case down a bit.

cnlohr commented 1 year ago

Thank you for the clarification of the .balign thing. I'll give that a shot in my next livestream.

Perhaps we could make a new, extra docs folder with what you've found, but in a more publicly readable and absorbable format, i.e. with markdown tables, etc.

Also, do you see any interesting stuff surrounding word alignment in your tests? I just regret that I have a hard time absorbing the information above to obtain the deeper understanding of what's really going on inside the chip. I am also going to send this to Macyler who will likely be doing other testing.

jnk0le commented 1 year ago

so far:

loads and stores take 2 cycles

flash has 4-byte lines with linear prefetch working (it could be in the core or in the flash, like the CM0 parts do; I'll check later), therefore only 16-bit (single-cycle) instructions can execute at full speed

unaligned long instructions seem to (sometimes) have an initial one-cycle penalty and then execute normally

a taken branch (to an aligned location) is 3 cycles at 0ws and 5 cycles at 1ws: 1 extra cycle for finishing the prefetch of the next instruction and another while waiting for the target location

jnk0le commented 1 year ago

compressed branching:

0ws is always 3 cycle
1ws:
- branch from an earlier op is 4 cycles (and +1 due to only 1 insn in unaligned location)
- branch from later op is 5 cycles (and +1 due to only 1 insn in unaligned location)

Seems that the linear prefetch is triggered when the 2nd instruction in a bundle gets executed. The branch also doesn't care about already prefetched instructions (no benefit for short forward branches). Backward branching behaves exactly the same as forward.

//compressed baseline:

    FLASH->ACTLR = FLASH_ACTLR_LATENCY_0;
    printf("0ws: %lu\n", ch32v_pipetest_tmpl());
    FLASH->ACTLR = FLASH_ACTLR_LATENCY_1;
    printf("1ws: %lu\n", ch32v_pipetest_tmpl()+2);

0ws: 20000 1ws: 22000

//1st op no skip:
0ws: 22000 //3
1ws: 26000 //5

//1st op over one:

    beqz a0, 2f
    nop

2:  nop
3:  nop

0ws: 21000 //-1 3
1ws: 24000 //-1 4

//1st op over two:
0ws: 20000 //-2 3
1ws: 24000 //-2 5

//1st op over three:
0ws: 19000 //-3 3
1ws: 22000 //-3 4

//1st op over five:
0ws: 17000 //-5 3
1ws: 20000 //-5 4

//2nd op no skip:
0ws: 22000 //3
1ws: 26000 //5

//2nd op over one:

    nop
    beqz a0, 3f

2:  nop
3:  nop

0ws: 21000 //-1 3
1ws: 26000 //-1 6

//2nd op over two:
0ws: 20000 //-2 3
1ws: 24000 //-2 5

//2nd op over three:
0ws: 19000 //-3 3
1ws: 24000 //-3 6

//2nd op over five:
0ws: 17000 //-5 3
1ws: 22000 //-5 6

//trim one nop from baseline
0ws: 19000
1ws: 20002

uncompressed

1 cycle penalty for unaligned branch.

//norvc baseline (allbig)
0ws: 20000
1ws: 38000

//norvc noskip
0ws: 22000
1ws: 40000

//norvc over one
0ws: 21000
1ws: 38000

//norvc over two
0ws: 20000
1ws: 36000

//unaligned norvc baseline (1x c.nop at beginning and end)

1: // replace code below
.option rvc
    nop
.option norvc
    nop
    [...]
    nop
.option rvc
    nop

    addi a5, a5, -1
    bnez a5, 1b // 3 cycle taken at 0ws, 5 at 1ws

0ws: 20001
1ws: 36000

//unaligned norvc noskip
0ws: 23000
1ws: 40000

//unaligned norvc over one
0ws: 22000
1ws: 38000

//unaligned norvc over two
0ws: 21000
1ws: 36000

jnk0le commented 1 year ago

2-cycle ops seem to swap the timings of branching from an earlier/later op. EDIT: long 1-cycle instructions do not experience this swap.

//baseline
0ws: 20000
1ws: 22000

//one lw sram
0ws: 21000
1ws: 22001 //4??

//trim one nop //lw from sram
0ws: 20000
1ws: 22000 //5??

//two lw sram
0ws: 22000
1ws: 24000 //5

//trim one nop //two lw from sram
0ws: 21000
1ws: 22001 //4

//one lw from flash
0ws: 21000
1ws: 24001

//two lw from flash
0ws: 22000
1ws: 28000

//trim one nop //lw from flash
0ws: 20000
1ws: 24000

//trim one nop //two lw from flash
0ws: 21000
1ws: 26001

two loads can be either

    lw a2, 0(a1)
    lw a2, 0(a1)

    nop
    nop

or

    lw a2, 0(a1)
    nop

    lw a2, 0(a1)
    nop

or swapped ops in bundles - no difference

jnk0le commented 1 year ago

If the prefetcher is pressured enough by long instructions after 2-cycle ones, it seems to go back to 4 cycles from the earlier op and 5 from the later op (4e/5l).

    lw a2, 0(a1)
    nop

    lw a2, 0(a1)
    nop

    nop
    nop

.option norvc
    nop
//.option rvc
    nop
.option rvc

//1 lw sram, 1 big nop
0ws: 21000
1ws: 22002 //4 from earlier

//1 lw sram, 1 big nop // trim one nop
0ws: 20000
1ws: 22000 //5 from later

//2 lw sram, 1 big nop
0ws: 22000
1ws: 24000 //e???

//2 lw sram, 1 big nop // trim one nop
0ws: 21000
1ws: 22001 //l???

//1 lw sram, 2 big nop
0ws: 21000
1ws: 24000 //l

//1 lw sram, 2 big nop // trim one nop
0ws: 20000
1ws: 22002 //e

//2 lw sram, 2 big nop
0ws: 22000
1ws: 24000 //l

//2 lw sram, 2 big nop // trim one nop
0ws: 21000
1ws: 22002 //e

//2lw 4big
0ws: 22000
1ws: 26000 //l

//2lw 4big //trim one
0ws: 21000
1ws: 24002 //e

//2lw 3big
0ws: 22000
1ws: 24002 //e

//2lw 3big //trim one
0ws: 21000
1ws: 24000 //l

cnlohr commented 1 year ago

I am really sorry, with your syntax, I do not understand what you are trying to say. Please use a different syntax to describe what you are finding? I am not able to extract any info from it :(

jnk0le commented 1 year ago

those are the cycle counts, for a given scenario, in this template https://github.com/jnk0le/random/blob/master/pipeline%20cycle%20test/ch32v_pipetest_tmpl.S (1 additional cycle per loop makes 1000 cycles, loop invariant stuff can be filtered out)

for a quick summary:

At 0ws: everything is cycle perfect.
At 1ws: the prefetching is weird enough that one cannot easily predict the execution/branch timings, especially the branch anomalies.
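To read the numbers in the earlier comments: since one additional cycle per loop iteration shows up as 1000 extra cycles, the difference against the baseline divided by 1000 gives the per-iteration cost. A minimal sketch, assuming a loop count of 1000 (inferred from the statement above) and rounding to the nearest whole cycle so that loop-invariant leftovers such as the trailing "+1" or "+2" are filtered out. The name `extra_cycles_per_iter` is only illustrative.

```c
#include <stdint.h>

#define LOOP_COUNT 1000  /* assumption: inferred from "1 additional cycle per loop makes 1000 cycles" */

/* Per-iteration cycle difference between a scenario and its baseline,
 * rounded to the nearest cycle to drop loop-invariant residue. */
static int32_t extra_cycles_per_iter(int32_t measured, int32_t baseline)
{
    int32_t diff = measured - baseline;
    int32_t half = LOOP_COUNT / 2;
    /* C division truncates toward zero, so bias away from zero first. */
    return (diff >= 0 ? diff + half : diff - half) / LOOP_COUNT;
}
```

For instance, the "1st op no skip" result at 1ws (26000 against a 22000 baseline) comes out as 4 extra cycles per iteration compared with the nop the branch replaced, i.e. a 5-cycle branch.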

cnlohr commented 1 year ago

Are you on the Discord server? This feels like a largely parallel effort to what Macyler is doing.

jnk0le commented 1 year ago

I don't have discord account, though it's possible to see those channels without creating one.

cnlohr commented 1 year ago

I am not sure. This is the specific channel https://discord.com/channels/665433554787893289/1110284149450878979

CaiB commented 1 year ago

(I'm Macyler) My project is still a mess right now and I haven't documented the results so far, but it's here: https://github.com/CaiB/CH32V003-Architecture-Exploration/tree/main

What I've found so far is that it seems like alignment makes little difference to in-order execution.
Executing the same instruction 3 and 7 times in a row: From this it looks like you get 2 non-compressed instructions in a row, then any more will slow you down to 2 CPI. lui being a weird exception.
I haven't done any more testing in this direction, currently trying to set up opcode fuzzing to try and find some of the undocumented instructions.

jnk0le commented 1 year ago

regarding the compressed loads etc. it should be stuff from Zce v0.50 (or older) https://github.com/riscv/riscv-code-size-reduction/releases/tag/V0.50.1-TOOLCHAIN-DEV

The turnaround of spec to shipped silicon is about right in this case.

> lui being a weird exception.

It could have been given a compressed opcode (c.lui can address all registers except x0 and x2); then everything is as expected.
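For reference, the standard RVC rule for when a `lui` can be encoded as `c.lui` is: rd is neither x0 nor x2, and the immediate is a non-zero value that fits in 6 signed bits. A minimal sketch of that check follows; the function name `lui_fits_c_lui` is made up here, and whether the assembler actually emitted `c.lui` in the tests above is still only a guess.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if `lui rd, imm20` can be encoded as c.lui.
 * rd is the register number (0..31), imm20 the raw 20-bit lui immediate. */
static bool lui_fits_c_lui(uint32_t rd, uint32_t imm20)
{
    int32_t simm = (int32_t)(imm20 & 0xFFFFFu);
    if (simm & 0x80000) {
        simm -= 0x100000;               /* sign-extend from 20 bits */
    }
    if (rd == 0u || rd == 2u) {
        return false;                   /* x0 and x2 are excluded   */
    }
    return simm != 0 && simm >= -32 && simm <= 31;
}
```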

cnlohr commented 1 year ago

Do you know that it is? Or just a guess?

Also, seeing C.NOT has me all like

jnk0le commented 1 year ago

ok, tried the 0.50 c.lb

    li a0, 0x12345678
    sw a0, 1024(gp)
    addi a1, gp, 1024

    .2byte (0x2002 | (3 << 7) | (0 << 2)) //a1 addr // 0 offset // load to s0
    .2byte (0x2002 | (3 << 7) | (1 << 2) | (1 << 11)) //a1 addr // 1 offset // load to s1

and got the result of the v0.70 cm.lhu, so that's definitely Zcmb. Not sure about the sp ones, as those were dropped from the 0.70 Zce.
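For anyone wanting to reproduce the probe, the two `.2byte` values can be computed on the host. This sketch only mirrors the shifts and the 0x2002 base from the snippet above; the function and parameter names (`probe_halfword`, `base_reg`, `dest_reg`, `bit11`) are made up for illustration and are not official field names.

```c
#include <stdint.h>

/* Builds the 16-bit word used in the probe above.
 * base_reg: 3-bit compressed register index at bits 9:7 (3 = a1 in the snippet)
 * dest_reg: 3-bit compressed register index at bits 4:2 (0 = s0, 1 = s1)
 * bit11:    the extra bit set in the second probe ("1 offset" in its comment) */
static uint16_t probe_halfword(uint16_t base_reg, uint16_t dest_reg, uint16_t bit11)
{
    return (uint16_t)(0x2002u | (base_reg << 7) | (dest_reg << 2) | (bit11 << 11));
}
```

The first probe is `probe_halfword(3, 0, 0)` and the second is `probe_halfword(3, 1, 1)`.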

jnk0le commented 1 year ago

v0.70 c.not -> illegal instruction

v0.50 c.not -> that's c.lbu of v0.70 Zcmb

cnlohr commented 1 year ago

You need to use mmooooorreeeee worrrrrdssssss. I don't have enough background to know what you are referring to.

What is v0.50 c.not? Are you saying that it was overridden, like the instruction as defined in v0.50 actually is c.lbu and that the processor does execute c.lbu as defined in Zcmb?

jnk0le commented 1 year ago

> Are you saying that it was overridden, like the instruction as defined in v0.50 actually is c.lbu and that the processor does execute c.lbu as defined in Zcmb?

Yes, it has the same encoding as the already implemented Zcmb instructions (which did load something into a5).

jnk0le commented 1 year ago

wait, that's not zcmb, bit 12 is part of an offset.

there is also no c.lb and ~c.sb~ c.sh in their "documentation" of xw extension.

E: (c.lbu and c.lhu are there)

cnlohr commented 1 year ago

> wait, that's not zcmb, bit 12 is part of an offset.
>
> there is also no c.lb and c.sb in their "documentation" of xw extension.

Wait really?!? Bleh, I feel like I need a guide for all of the opcodes that can be used.

I really want to enable bit timing correction.

cnlohr / rv003usb

some performance investigation #5
