Open jnk0le opened 1 year ago
BTW you were complaining about .align dumping too much padding.
https://ftp.gnu.org/old-gnu/Manuals/gas-2.9.1/html_node/as_68.html
For other systems, including the i386 using a.out format, it is the number of low-order zero bits the location counter must have after advancement. For example `.align 3' advances the location counter until it a multiple of 8. If the location counter is already a multiple of 8, no change is needed.
need to use .balign if specifying byte alignment
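To make the difference concrete, here is a small sketch of the padding arithmetic, assuming a target where `.align n` means "align to 2^n bytes" (which is how the quoted manual describes it) and `.balign n` means "align to n bytes". The function names are made up for illustration and are not part of any toolchain.

```c
#include <stdint.h>

/* Padding bytes inserted by `.align n` when n is a power-of-two exponent,
 * i.e. the location counter is advanced to a multiple of (1 << n). */
static uint32_t pad_align_pow2(uint32_t loc, unsigned n)
{
    uint32_t boundary = (uint32_t)1 << n;
    return (boundary - (loc & (boundary - 1))) & (boundary - 1);
}

/* Padding bytes inserted by `.balign bytes`, where `bytes` is the
 * alignment itself (assumed to be a power of two here). */
static uint32_t pad_balign(uint32_t loc, uint32_t bytes)
{
    return (bytes - (loc & (bytes - 1))) & (bytes - 1);
}
```

With the location counter at 0x102, `.balign 4` pads 2 bytes, while `.align 4` under the power-of-two reading pads 14 bytes, which is where the unexpected padding comes from.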
BTW2
W/O HPE: 444ns W/ HPE: 589ns
"w/ HPE" option could go down by 140 ns (100ns in stream 2 case) as the stacking is not necessary (except s0, s1)
EDIT: note that irq code in sram might have some penalty for HPE. You can also use table free interrupts to get (probably) HPE-less case a bit down
Thank you for the clarification of the .balign thing. I'll give that a shot in my next livestream.
Perhaps we could make a new, extra docs folder with what you've found, but in a more publicly readable and absorbable format, i.e. with markdown tables, etc.
Also, do you see any interesting stuff surrounding word alignment in your tests? I just regret that I have a hard time absorbing the information above to obtain the deeper understanding of what's really going on inside the chip. I am also going to send this to Macyler who will likely be doing other testing.
so far:
loads/stores are 2 cycles
flash with 4 byte lines and linear prefetch working (could be in core or in flash like cm0 are doing, i'll check later) therefore only 16bit (single-cycle) instructions can execute at full speed
unaligned long instructions seem to (sometimes) have initial one cycle penalty and then execute normally
taken (to aligned location) branch is 3 cycles at 0ws and 5 cycles at 1 ws: 1 extra cycle for finishing prefetch of next instruction and another when waiting for target location
Seems that the linear prefetch is triggered when the 2nd instruction in a bundle gets executed. Branch also doesn't care about already prefetched instructions (no benefit in short forward branches); backward branching behaves exactly the same as forward
//compressed baseline:
FLASH->ACTLR = FLASH_ACTLR_LATENCY_0;
printf("0ws: %lu\n", ch32v_pipetest_tmpl());
FLASH->ACTLR = FLASH_ACTLR_LATENCY_1;
printf("1ws: %lu\n", ch32v_pipetest_tmpl()+2);
0ws: 20000 1ws: 22000
//1st op no skip: 0ws: 22000 //3 1ws: 26000 //5
//1st op over one:
beqz a0, 2f
nop
2: nop
3: nop
0ws: 21000 //-1 3 1ws: 24000 //-1 4
//1st op over two: 0ws: 20000 //-2 3 1ws: 24000 //-2 5
//1st op over three: 0ws: 19000 //-3 3 1ws: 22000 //-3 4
//1st op over five: 0ws: 17000 //-5 3 1ws: 20000 //-5 4
//2nd op no skip: 0ws: 22000 //3 1ws: 26000 //5
//2nd op over one:
nop
beqz a0, 3f
2: nop
3: nop
0ws: 21000 //-1 3 1ws: 26000 //-1 6
//2nd op over two: 0ws: 20000 //-2 3 1ws: 24000 //-2 5
//2nd op over three: 0ws: 19000 //-3 3 1ws: 24000 //-3 6
//2nd op over five: 0ws: 17000 //-5 3 1ws: 22000 //-5 6
//trim one nop from baseline 0ws: 19000 1ws: 20002
1 cycle penalty for unaligned branch.
//norvc baseline (allbig) 0ws: 20000 1ws: 38000
//norvc noskip 0ws: 22000 1ws: 40000
//norvc over one 0ws: 21000 1ws: 38000
//norvc over two 0ws: 20000 1ws: 36000
//unaligned norvc baseline (1x c.nop at beginning and end)
1: // replace code below
.option rvc
nop
.option norvc
nop
[...]
nop
.option rvc
nop
addi a5, a5, -1
bnez a5, 1b // 3 cycle taken at 0ws, 5 at 1ws
0ws: 20001 1ws: 36000
//unaligned norvc noskip 0ws: 23000 1ws: 40000
//unaligned norvc over one 0ws: 22000 1ws: 38000
//unaligned norvc over two 0ws: 21000 1ws: 36000
2 cycle ops seem to swap the timings of branching from earlier/later op. EDIT: long 1 cycle instructions are not experiencing this swap
//baseline 0ws: 20000 1ws: 22000
//one lw sram 0ws: 21000 1ws: 22001 //4??
//trim one nop //lw from sram 0ws: 20000 1ws: 22000 //5??
//two lw sram 0ws: 22000 1ws: 24000 //5
//trim one nop //two lw from sram 0ws: 21000 1ws: 22001 //4
//one lw from flash 0ws: 21000 1ws: 24001
//two lw from flash 0ws: 22000 1ws: 28000
//trim one nop //lw from flash 0ws: 20000 1ws: 24000
//trim one nop //two lw from flash 0ws: 21000 1ws: 26001
two loads can be either
lw a2, 0(a1)
lw a2, 0(a1)
nop
nop
or
lw a2, 0(a1)
nop
lw a2, 0(a1)
nop
or swapped ops in bundles - no difference
if prefetcher is pressured enough with long instructions after 2 cycle ones, it seems to be back to 4e/5l
lw a2, 0(a1)
nop
lw a2, 0(a1)
nop
nop
nop
.option norvc
nop
//.option rvc
nop
.option rvc
//1 lw sram, 1 big nop 0ws: 21000 1ws: 22002 //4 from earlier
//1 lw sram, 1 big nop // trim one nop 0ws: 20000 1ws: 22000 //5 from later
//2 lw sram, 1 big nop 0ws: 22000 1ws: 24000 //e???
//2 lw sram, 1 big nop // trim one nop 0ws: 21000 1ws: 22001 //l???
//1 lw sram, 2 big nop 0ws: 21000 1ws: 24000 //l
//1 lw sram, 2 big nop // trim one nop 0ws: 20000 1ws: 22002 //e
//2 lw sram, 2 big nop 0ws: 22000 1ws: 24000 //l
//2 lw sram, 2 big nop // trim one nop 0ws: 21000 1ws: 22002 //e
//2lw 4big 0ws: 22000 1ws: 26000 //l
//2lw 4big //trim one 0ws: 21000 1ws: 24002 //e
//2lw 3big 0ws: 22000 1ws: 24002 //e
//2lw 3big //trim one 0ws: 21000 1ws: 24000 //l
I am really sorry, with your syntax, I do not understand what you are trying to say. Please use a different syntax to describe what you are finding? I am not able to extract any info from it :(
those are the cycle counts, for a given scenario, in this template https://github.com/jnk0le/random/blob/master/pipeline%20cycle%20test/ch32v_pipetest_tmpl.S (1 additional cycle per loop makes 1000 cycles, loop invariant stuff can be filtered out)
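As a rough sketch of how to read those numbers: since one extra cycle per loop shows up as 1000 cycles in the total, subtracting the baseline and dividing by 1000 gives the per-iteration difference, and rounding drops the few loop-invariant cycles. The helper name below is hypothetical and is not taken from the template.

```c
#include <stdint.h>

/* From the thread: 1 additional cycle per loop makes 1000 cycles. */
#define LOOP_ITERATIONS 1000

/* Per-iteration cycle difference between a scenario and its baseline,
 * rounded to the nearest cycle so that small loop-invariant offsets
 * (e.g. a total of 20002 instead of 20000) are filtered out. */
static int32_t cycles_per_iteration_delta(uint32_t measured, uint32_t baseline)
{
    int32_t diff = (int32_t)measured - (int32_t)baseline;
    int32_t half = LOOP_ITERATIONS / 2;
    return (diff >= 0 ? diff + half : diff - half) / LOOP_ITERATIONS;
}
```

For example, a total of 22000 against the 20000 baseline is 2 extra cycles per iteration, and 19000 is 1 cycle fewer.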
for a quick summary:
Are you on the Discord server? This feels like a largely parallel effort to what Macyler is doing.
I don't have discord account, though it's possible to see those channels without creating one.
I am not sure. This is the specific channel https://discord.com/channels/665433554787893289/1110284149450878979
(I'm Macyler) My project is still a mess right now and I haven't documented the results so far, but it's here: https://github.com/CaiB/CH32V003-Architecture-Exploration/tree/main
What I've found so far is that it seems like alignment makes little difference to in-order execution.
Executing the same instruction 3 and 7 times in a row:
From this it looks like you get 2 non-compressed instructions in a row, then any more will slow you down to 2 CPI. lui being a weird exception.
I haven't done any more testing in this direction, currently trying to set up opcode fuzzing to try and find some of the undocumented instructions.
regarding the compressed loads etc. it should be stuff from Zce v0.50 (or older) https://github.com/riscv/riscv-code-size-reduction/releases/tag/V0.50.1-TOOLCHAIN-DEV
The turnaround of spec to shipped silicon is about right in this case.
lui being a weird exception.
it could get compressed opcode (c.lui can address all registers except x0, and x2), then everything is as expected
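For reference, a minimal sketch of when an assembler is allowed to pick the 16-bit c.lui, following the standard RISC-V "C" extension rules (rd not x0 or x2, immediate non-zero and representable as a sign-extended 6-bit value). Whether the assembler actually did so in the test above is the guess being discussed, and the function names here are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

/* imm20 is the 20-bit immediate of `lui rd, imm20` (bits 31:12 of the result). */
static bool lui_is_compressible(unsigned rd, uint32_t imm20)
{
    imm20 &= 0xFFFFFu;
    if (rd == 0 || rd == 2 || rd > 31)
        return false;               /* x0 and x2 encodings are not c.lui */
    if (imm20 == 0)
        return false;               /* zero immediate is reserved */
    /* must fit a sign-extended 6-bit immediate */
    return imm20 <= 0x1Fu || imm20 >= 0xFFFE0u;
}

/* Encode c.lui: funct3=011, op=01, imm[17] in bit 12, rd in bits 11:7,
 * imm[16:12] in bits 6:2. Caller should check lui_is_compressible() first. */
static uint16_t encode_c_lui(unsigned rd, uint32_t imm20)
{
    uint32_t imm6 = imm20 & 0x3Fu;
    return (uint16_t)(0x6001u | (((imm6 >> 5) & 1u) << 12)
                      | ((rd & 0x1Fu) << 7) | ((imm6 & 0x1Fu) << 2));
}
```

So `lui a0, 1` can shrink to the halfword 0x6505, while `lui a0, 0x20` or anything targeting x2 has to stay 32-bit.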
Do you know that it is? Or just a guess?
Also, seeing C.NOT has me all like
ok, tried the 0.50 c.lb
li a0, 0x12345678
sw a0, 1024(gp)
addi a1, gp, 1024
.2byte (0x2002 | (3 << 7) | (0 << 2)) //a1 addr // 0 offset // load to s0
.2byte (0x2002 | (3 << 7) | (1 << 2) | (1 << 11)) //a1 addr // 1 offset // load to s1
and got result of v0.70 cm.lhu, so that's definitely Zcmb. Not sure about sp ones as those were dropped from 0.70 Zce
v0.70 c.not -> illegal instruction
v0.50 c.not -> that's c.lbu of v0.70 Zcmb
You need to use mmooooorreeeee worrrrrdssssss. I don't have enough background to know what you are referring to.
What is v0.50 c.not? Are you saying that it was overridden, like the instruction as defined in v0.50 actually is c.lbu and that the processor does execute c.lbu as defined in Zcmb?
Are you saying that it was overridden, like the instruction as defined in v0.50 actually is c.lbu and that the processor does execute c.lbu as defined in Zcmb?
yes, it has the same encoding as the already implemented Zcmb instructions (which did load something into a5)
wait, that's not zcmb, bit 12 is part of an offset.
there is also no c.lb and ~c.sb~ c.sh in their "documentation" of xw extension.
E: (c.lbu and c.lhu are there)
wait, that's not zcmb, bit 12 is part of an offset.
there is also no c.lb and c.sb in their "documentation" of xw extension.
Wait really?!? Bleh, I feel like I need a guide for all of the opcodes that can be used.
I really want to enable bit timing correction.
regarding your struggle with uncompressed instructions, I did some simple tests with 10 straightlined instructions: (I'll put template here once ready)
10x c.nop
0ws: 10 1ws: 10 2ws (invalid in RM): 21
10x big nop
0ws: 10 1ws: 20 2ws (invalid in RM): 41
c.nop + 9x nop
0ws: 11 1ws: 20 2ws (invalid in RM): 41
2x c.nop + 8x nop 0ws: 10 1ws: 18 2ws (invalid in RM): 37
3x c.nop + 7x nop
0ws: 11 1ws: 18 2ws (invalid in RM): 37
5x c.nop + 5x nop
0ws: 10 1ws: 16 2ws (invalid in RM): 33
c.nop + 8x nop + c.nop
0ws: 11 1ws: 18 2ws (invalid in RM): 37
repeating 1x nop then 2x c.nop (10 insn total)
0ws: 10 1ws: 14 2ws (invalid in RM): 29
repeating c.nop, nop (10 insn total) 0ws: 10 1ws: 16 2ws (invalid in RM): 33
10x c.lw (or c.sw) from sram 0ws: 20 1ws: 20 2ws (invalid in RM): 21
10x lw (or sw) from sram 0ws: 20 1ws: 20 2ws (invalid in RM): 41
unaligned lw/sw causes unaligned load/store exception
word unaligned lb/sb doesn't add penalty cycles
0ws: 11 1ws: 12 2ws (invalid in RM): 25
0ws: 10 1ws: 12 2ws (invalid in RM): 25
looks like flash prefetching works, 4byte lines.
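One way to sanity-check the 4-byte-line reading is to count how many flash lines a straight-line sequence touches. The sketch below is only a curve fit to the 1ws nop numbers above, not a description of the hardware: it assumes the sequence starts on a 4-byte boundary, that every instruction is single-cycle, and that each line costs 2 cycles at 1ws. It does not try to explain the 0ws or 2ws columns. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define FLASH_LINE_BYTES 4u

/* Number of 4-byte flash lines touched by a straight-line sequence.
 * insn_bytes[i] is the size of instruction i (2 for compressed, 4 otherwise). */
static uint32_t flash_lines_touched(const uint8_t *insn_bytes, size_t count,
                                    uint32_t start_addr)
{
    uint32_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += insn_bytes[i];
    if (total == 0)
        return 0;
    uint32_t first = start_addr / FLASH_LINE_BYTES;
    uint32_t last = (start_addr + total - 1) / FLASH_LINE_BYTES;
    return last - first + 1;
}

/* Assumed fit for 1ws: execution is limited either by the instruction
 * count (one cycle each) or by fetching lines at 2 cycles per line. */
static uint32_t estimate_cycles_1ws(const uint8_t *insn_bytes, size_t count,
                                    uint32_t start_addr)
{
    uint32_t fetch = 2u * flash_lines_touched(insn_bytes, count, start_addr);
    return fetch > (uint32_t)count ? fetch : (uint32_t)count;
}
```

Under those assumptions, 10x c.nop gives 10, 10x nop gives 20, 5x c.nop + 5x nop gives 16, and the repeating nop, c.nop, c.nop pattern gives 14, which lines up with the 1ws results listed above.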
Note that you are using 48MHz with the 1ws config. You can put code in sram (e.g. .section .data.yourfunc, "x") for 0ws, but here comes in the potential contention with DMA