cnlohr / rv003usb

CH32V003 RISC-V Pure Software USB Controller
MIT License

encodings of xw loads/stores #11

Open jnk0le opened 1 year ago

jnk0le commented 1 year ago

Made WaveDrom diagrams of some of those WCH instructions (you can use https://wavedrom.com/editor.html for a quick preview).

xw.c.lbu

{reg:[
 { bits: 2, name: 0x0, attr: ['xw.c.lbu'] },
 { bits: 3, name: 'rd\'' },
 { bits: 2, name: `uimm[2:1]` },
 { bits: 3, name: 'rs1\'' },
 { bits: 3, name: `uimm[0|4:3]` },
 { bits: 3, name: 0x1, attr: ['xw.c.lbu'] },
]}

toolchain agnostic solution/workaround for 0 offset:

.2byte (0x2000 | (<rs'> << 7) | (<rd'> << 2)) // replace <rs'> and <rd'>  with desired compressed registers
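For non-zero offsets, the immediate bits have to be scattered as in the diagram above. A minimal sketch of that in C, assuming the field layout shown is correct; `enc_xw_c_lbu` is a hypothetical helper name, not part of any toolchain:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: encodes xw.c.lbu rd', uimm(rs1') using the field
 * layout from the diagram above. rd and rs1 are 3-bit compressed register
 * numbers (0 = s0 ... 7 = a5); uimm is a byte offset in the range 0..31. */
static uint16_t enc_xw_c_lbu(unsigned rd, unsigned rs1, unsigned uimm)
{
    assert(rd < 8 && rs1 < 8 && uimm < 32);
    return (uint16_t)(0x2000u                 /* funct3 = 0x1, opcode = 0x0 */
        | (rd << 2)                           /* bits 4:2   rd'             */
        | (((uimm >> 1) & 0x3u) << 5)         /* bits 6:5   uimm[2:1]       */
        | (rs1 << 7)                          /* bits 9:7   rs1'            */
        | (((uimm >> 3) & 0x3u) << 10)        /* bits 11:10 uimm[4:3]       */
        | ((uimm & 0x1u) << 12));             /* bit 12     uimm[0]         */
}
```

With a zero offset this reduces to the `.2byte` expression above, e.g. `enc_xw_c_lbu(2, 3, 0)` gives 0x2188.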

xw.c.lhu

{reg:[
 { bits: 2, name: 0x2, attr: ['xw.c.lhu'] },
 { bits: 3, name: 'rd\'' },
 { bits: 2, name: `uimm[2:1]` },
 { bits: 3, name: 'rs1\'' },
 { bits: 3, name: `uimm[5:3]` },
 { bits: 3, name: 0x1, attr: ['xw.c.lhu'] },
]}

toolchain agnostic solution/workaround for 0 offset:

.2byte (0x2002 | (<rs'> << 7) | (<rd'> << 2)) // replace <rs'> and <rd'>  with desired compressed registers
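The same idea for the halfword load, as a rough sketch. It assumes the field labels in the diagram are byte-offset bits (so uimm[0] is implicitly zero and the offset is even, 0..62); that scaling is an assumption of this sketch. `enc_xw_c_lhu` is a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: encodes xw.c.lhu rd', uimm(rs1') using the field
 * layout from the diagram above. ASSUMPTION: uimm is a byte offset whose
 * bit 0 is not encoded, so it must be even and at most 62. */
static uint16_t enc_xw_c_lhu(unsigned rd, unsigned rs1, unsigned uimm)
{
    assert(rd < 8 && rs1 < 8);
    assert(uimm < 64 && (uimm & 1u) == 0);
    return (uint16_t)(0x2002u                 /* funct3 = 0x1, opcode = 0x2 */
        | (rd << 2)                           /* bits 4:2   rd'             */
        | (((uimm >> 1) & 0x3u) << 5)         /* bits 6:5   uimm[2:1]       */
        | (rs1 << 7)                          /* bits 9:7   rs1'            */
        | (((uimm >> 3) & 0x7u) << 10));      /* bits 12:10 uimm[5:3]       */
}
```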

xw.c.sb

{reg:[
 { bits: 2, name: 0x0, attr: ['xw.c.sb'] },
 { bits: 3, name: 'rs2\'' },
 { bits: 2, name: `uimm[2:1]` },
 { bits: 3, name: 'rs1\'' },
 { bits: 3, name: `uimm[0|4:3]` },
 { bits: 3, name: 0x5, attr: ['xw.c.sb'] },
]}

toolchain agnostic solution/workaround for 0 offset:

.2byte (0xa000 | (<rs'> << 7) | (<rs2'> << 2)) // replace <rs'> and <rs2'>  with desired compressed registers

xw.c.sh

{reg:[
 { bits: 2, name: 0x2, attr: ['xw.c.sh'] },
 { bits: 3, name: 'rs2\'' },
 { bits: 2, name: `uimm[2:1]` },
 { bits: 3, name: 'rs1\'' },
 { bits: 3, name: `uimm[5:3]` },
 { bits: 3, name: 0x5, attr: ['xw.c.sh'] },
]}

toolchain agnostic solution/workaround for 0 offset:

.2byte (0xa002 | (<rs'> << 7) | (<rs2'> << 2)) // replace <rs'> and <rs2'>  with desired compressed registers
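Both stores can be encoded the same way as the loads. A sketch under the same assumptions (field layout as drawn above, byte-offset immediates, halfword offset even); `enc_xw_c_sb` and `enc_xw_c_sh` are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

/* Register fields shared by both stores: rs2' in bits 4:2, rs1' in bits 9:7. */
static unsigned xw_store_regs(unsigned rs2, unsigned rs1)
{
    assert(rs2 < 8 && rs1 < 8);
    return (rs2 << 2) | (rs1 << 7);
}

/* Hypothetical helper: xw.c.sb rs2', uimm(rs1'), byte offset 0..31. */
static uint16_t enc_xw_c_sb(unsigned rs2, unsigned rs1, unsigned uimm)
{
    assert(uimm < 32);
    return (uint16_t)(0xa000u | xw_store_regs(rs2, rs1)
        | (((uimm >> 1) & 0x3u) << 5)         /* bits 6:5   uimm[2:1] */
        | (((uimm >> 3) & 0x3u) << 10)        /* bits 11:10 uimm[4:3] */
        | ((uimm & 0x1u) << 12));             /* bit 12     uimm[0]   */
}

/* Hypothetical helper: xw.c.sh rs2', uimm(rs1').
 * ASSUMPTION: byte offset, even, 0..62 (bit 0 is not encoded). */
static uint16_t enc_xw_c_sh(unsigned rs2, unsigned rs1, unsigned uimm)
{
    assert(uimm < 64 && (uimm & 1u) == 0);
    return (uint16_t)(0xa002u | xw_store_regs(rs2, rs1)
        | (((uimm >> 1) & 0x3u) << 5)         /* bits 6:5   uimm[2:1] */
        | (((uimm >> 3) & 0x7u) << 10));      /* bits 12:10 uimm[5:3] */
}
```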

notes

immediate positions were tested out in just 12 SLOC:

//global
    volatile uint8_t ttt[256];

//main
    for(int i=0; i<256; i++) {
        ttt[i] = i;
    }

//asm
    la s0, ttt
    //c.lbu
    .2byte (0x2000 | (0 << 7) | (1 << 2)) //s0 addr // 0 offset // load to s1
    .2byte (0x2000 | (0 << 7) | (2 << 2) | (1<<5)) //s0 addr // ? offset // load to a0
    .2byte (0x2000 | (0 << 7) | (3 << 2) | (1<<6)) //s0 addr // ? offset // load to a1
    .2byte (0x2000 | (0 << 7) | (4 << 2) | (1<<10)) //s0 addr // ? offset // load to a2
    .2byte (0x2000 | (0 << 7) | (5 << 2) | (1<<11)) //s0 addr // ? offset // load to a3
    .2byte (0x2000 | (0 << 7) | (6 << 2) | (1<<12)) //s0 addr // ? offset // load to a4
    .2byte (0x2000 | (0 << 7) | (7 << 2) | 0x1c60) //s0 addr // max offset // load to a5
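Because ttt[i] == i, each register should end up holding the offset that its probe encodes. A small decoder sketch for checking that, assuming the xw.c.lbu field layout from the diagram is right; `dec_xw_c_lbu_uimm` is a hypothetical name, and the values in the comments are what the diagram predicts, not measured results:

```c
#include <stdint.h>

/* Hypothetical helper: recovers the byte offset from an xw.c.lbu encoding,
 * assuming bit 12 = uimm[0], bits 6:5 = uimm[2:1], bits 11:10 = uimm[4:3]. */
static unsigned dec_xw_c_lbu_uimm(uint16_t insn)
{
    return ((insn >> 12) & 0x1u)              /* uimm[0]   */
         | (((insn >> 5) & 0x3u) << 1)        /* uimm[2:1] */
         | (((insn >> 10) & 0x3u) << 3);      /* uimm[4:3] */
}

/* Predicted by the diagram for the probes above:
 *   bit 5 -> 2, bit 6 -> 4, bit 10 -> 8, bit 11 -> 16, bit 12 -> 1,
 *   0x1c60 (all immediate bits set) -> 31. */
```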


cnlohr commented 1 year ago

What does a complete example look like? I don't understand what you are suggesting with your test.

  1. Can this be applied for use in C somehow to an existing compiler?
  2. Is this just a suggestion for when editing .S files, allowing me to manually write out the 2-byte opcode of an instruction?
  3. Is .2byte really the right directive instead of .short?

jnk0le commented 1 year ago

That's assembly only

  3. Those should be the same. My disassembly view uses .2byte, so I went with that one. It should also be possible to do it with the .insn directive, but that requires preprocessing the immediate with a macro of some sort.

jnk0le commented 1 year ago

Added stores. That should be enough to finish the USB code.

cnlohr commented 1 year ago

I still don't understand how to apply this code. I.e., can you show me an example of what I would include in my .S file, and where, to use, for instance, xw.c.lbu a3, 3(a4)?

Specifically, I can't figure out how to use this:

{reg:[
 { bits: 2, name: 0x0, attr: ['xw.c.lbu'] },
 { bits: 3, name: 'rd\'' },
 { bits: 2, name: `uimm[2:1]` },
 { bits: 3, name: 'rs1\'' },
 { bits: 3, name: `uimm[0|4:3]` },
 { bits: 3, name: 0x1, attr: ['xw.c.lbu'] },
]}

I've never seen any syntax like it before. Is it only for creating a pretty instruction map on wavedrom?

cnlohr commented 1 year ago

Is the idea I just use .word 0xblah? (Also I still don't understand .2byte)

cnlohr commented 1 year ago

ALSO! Have you found encodings of any other not-in-base C instructions?

CaiB commented 1 year ago

Here's what I've found. I haven't tested any of the immediate multipliers. Is that something you can provide insight into? I assume they are 1B for byte and 2B for half.

https://github.com/CaiB/CH32V003-Architecture-Exploration/blob/main/Analysis/NewInstructionNotes.md

jnk0le commented 1 year ago

> I've never seen any syntax like it before. Is it only for creating a pretty instruction map on wavedrom?

Yes, it's just for documentation purposes.

> Is the idea I just use .word 0xblah? (Also I still don't understand .2byte)

.word is 4 bytes. To emit 16-bit data you can use .2byte, .half, or .short. Here it doesn't matter which one you choose, but often it is not obvious what the size of those "words", "shorts", etc. is.

I.e. this: https://github.com/cnlohr/rv003usb/blob/master/rv003usb/rv003usb.S#L328

could be replaced with:

.2byte (0x2000 | (3 << 7) | (2 << 2)) //lbu  a0, 0(a1)
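To go from a mnemonic such as xw.c.lbu a3, 3(a4) to a line you can paste into a .S file, something like the following host-side sketch could generate it. The helper names `creg` and `fmt_xw_c_lbu` are made up for illustration, and the encoding assumes the xw.c.lbu diagram at the top of the thread is correct:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Maps an ABI register name to its 3-bit compressed number
 * (x8..x15 = s0, s1, a0..a5). Returns -1 for any other register. */
static int creg(const char *name)
{
    static const char *const names[8] =
        { "s0", "s1", "a0", "a1", "a2", "a3", "a4", "a5" };
    for (int i = 0; i < 8; i++)
        if (strcmp(name, names[i]) == 0)
            return i;
    return -1;
}

/* Hypothetical helper: writes a ".2byte" line for xw.c.lbu rd, uimm(rs1)
 * into out. Returns the snprintf result, or -1 if an operand is not
 * encodable (non-compressed register or offset above 31). */
static int fmt_xw_c_lbu(char *out, size_t n,
                        const char *rd, const char *rs1, unsigned uimm)
{
    int d = creg(rd), s = creg(rs1);
    if (d < 0 || s < 0 || uimm > 31)
        return -1;
    unsigned insn = 0x2000u
        | ((unsigned)d << 2)                  /* rd'        */
        | (((uimm >> 1) & 0x3u) << 5)         /* uimm[2:1]  */
        | ((unsigned)s << 7)                  /* rs1'       */
        | (((uimm >> 3) & 0x3u) << 10)        /* uimm[4:3]  */
        | ((uimm & 0x1u) << 12);              /* uimm[0]    */
    return snprintf(out, n, ".2byte 0x%04x // xw.c.lbu %s, %u(%s)",
                    insn, rd, uimm, rs1);
}
```

Under those assumptions, `fmt_xw_c_lbu(buf, sizeof buf, "a3", "a4", 3)` produces `.2byte 0x3334 // xw.c.lbu a3, 3(a4)`, and the a0, 0(a1) case gives 0x2188, matching the expression above.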

cnlohr commented 1 year ago

There is actually something a little more crucial than byte-oriented loads.

Unaligned 2-byte oriented loads.

Do you have any idea how that could be possible? Right now, my send code is noncompliant, in that when sending the CRC, the first byte of the CRC has an extra 6 cycle delay and then a 4 cycle delay toward the end :(.

It's because I can only do 1-byte reads. I heard there were unaligned-OK multi-byte reads. Do you know anything about this?

jnk0le commented 1 year ago

Half loads need to be at least 2-byte aligned.
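For reference, the usual portable fallback for an unaligned 16-bit read is to assemble it from two byte loads. This is a generic C sketch, not code from rv003usb, and it does nothing for the cycle budget since it still costs two loads:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads a little-endian 16-bit value from any address using two byte
 * loads, so no 2-byte alignment is required. Hypothetical helper name. */
static uint16_t load_u16_le_unaligned(const uint8_t *p)
{
    assert(p != NULL);
    return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
}
```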

> Do you have any idea how that could be possible? Right now, my send code is noncompliant, in that when sending the CRC, the first byte of the CRC has an extra 6 cycle delay and then a 4 cycle delay toward the end :(.

Looking at the send code, I think you can do branch nesting on the path that requires the most nop delay.

e.g. https://github.com/cnlohr/rv003usb/blob/master/rv003usb/rv003usb.S#L744

Invert this branch so it jumps to not_done_sending_data (hoist the rest of the code from here), and populate the rest of this "function" with the content of done_sending_data. load_next_byte should get faster after shrinking those 2 instructions. (Sign extension on load doesn't matter here, does it?)

If the optimizations don't work out, the CRC16 in IN transactions can be done "offline" (when filling the buffer), as in many soft USB implementations.
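A rough sketch of what "offline" could look like: compute the USB data CRC16 once when the IN buffer is filled and append it, so the send loop only streams bytes. The function names and buffer layout here are hypothetical and not taken from rv003usb; the CRC parameters are the standard USB ones (polynomial 0x8005 processed LSB-first as 0xA001, initial value 0xFFFF, result inverted):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise USB CRC16 over a data payload (reflected form of poly 0x8005). */
static uint16_t usb_crc16(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1u) ? (uint16_t)((crc >> 1) ^ 0xA001u)
                             : (uint16_t)(crc >> 1);
    }
    return (uint16_t)~crc;
}

/* Hypothetical buffer-fill routine: copies the payload and appends the
 * CRC, low byte first. buf must have room for len + 2 bytes.
 * Returns the number of bytes to transmit after the PID. */
static size_t usb_fill_in_buffer(uint8_t *buf, const uint8_t *payload, size_t len)
{
    assert(buf != NULL && payload != NULL);
    memcpy(buf, payload, len);
    uint16_t crc = usb_crc16(payload, len);
    buf[len] = (uint8_t)(crc & 0xFF);
    buf[len + 1] = (uint8_t)(crc >> 8);
    return len + 2;
}
```

The trade-off is two extra bytes of buffer per packet and some work at fill time, in exchange for removing the CRC update from the timing-critical send path.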

jnk0le commented 1 year ago

Also, send_zero_bit will fall through one implicit nop (5 compressed insns + .balign 4).

cnlohr commented 1 year ago

@jnk0le is there any way I can mail you a free devkit so you can play with this and test it more yourself? I will try to apply your suggestion above to the code.

jnk0le commented 1 year ago

Thanks, but I can work fine with the WCH devboard/TSSOP adapters (especially when shipping from 'murica is more expensive than literally buying stuff on Ali). Currently I'm limited by other factors.