YosysHQ / picorv32

PicoRV32 - A Size-Optimized RISC-V CPU
ISC License

Performance #126

Closed by drtrigon 5 years ago

drtrigon commented 5 years ago

Naive question: is it plausible that the following assembler code (excluding the variable declaration) needs 256 cycles, with every further iteration (++us) adding another 197 cycles?

unsigned int us = 1;
__asm__ __volatile__ (
      "1: addi %0,%0,-1" "\n\t"
      /* "+r": us is read and written, so it must be one read-write operand */
      "bnez %0,1b" : "+r" (us)
    );

I timed it using

  __asm__ volatile ("rdcycle %0" : "=r"(a));

with a being a uint32_t.

This seems like a lot for just an add and a compare/branch, especially considering https://github.com/cliffordwolf/picorv32#cycles-per-instruction-performance.

The reason I'm asking is that I'm trying to implement an equivalent of the Arduino delayMicroseconds, and this delay code takes 16-21 us on an Alhambra II board running at 12 MHz.
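For reference, the cycle counts and the 16-21 us figure are consistent with each other at 12 MHz. A minimal sketch of the arithmetic follows; the function names are made up for illustration, and it assumes the 256-cycle figure covers the first iteration including measurement overhead:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a cycle count to nanoseconds for a given core clock.
 * Integer math, rounds down. */
static uint64_t cycles_to_ns(uint32_t cycles, uint32_t clock_hz)
{
    assert(clock_hz > 0);
    return ((uint64_t)cycles * 1000000000ull) / clock_hz;
}

/* Total cycles for a number of loop iterations, using the figures
 * measured in this issue: 256 for the first iteration (assumed to
 * include overhead), 197 for each further one. */
static uint32_t loop_cycles(uint32_t iterations)
{
    assert(iterations >= 1);
    return 256u + (iterations - 1u) * 197u;
}
```

With a 12 MHz clock, cycles_to_ns(256, 12000000) gives roughly 21.3 us and cycles_to_ns(197, 12000000) roughly 16.4 us, which matches the observed range.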

daveshah1 commented 5 years ago

Are you running the code out of SPI flash by any chance?

drtrigon commented 5 years ago

Yes I am. Is there a way to speed this up? Cache or anything else?

What setup is needed to achieve the numbers given in https://github.com/cliffordwolf/picorv32#cycles-per-instruction-performance?

cliffordwolf commented 5 years ago

Cache or anything else?

Yes, you can create a cache. Or you can just copy all performance-critical code to RAM, or execute from a ROM. Whatever works for you. These are decisions you have to make about the system you are building and have nothing to do with PicoRV32 itself, as PicoRV32 is just the processor core.

What setup is needed to achieve the numbers given in [..]

Just a fast RAM. See https://github.com/cliffordwolf/picorv32/blob/master/dhrystone/testbench.v for the setup.
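As an illustration of the "copy performance-critical code to RAM" option, here is a rough sketch using a GCC section attribute. The section name .ramfunc, the RAMFUNC macro and busy_loop are hypothetical; this only works if the linker script places that section in RAM and the startup code copies it there from flash, neither of which is shown here:

```c
#include <stdint.h>

/* Hypothetical section name: the linker script must map ".ramfunc" to RAM
 * and the startup code must copy its contents from flash before use. */
#if defined(__riscv)
#define RAMFUNC __attribute__((section(".ramfunc"), noinline))
#else
#define RAMFUNC /* host build: no special placement */
#endif

/* Busy loop meant to run from RAM so that instruction fetches do not go
 * through SPI flash. Returns the number of iterations executed. */
RAMFUNC uint32_t busy_loop(uint32_t n)
{
    uint32_t done = 0;
    /* volatile counter keeps the compiler from collapsing the loop */
    for (volatile uint32_t i = n; i != 0; i--)
        done++;
    return done;
}
```

The volatile counter adds loads and stores on every iteration, so the per-iteration cycle count has to be measured again (for example with rdcycle) once the function actually runs from RAM.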

drtrigon commented 3 years ago

Sorry for coming back to this so late, but my time is limited... ;)

The project is: https://github.com/drtrigon/fpgarduino-icestorm

I would still like to make this work. The hardware is an Alhambra board and the development tool is icestudio. I can think of 3 possible ways to speed this up:

  1. use a cache as mentioned; can you give me some hints on how to implement this in icestudio?
  2. use a PLL to generate a faster clock; here I would need some hints on how to implement and use a faster clock?
  3. use a template for delayMicroseconds that basically just adds the correct number of nops
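Whichever option is chosen, the loop count for a requested delay has to be derived from measured cycle counts. A sketch of that calibration, assuming the figures from earlier in this thread (12 MHz clock, 256 cycles for the first iteration, 197 for each further one); the macro and function names are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define CLOCK_MHZ       12u   /* Alhambra II clock from this thread */
#define FIRST_ITER_CYC 256u   /* measured: first iteration incl. overhead */
#define NEXT_ITER_CYC  197u   /* measured: each further iteration */

/* Number of loop iterations that comes closest to the requested delay.
 * Never returns less than 1, since one iteration is the minimum wait. */
static uint32_t delay_iterations(uint32_t us)
{
    assert(us <= UINT32_MAX / CLOCK_MHZ);  /* avoid overflow below */
    uint32_t cycles = us * CLOCK_MHZ;
    if (cycles <= FIRST_ITER_CYC)
        return 1u;
    /* round to the nearest whole iteration */
    return 1u + (cycles - FIRST_ITER_CYC + NEXT_ITER_CYC / 2u) / NEXT_ITER_CYC;
}
```

With these numbers one iteration already takes about 21 us, so delays shorter than that cannot be hit when running from SPI flash; the constants would need to be re-measured after moving the loop to RAM or changing the clock.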

@cliffordwolf: The reason I asked here is that a cache controller (L1, L2) is usually implemented in the processor.