chipsalliance / Cores-VeeR-EH1

SweRV on genesys2 #48

Closed: wehob closed this issue 4 years ago

wehob commented 4 years ago

I tried to run the SweRV core with the configuration from "SweRV_CoreMark_Benchmarking.pdf" on a Genesys 2 FPGA board. It runs the hello example successfully through OpenOCD, but when I ran Dhrystone with "number of runs" = 1000000 through OpenOCD, it took 334 seconds and the DMIPS was very low.

So, could somebody tell me where the problem is? The only changes I made were:

  1. Changed the constraint file
  2. Updated the IP for the Genesys 2 FPGA

The Dhrystone code is from https://github.com/sifive/benchmark-dhrystone. I copied the makefile from the hello example for Dhrystone, deleting -nostdlib and changing all hello.c to dhry_1.c dhry_2.c. Thanks!

jrahmeh commented 4 years ago

Make sure the compiler is invoked with the -O2 flag. Default compilation produces non-optimized code. To get decent performance you have to use -O2, as allowed/required by the benchmark.

wehob commented 4 years ago

I added the -O2 flag, but the run time increased to 358 seconds.

agrobman commented 4 years ago

What frequency is your FPGA design running at? Where do you place data/stack? Make sure data and stack are located inside DCCM. You also need to scale DMIPS numbers with frequency.

Check that your code is placed in a region marked as cacheable in the MRAC register and that your design has the instruction cache (IC) enabled ...
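
To make the frequency scaling concrete, here is a minimal sketch in Python (run on the host, not on the core). It assumes the usual Dhrystone convention that 1 DMIPS corresponds to 1757 Dhrystones per second (the VAX 11/780 reference); the function names are hypothetical, and the 40 MHz clock is only an example value to be replaced with the real one.

```python
# Sketch: convert a Dhrystone run into DMIPS and DMIPS/MHz.
# Assumption: 1 DMIPS = 1757 Dhrystones per second (VAX 11/780 reference).
VAX_DHRYSTONES_PER_SECOND = 1757


def dmips(runs, seconds):
    """Dhrystone MIPS for `runs` iterations completed in `seconds`."""
    return runs / seconds / VAX_DHRYSTONES_PER_SECOND


def dmips_per_mhz(runs, seconds, clock_hz):
    """DMIPS normalised by the core clock, so results at different clocks compare."""
    return dmips(runs, seconds) / (clock_hz / 1e6)


if __name__ == "__main__":
    runs, seconds = 1_000_000, 334.0  # figures reported at the top of this thread
    clock_hz = 40e6                   # example clock only; substitute your own
    print(f"{dmips(runs, seconds):.3f} DMIPS")
    print(f"{dmips_per_mhz(runs, seconds, clock_hz):.4f} DMIPS/MHz")
```

A result far below 1 DMIPS/MHz usually points at where code and data are placed rather than at the clock itself.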

agrobman commented 4 years ago

You probably need to review/update your crt0 code and write 0x55555555 to the MRAC CSR to enable caching before calling main:

    li t0, 0x55555555
    csrw mrac, t0
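
For reference, the value 0x55555555 can be unpacked as follows. This is a minimal host-side sketch in Python, assuming the MRAC layout of 16 regions with two bits each, where the lower bit of each pair marks the region cacheable and the upper bit marks it as having side effects. decode_mrac and encode_mrac are hypothetical helper names, and the layout should be checked against the PRM for your core version.

```python
# Sketch: encode/decode MRAC under the assumed layout
#   bit 2*r     -> region r is cacheable
#   bit 2*r + 1 -> region r has side effects
NUM_REGIONS = 16


def decode_mrac(value):
    """Return a list of (cacheable, side_effect) tuples for regions 0..15."""
    return [
        (bool((value >> (2 * r)) & 1), bool((value >> (2 * r + 1)) & 1))
        for r in range(NUM_REGIONS)
    ]


def encode_mrac(cacheable_regions):
    """Build an MRAC value that marks only the given regions as cacheable."""
    value = 0
    for r in cacheable_regions:
        if not 0 <= r < NUM_REGIONS:
            raise ValueError(f"region {r} out of range")
        value |= 1 << (2 * r)
    return value


if __name__ == "__main__":
    print(hex(encode_mrac(range(NUM_REGIONS))))
```

Under that assumption, 0x55555555 marks all 16 regions cacheable with no side effects, while encode_mrac({0}) would enable caching only for the region that starts at address 0.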

wehob commented 4 years ago

The frequency is 40 MHz. My configuration is:

./swerv.config -set reset_vec=0xf0090000 -set=iccm_enable=1
-unset=icache_enable -iccm_region=0xf -iccm_offset=0x90000
-iccm_size=64 -dccm_region=0xf -dccm_offset=0x80000
-dccm_size=64 -btb_size=512 -bht_size=2048

So I changed link.ld to look like this:

MEMORY
{
  iccm  (wxa!ri) : ORIGIN = 0x00090000, LENGTH = 64k
  dccm  (wxa!ri) : ORIGIN = 0x00080000, LENGTH = 64k
}

SECTIONS
{
  __stack_size = DEFINED(__stack_size) ? __stack_size : 8K;

  .text.init    :
  {
    *(.text.init)
    . = ALIGN(8);
  } > iccm

  .text :
  {
    *(.text.unlikely .text.unlikely.*)
    *(.text.startup .text.startup.*)
    *(.text .text.*)
    *(.gnu.linkonce.t.*)
    . = ALIGN(8);
  } > iccm

  .rodata   :
  {
    *(.rdata)
    *(.rodata .rodata.*)
    *(.gnu.linkonce.r.*)
    . = ALIGN(8);
  } > dccm

  .data :
  {
    *(.data .data.*)
    *(.gnu.linkonce.d.*)
    . = ALIGN(8);
  } > dccm

The remaining sections (.sdata, .sbss, .bss, .stack) are all > dccm.

But GDB cannot read it correctly and displays

0x00000000 in ?? ()

instead of what it showed before:

_start () at /home/vdsl/swerv_eh1_fpga/software/bsp/startup.S:26

agrobman commented 4 years ago

How do you load data/code into DCCM/ICCM before the run? The debugger with OpenOCD may not know how to access the CCMs.

wehob commented 4 years ago

Is how to load data/code into DCCM/ICCM specified in the link.ld or startup.S files? My link.ld file has been changed as I described above (the other sections remain the same), and startup.S is the original file provided on GitHub.

jrahmeh commented 4 years ago

If you put the iccm and dccm in region 0xf, then your memory map should be like this:

MEMORY
{
  iccm  (wxa!ri) : ORIGIN = 0xf0090000, LENGTH = 64k
  dccm  (wxa!ri) : ORIGIN = 0xf0080000, LENGTH = 64k
}
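
The relation between the configured region/offset and these ORIGIN values can be sketched as below. This is a minimal host-side Python sketch, assuming the 4 GB address space is split into 16 regions of 256 MB so that the region number forms the top four address bits; ccm_base and region_of are hypothetical helper names.

```python
# Sketch: map a (region, offset) pair from the core configuration to an address,
# assuming region = address bits [31:28] and offset = address bits [27:0].
REGION_SHIFT = 28


def ccm_base(region, offset):
    """Absolute base address for a memory at `offset` inside `region`."""
    if not 0 <= region <= 0xF:
        raise ValueError("region must be in 0..0xf")
    if not 0 <= offset < (1 << REGION_SHIFT):
        raise ValueError("offset must fit inside one 256 MB region")
    return (region << REGION_SHIFT) | offset


def region_of(address):
    """Region number that a 32-bit address falls into."""
    return (address >> REGION_SHIFT) & 0xF


if __name__ == "__main__":
    print(hex(ccm_base(0xF, 0x90000)))  # ICCM settings from this thread
    print(hex(ccm_base(0xF, 0x80000)))  # DCCM settings from this thread
    print(region_of(0x00090000))        # origin used in the earlier link.ld
```

With iccm_region=0xf and iccm_offset=0x90000 this gives the same address as reset_vec, whereas an ORIGIN of 0x00090000 falls into region 0.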

wehob commented 4 years ago

I have tried that version of link.ld, but it still does not work. Another issue says that DMA must be used to load/store from ICCM. Is that correct, at least in my case?

agrobman commented 4 years ago

This is correct. The CPU has no internal data path from ICCM to the LSU (load/store unit).

wehob commented 4 years ago

I think the solution to my problem might be adding DMA (because the LSU cannot connect to the CCMs). Below are some new questions:

  1. Why does the program only work when "RAM(wxa!ri) : ORIGIN = 0x00000000, LENGTH = 64k"? My reset_vec=0xf0090000 does not point to 0x0.
  2. Which memory does the data reside in? The manual says that this core only has local ICCM/DCCM memory, but there is no data path between ICCM/DCCM and the LSU, so it shouldn't work (but it does).

agrobman commented 4 years ago

What memories do you have connected to the CPU buses? How do you download your test to the HW? Most likely you have an external memory slave near address 0.

If you have no DMA in your system, there are two ways to use DCCM:

1) Have the CPU copy the initial values of data variables from external memory to DCCM in the startup code;

2) Use the debugger to preload DCCM (the debugger needs to support abstract memory commands).

To use ICCM without DMA there are also two ways:

1) Your system should have a data path in the interconnect from the LSU to the DMA CPU port; then you can use CPU code to copy from external memory to ICCM;

2) Use the debugger to preload ICCM (but it should use abstract memory commands to be able to access ICCM).

However, Dhrystone performance numbers do not differ significantly whether the code is cacheable or resides in ICCM.

wehob commented 4 years ago

I finally got the data loaded into DCCM by the startup code, and the DMIPS became 1.9. But the UART output became like this:

Dhrystone Benc
Execution endsustruhDrsoe:C
C__lb      Balsue ntebnhak
                sol e  B
DISMz                  17712 bv    7

And if I don't move rodata to DCCM, it gets a little better:

Dhrystone Benchmark, Version 2.1 (Language: C)
Execution endsts, 1000 runs through Dhrystone)
DMIPS/Mhz:                                   56792Gme as above

And if I remove all "\n", it can display the whole message, but my Number_Of_Runs becomes 1000 when it should be 1000000 (all numbers show only their upper 4 digits):

 Dhrystone Benchmark, Version 2.1 (Language: C)  Program compiled without 'register' attribute  Execution starts, 1 Dhrystone Benchmark, Version 2.1 (Language: C)  Program compiled without 'register' attribute  Execution starts, 1000 runs through Dhrystone Execution ends  Final values of the variables used in the benchmark:  Int_Glob:            5         should be:   5 Bool_Glob:           1         should be:   1 Ch_1_Glob:           A         should be:   A Ch_2_Glob:           B         should be:   B Arr_1_Glob[8]:       7         should be:   7 Arr_2_Glob[8][7]:    1000         should be:   Number_Of_Runs + 10 Ptr_Glob->   Ptr_Comp:          -6859         should be:   (implementation-dependent)   Discr:             0         should be:   0   Enum_Comp:         2         should be:   2   Int_Comp:          1         should be:   1   Str_Comp:          DRSOEPORM OESRN         should be:   DHRYSTONE PROGRAM, SOME STRING Next_Ptr_Glob->   Ptr_Comp:          -6859         should be:   (implementation-dependent), same as above   Discr:             0         should be:   0   Enum_Comp:         1         should be:   1   Int_Comp:          1         should be:   1   Str_Comp:          DRSOEPORM OESRN         should be:   DHRYSTONE PROGRAM, SOME STRING Int_1_Loc:           5         should be:   5 Int_2_Loc:           1         should be:   1 Int_3_Loc:           7         should be:   7 Enum_Loc:            1         should be:   1 Str_1_Loc:           DRSOEPORM 'TSRN         should be:   DHRYSTONE PROGRAM, 1'ST STRING Str_2_Loc:           DRSOEPORM 'DSRN         should be:   DHRYSTONE PROGRAM, 2'ND STRING  It tooks 2 seconds. CLOCKS_PER_SEC 1000 Number_Of_Runs 1000 User_Time 2908 Microseconds for one run through Dhrystone: -0348  Dhrystones per Second:                      452  VAX MIPS:                                   2  DMIPS/Mhz:                                   56792

So, what causes this? I don't have a clue how to solve it.

Below is what I added to link.ld:

/* ram: dccm */
  .lalign         :
  {
    . = ALIGN(4);
    PROVIDE( _data_lma = . );
  } >flash AT>flash 

  .dalign         :
  {
    . = ALIGN(4);
    PROVIDE( _data = . );
  } >ram AT>flash 

  .data          :
  {
    *(.rdata)
    *(.rodata .rodata.*)
    *(.gnu.linkonce.r.*)
    *(.data .data.*)
    *(.gnu.linkonce.d.*)
    . = ALIGN(8);
    PROVIDE( __global_pointer$ = . + 0x800 );
    *(.sdata .sdata.*)
    *(.gnu.linkonce.s.*)
    . = ALIGN(8);
    *(.srodata .srodata.*)
    . = ALIGN(8);
  } >ram AT>flash 

  . = ALIGN(4);
  PROVIDE( _edata = . );
  PROVIDE( edata = . );

startup.S:

    // added the text below before the call to main
    la a0, _data_lma
    la a1, _data
    la a2, _edata
    bgeu a1, a2, 2f
1:
    lw t0, (a0)
    sw t0, (a1)
    addi a0, a0, 4
    addi a1, a1, 4
    bltu a1, a2, 1b
2:
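
As a reading aid, the loop above can be modelled on the host. This is a minimal Python sketch, not firmware: memory is a hypothetical dict from word address to value, copy_data is a hypothetical name, and the addresses in the example are made up. It only mirrors the control flow of the assembly, where a0 is the load address (_data_lma), a1 the run address (_data) and a2 the end of the run-time range (_edata).

```python
# Sketch: host-side model of the .data copy loop in startup.S.
def copy_data(mem, data_lma, data, edata):
    """Copy 32-bit words from data_lma to [data, edata), one word per pass."""
    src, dst = data_lma, data
    while dst < edata:       # bgeu a1, a2, 2f on entry / bltu a1, a2, 1b at the end
        mem[dst] = mem[src]  # lw t0, (a0) ; sw t0, (a1)
        src += 4             # addi a0, a0, 4
        dst += 4             # addi a1, a1, 4


if __name__ == "__main__":
    # Hypothetical layout: two initialised words stored at 0x1000 (load address),
    # copied to 0xf0080000 (run address inside DCCM).
    mem = {0x1000: 0x11111111, 0x1004: 0x22222222}
    copy_data(mem, 0x1000, 0xF0080000, 0xF0080008)
    print({hex(addr): hex(word) for addr, word in mem.items()})
```

Because the loop moves whole words, it relies on _data and _edata being 4-byte aligned, which the ALIGN(4) statements in the linker script provide.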

wehob commented 4 years ago

If rodata and the stack are located in flash, the UART output is fine but the DMIPS is poor:

  .rodata :
  {
    *(.rdata)
    *(.rodata .rodata.*)
    *(.gnu.linkonce.r.*)
    . = ALIGN(8);
  } >flash AT>flash 

...

.stack ORIGIN(flash) + LENGTH(flash) - __stack_size :
  {
    PROVIDE( _heap_end = . );
    . = __stack_size;
    PROVIDE( _sp = . );
  } >flash AT>flash 

If rodata is in flash and the stack is in DCCM, the DMIPS is much better but the UART output breaks. If both of them are in DCCM, the UART displays nothing.

agrobman commented 4 years ago

How can you place the stack in flash? Flash is read-only memory, I believe. I suggest reviewing the cmark test setups in the provided TB.

wehob commented 4 years ago

"flash" is the memory region with origin 0x0, which is the location of the block memory (specified in the address editor of the AXI BD). I saw that the AXI has write ports to the block RAM, so it can write to "flash", right? I will check the cmark test setups later, thanks.