Closed: wehob closed this issue 4 years ago
Make sure the compiler is invoked with the -O2 flag. Default compilation produces non-optimized code. To get decent performance you have to use -O2, as allowed/required by the benchmark.
I added the -O2 flag, but the run time increased to 358 seconds.
What frequency is your FPGA design running at? Where do you place data and stack? Make sure data and stack are located inside the DCCM. You also need to scale the DMIPS numbers with frequency.
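To make the frequency scaling concrete, here is a minimal C sketch of how a raw Dhrystone run converts to DMIPS and DMIPS/MHz. The function names are hypothetical (they are not part of the benchmark sources); the constant 1757 is the usual VAX 11/780 reference score in Dhrystones per second.

```c
#include <assert.h>

/* Dhrystones per second of the VAX 11/780, the 1 DMIPS reference machine. */
#define VAX_DHRYSTONES_PER_SEC 1757.0

/* DMIPS for a run of `runs` iterations that took `seconds` to complete. */
double dmips(double runs, double seconds)
{
    assert(seconds > 0.0);
    return (runs / seconds) / VAX_DHRYSTONES_PER_SEC;
}

/* Frequency-normalized score: DMIPS divided by the core clock in MHz. */
double dmips_per_mhz(double runs, double seconds, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return dmips(runs, seconds) / clock_mhz;
}
```

With the numbers reported in this thread (1000000 runs, 358 seconds, 40 MHz), this works out to roughly 1.6 DMIPS, or about 0.04 DMIPS/MHz.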
Check that your code is placed in a region marked as cacheable in the MRAC register and that your design has the IC enabled ...
You probably need to review/update your crt0 code and write 0x55555555 to the MRAC CSR to enable caching, prior to calling main:
li t0, 0x55555555
csrw mrac, t0
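For anyone wondering where 0x55555555 comes from, here is a small C sketch that builds the value. It assumes the MRAC layout as I understand it from the SweRV EH1 programmer's reference manual: sixteen 2-bit fields, one per 256 MB region, with the low bit meaning "cacheable" and the high bit meaning "side effect". Please verify this against the PRM for your core version; the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed per-region attribute bits (check against the PRM). */
#define MRAC_CACHEABLE   0x1u
#define MRAC_SIDE_EFFECT 0x2u

/* Return `mrac` with the 2-bit field of `region` (0..15) replaced by `attr`. */
uint32_t mrac_set_region(uint32_t mrac, unsigned region, uint32_t attr)
{
    assert(region < 16u && attr <= 3u);
    unsigned shift = 2u * region;
    mrac &= ~(UINT32_C(3) << shift);   /* clear the old attributes */
    return mrac | (attr << shift);     /* install the new ones */
}

/* Mark all 16 regions cacheable without side effects. */
uint32_t mrac_all_cacheable(void)
{
    uint32_t mrac = 0;
    for (unsigned r = 0; r < 16u; ++r)
        mrac = mrac_set_region(mrac, r, MRAC_CACHEABLE);
    return mrac;
}
```

Under that assumption, mrac_all_cacheable() yields 0x55555555, the value written in the crt0 snippet. Regions holding memory-mapped peripherals such as the UART would normally be marked with the side-effect bit instead.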
The frequency is 40 MHz. My configuration is:
./swerv.config -set reset_vec=0xf0090000 -set=iccm_enable=1
-unset=icache_enable -iccm_region=0xf -iccm_offset=0x90000
-iccm_size=64 -dccm_region=0xf -dccm_offset=0x80000
-dccm_size=64 -btb_size=512 -bht_size=2048
so I changed link.ld to look like this:
MEMORY
{
iccm (wxa!ri) : ORIGIN = 0x00090000, LENGTH = 64k
dccm (wxa!ri) : ORIGIN = 0x00080000, LENGTH = 64k
}
SECTIONS
{
__stack_size = DEFINED(__stack_size) ? __stack_size : 8K;
.text.init :
{
*(.text.init)
. = ALIGN(8);
} > iccm
.text :
{
*(.text.unlikely .text.unlikely.*)
*(.text.startup .text.startup.*)
*(.text .text.*)
*(.gnu.linkonce.t.*)
. = ALIGN(8);
} > iccm
.rodata :
{
*(.rdata)
*(.rodata .rodata.*)
*(.gnu.linkonce.r.*)
. = ALIGN(8);
} > dccm
.data :
{
*(.data .data.*)
*(.gnu.linkonce.d.*)
. = ALIGN(8);
} > dccm
The remaining sections (.sdata, .sbss, .bss, .stack) are all placed in > dccm.
But GDB cannot read the program correctly and displays
0x00000000 in ?? ()
instead of what it showed before:
_start () at /home/vdsl/swerv_eh1_fpga/software/bsp/startup.S:26
How do you load data/code into DCCM/ICCM before the run? A debugger with OpenOCD may not know how to access the CCMs.
Is the way data/code is loaded into DCCM/ICCM specified in the link.ld or startup.S files? I changed my link.ld as described above (the other sections remain the same), and startup.S is the original file from the GitHub repository.
If you put the iccm and dccm in region 0xf, then your memory map should be like this:
MEMORY
{
iccm (wxa!ri) : ORIGIN = 0xf0090000, LENGTH = 64k
dccm (wxa!ri) : ORIGIN = 0xf0080000, LENGTH = 64k
}
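As a sanity check on those ORIGIN values, here is a minimal C sketch of how a region number and an offset combine into an address, assuming the 32-bit address space is split into 16 regions of 256 MB so that the region number is the top 4 address bits. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Base address of a CCM from its region (0x0..0xf) and its offset inside that region. */
uint32_t ccm_base(uint32_t region, uint32_t offset)
{
    assert(region <= 0xFu);
    assert(offset < (UINT32_C(1) << 28));   /* offset must fit in the 256 MB region */
    return (region << 28) | offset;
}
```

With -iccm_region=0xf -iccm_offset=0x90000 this gives 0xf0090000, which also matches reset_vec in the configuration above; with -dccm_region=0xf -dccm_offset=0x80000 it gives 0xf0080000. An ORIGIN of 0x00090000 would correspond to region 0x0 instead.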
I have tried that version of link.ld, but it still does not work. The other issue says that DMA must be used to load/store from ICCM; is that correct? At least in my case.
This is correct: the CPU has no internal data path from the ICCM to the LSU (load/store unit).
I think the solution to my problem might be adding a DMA (because the LSU cannot connect to the CCMs). Below are new questions:
What memories do you have connected to the CPU buses? How do you download your test to the HW? Most likely you have an external memory slave near address 0.
If you have no DMA in your system, there are two ways to use DCCM:
1) Have the CPU copy the initial values of data variables from external memory to DCCM in the startup code;
2) Use the debugger to preload DCCM (the debugger needs to support abstract memory commands).
To use ICCM without DMA there are also two ways:
1) Your system should have a data path in the interconnect from the LSU to the DMA CPU port; then you can use CPU code to copy from external memory to ICCM;
2) Use the debugger to preload ICCM (but it should use abstract memory commands to be able to access ICCM).
However, Dhrystone performance numbers do not differ significantly whether the code is cacheable or resides in ICCM.
I finally managed to load data into DCCM from the startup code, and the DMIPS became 1.9. But the UART output now looks like this:
Dhrystone Benc
Execution endsustruhDrsoe:C
C__lb Balsue ntebnhak
sol e B
DISMz 17712 bv 7
And if I don't move rodata to DCCM, it gets a little better:
Dhrystone Benchmark, Version 2.1 (Language: C)
Execution endsts, 1000 runs through Dhrystone)
DMIPS/Mhz: 56792Gme as above
And if I remove all "\n", the whole message is displayed, but my Number_Of_Runs becomes 1000 when it should be 1000000 (every number shows only its upper 4 digits):
Dhrystone Benchmark, Version 2.1 (Language: C) Program compiled without 'register' attribute Execution starts, 1 Dhrystone Benchmark, Version 2.1 (Language: C) Program compiled without 'register' attribute Execution starts, 1000 runs through Dhrystone Execution ends Final values of the variables used in the benchmark: Int_Glob: 5 should be: 5 Bool_Glob: 1 should be: 1 Ch_1_Glob: A should be: A Ch_2_Glob: B should be: B Arr_1_Glob[8]: 7 should be: 7 Arr_2_Glob[8][7]: 1000 should be: Number_Of_Runs + 10 Ptr_Glob-> Ptr_Comp: -6859 should be: (implementation-dependent) Discr: 0 should be: 0 Enum_Comp: 2 should be: 2 Int_Comp: 1 should be: 1 Str_Comp: DRSOEPORM OESRN should be: DHRYSTONE PROGRAM, SOME STRING Next_Ptr_Glob-> Ptr_Comp: -6859 should be: (implementation-dependent), same as above Discr: 0 should be: 0 Enum_Comp: 1 should be: 1 Int_Comp: 1 should be: 1 Str_Comp: DRSOEPORM OESRN should be: DHRYSTONE PROGRAM, SOME STRING Int_1_Loc: 5 should be: 5 Int_2_Loc: 1 should be: 1 Int_3_Loc: 7 should be: 7 Enum_Loc: 1 should be: 1 Str_1_Loc: DRSOEPORM 'TSRN should be: DHRYSTONE PROGRAM, 1'ST STRING Str_2_Loc: DRSOEPORM 'DSRN should be: DHRYSTONE PROGRAM, 2'ND STRING It tooks 2 seconds. CLOCKS_PER_SEC 1000 Number_Of_Runs 1000 User_Time 2908 Microseconds for one run through Dhrystone: -0348 Dhrystones per Second: 452 VAX MIPS: 2 DMIPS/Mhz: 56792
So, what causes this? I have no clue how to solve it.
Below is what I added to link.ld:
//ram: dccm
.lalign :
{
. = ALIGN(4);
PROVIDE( _data_lma = . );
} >flash AT>flash
.dalign :
{
. = ALIGN(4);
PROVIDE( _data = . );
} >ram AT>flash
.data :
{
*(.rdata)
*(.rodata .rodata.*)
*(.gnu.linkonce.r.*)
*(.data .data.*)
*(.gnu.linkonce.d.*)
. = ALIGN(8);
PROVIDE( __global_pointer$ = . + 0x800 );
*(.sdata .sdata.*)
*(.gnu.linkonce.s.*)
. = ALIGN(8);
*(.srodata .srodata.*)
. = ALIGN(8);
} >ram AT>flash
. = ALIGN(4);
PROVIDE( _edata = . );
PROVIDE( edata = . );
startup.S:
// add the text below before the call to main
la a0, _data_lma
la a1, _data
la a2, _edata
bgeu a1, a2, 2f
1:
lw t0, (a0)
sw t0, (a1)
addi a0, a0, 4
addi a1, a1, 4
bltu a1, a2, 1b
2:
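For readers less familiar with RISC-V assembly, the loop above is roughly equivalent to the following C sketch. The function name copy_words is hypothetical; in a real startup file the three arguments would be the addresses of the linker symbols _data_lma, _data and _edata, and the routine has to run before main and before any access to initialized data.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Copy 32-bit words from the load address (flash) to the run address (DCCM)
 * until the destination pointer reaches dst_end. Copies nothing when the
 * destination range is empty, like the bgeu guard in the assembly.
 */
void copy_words(const uint32_t *src, uint32_t *dst, const uint32_t *dst_end)
{
    while (dst < dst_end)
        *dst++ = *src++;
}
```

Because the copy works in whole words, both addresses need to be 4-byte aligned, which is what the ALIGN(4) statements in the .lalign and .dalign sections take care of.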
If rodata and the stack are located in flash, the UART output is fine but the DMIPS is poor:
.rodata :
{
*(.rdata)
*(.rodata .rodata.*)
*(.gnu.linkonce.r.*)
. = ALIGN(8);
} >flash AT>flash
...
.stack ORIGIN(flash) + LENGTH(flash) - __stack_size :
{
PROVIDE( _heap_end = . );
. = __stack_size;
PROVIDE( _sp = . );
} >flash AT>flash
If rodata is in flash and the stack is in DCCM, the DMIPS is much better but the UART output is broken. If both of them are in DCCM, the UART displays nothing.
How can you place the stack in flash? Flash is read-only memory, I believe. I suggest reviewing the cmark test setup in the provided TB.
"Flash" is the memory with its origin at 0x0, which is the location of the block memory (specified in the address editor of the AXI bd). I saw that the AXI has write ports to the block RAM, so it can write to the flash, right? I will check the cmark test setup later, thanks.
I tried to run the SweRV core with the configuration from "SweRV_CoreMark_Benchmarking.pdf" on a Genesys 2 FPGA board, and it could successfully run the hello example through OpenOCD. But when I tried to run Dhrystone with "number of runs" = 1000000 through OpenOCD, it took 334 seconds and the DMIPS was very low.
So, could somebody tell me where the problem is? The only changes I made were:
The Dhrystone code is from https://github.com/sifive/benchmark-dhrystone. I copied the makefile from the hello example for Dhrystone, deleting -nostdlib and changing every hello.c to dhry_1.c dhry_2.c. Thanks!