esa-tu-darmstadt / tapasco-riscv

RISC-V soft-core PEs for TaPaSCo

Flute SSITH_P2 core #35

Closed: gulmezmerve closed this issue 2 months ago

gulmezmerve commented 3 months ago

Hi,

I have been using the TaPaSCo framework with the SSITH_P2 core. Unfortunately, I am not able to get print working. The PE traps when it accesses the memory provided by the host: https://github.com/esa-tu-darmstadt/tapasco-riscv/blob/5d1a235511fe37de134cc6e0e4e210387ea92955/programming/examples/PE/rv_pe.h#L16. It seems that the PE side doesn't have access to the host memory. I don't know how to debug this issue. Do you have any insight, or a way we could verify it?

cahz commented 3 months ago

Can you verify that the address map of your core provides access to the tapasco-riscv infrastructure from the RAM_OFFSET address and upwards?

gulmezmerve commented 3 months ago

Hi, thanks for the response. I have no idea how I can verify that :/ If you don't mind, I would be happy to get some guidance.

cahz commented 3 months ago

For the Flute (32-bit), we changed the address map and also disabled the caches (return False):

https://github.com/esa-tu-darmstadt/tapasco-riscv/blob/5d1a235511fe37de134cc6e0e4e210387ea92955/riscv/flute32/flute_tapasco.patch#L68-L74

For SSITH_P2 you need to do it similarly, but change it to our 64-bit offset address.

gulmezmerve commented 3 months ago

Yeah, I did it like that. I also tried mem0_controller_addr_range with both h_8000_0000 and h_0001_0000_0000_0000 as the base; with both, the PE is not able to access the memory provided by the host. The simple_sum examples are working, but I just wasn't able to get print working, and I need it for my benchmark.

diff --git a/src_Testbench/SoC/SoC_Map.bsv b/src_Testbench/SoC/SoC_Map.bsv
index da8d2d2..7e6c8b4 100644
--- a/src_Testbench/SoC/SoC_Map.bsv
+++ b/src_Testbench/SoC/SoC_Map.bsv
@@ -121,8 +121,8 @@ module mkSoC_Map (SoC_Map_IFC);
    // Near_Mem_IO (including CLINT, the core-local interruptor)

    let near_mem_io_addr_range = Range {
-      base: 'h_0200_0000,
-      size: 'h_0000_C000    // 48K
+      base: 'h_0001_0000,
+      size: 'h_0000_0000    // 48K
    };

    // ----------------------------------------------------------------
@@ -130,15 +130,15 @@ module mkSoC_Map (SoC_Map_IFC);

    let plic_addr_range = Range {
       base: 'h0C00_0000,
-      size: 'h0040_0000     // 4M
+      size: 'h0000_0000     // 4M
    };

    // ----------------------------------------------------------------
    // UART 0

    let uart0_addr_range = Range {
-      base: 'hC000_0000,
-      size: 'h0000_0080     // 128
+      base: 'h0010_0000,
+      size: 'h7FF0_0080     // 128
    };

    // ----------------------------------------------------------------
@@ -158,16 +158,16 @@ module mkSoC_Map (SoC_Map_IFC);
    // Boot ROM

    let boot_rom_addr_range = Range {
-      base: 'h_0000_1000,
-      size: 'h_0000_1000    // 4K
+      base: 'h_0000_0000,
+      size: 'h_0000_8000    // 4K
    };

    // ----------------------------------------------------------------
    // Main Mem Controller 0

    let mem0_controller_addr_range = Range {
-      base: 'h_8000_0000,
-      size: 'h_4000_0000    // 1 GB
+      base: 'h_0001_0000_0000_0000,
+      size: 'h_8000_0000    // 1 GB
    };

    // ----------------------------------------------------------------
@@ -195,7 +195,7 @@ module mkSoC_Map (SoC_Map_IFC);
       size: fromInteger(valueOf(RVFI_DII_Mem_Size))
    };
    function Bool fn_is_mem_addr (Fabric_Addr addr);
-       return (inRange(rvfi_cached, addr));
+       return False;
    endfunction
    function Bool fn_is_IO_addr (Fabric_Addr addr);
        return False;
@@ -207,10 +207,7 @@ module mkSoC_Map (SoC_Map_IFC);
    // (Caches need this information to cache these addresses.)

    function Bool fn_is_mem_addr (Fabric_Addr addr);
-       return (  inRange(boot_rom_addr_range, addr)
-         || inRange(mem0_controller_addr_range, addr)
-         || inRange(tcm_addr_range, addr)
-         );
+       return False; 
    endfunction

    // ----------------------------------------------------------------
@@ -219,9 +216,7 @@ module mkSoC_Map (SoC_Map_IFC);
    // (Caches need this information to avoid cacheing these addresses.)

    function Bool fn_is_IO_addr (Fabric_Addr addr);
-      return (   inRange(near_mem_io_addr_range, addr)
-              || inRange(plic_addr_range, addr)
-              || inRange(uart0_addr_range, addr));
+      return True;
    endfunction
 `endif
    // ----------------------------------------------------------------
cahz commented 3 months ago

Are you using it on a ZynqMP-based system (ZCU102)? Then you probably also need to allocate a memory buffer from the software side, otherwise the SMMU will block the accesses.

gulmezmerve commented 3 months ago

Yes, exactly, I have been using the ZCU102.

I updated my comment @cahz

I understand from the C++ API that I can do memory allocation from the PE side.

    tapasco_handle_t stdoutBuf_device;
    tapasco.alloc(stdoutBuf_device, sizeof(unsigned char) * STDOUT_BUF);

I later copy from the PE with:

tapasco.copy_from(stdoutBuf_device, stdoutBuf, STDOUT_BUF); 

I am getting this error:

[2024-06-16T21:02:58Z ERROR tapasco::ffi] Setting LAST_ERROR: Error during Allocator operation: VFIO allocator requires va argument, none given
Error during Allocator operation: VFIO allocator requires va argument, none given
terminate called after throwing an instance of 'tapasco::tapasco_error'
  what():  Error during Allocator operation: VFIO allocator requires va argument, none given

Do you have any idea? Is it an old API?

yannickl96 commented 3 months ago

I understand from the C++ API that I can do memory allocation from the PE side.

Hi @gulmezmerve, just to clarify: are you trying to use the TaPaSCo C++ API from the RISC-V core? That would be the wrong way round. You need to allocate all necessary shared buffers from the host software running on the ZynqMP PS, in the case of the ZCU102. You can find that in https://github.com/esa-tu-darmstadt/tapasco-riscv/blob/master/programming/examples/host/simple_sum/simple_sum_host.cpp#L56, where we allocate the array on the host side and then provide it to the RISC-V core here. makeWrappedPointer makes the buffer pointer available to the PE, and makeOutOnly tells the runtime that it does not have to copy anything to the buffer upon job launch, only after the job finishes. Thus, you don't have to copy the buffer back to the host explicitly; the runtime already does that for you. Passing the result as an argument to the launch writes the buffer address into the RVController ARG3 CSR, which is accessed by the print function.
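
Condensed, that host-side pattern looks roughly like the sketch below. It reuses names from this thread (PE_ID, the Arg 0 program buffer) and mirrors the launch shown later in this issue; STDOUT_BUF and the loadFile helper are hypothetical placeholders, and the exact launch signature should be checked against your TaPaSCo version:

// Hedged sketch of the host-side allocation/launch pattern described above;
// STDOUT_BUF and loadFile are placeholders, not names from the repository.
#include <tapasco.hpp>
#include <vector>
using namespace tapasco;

#define PE_ID 1747       // Piccolo, see below
#define STDOUT_BUF 4096  // assumed stdout buffer size

int main() {
    Tapasco tapasco;
    std::vector<char> program = loadFile("simple_sum.bin"); // hypothetical loader
    auto program_buffer_in = makeWrappedPointer(program.data(), program.size());
    int a = 42, b = 1337, retval = 0;
    unsigned char stdoutBuf[STDOUT_BUF] = {0};

    // makeWrappedPointer exposes the host buffer to the PE; makeOutOnly skips
    // the copy to the device at launch and copies back after the job ends.
    auto job = tapasco.launch(PE_ID, retval,
                              program_buffer_in, // Arg 0: PE program binary
                              a,                 // Arg 1
                              b,                 // Arg 2
                              makeOutOnly(makeWrappedPointer(stdoutBuf, STDOUT_BUF))); // Arg 3 -> ARG3 CSR
    job(); // wait for the PE; the runtime copies stdoutBuf back here
    return 0;
}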

Just to be sure that everything is set up correctly and working as expected, could you try to run the simple_sum example using the Piccolo core? I suggest that you modify the PE example code to:

#include "../rv_pe.h"

int main()
{
    int a = readFromCtrl(ARG1);
    int b = readFromCtrl(ARG2);
    writeToCtrl(RETL, a + b);
    initPrint();
        print("Finished writing result to host.\n");

    setIntr();
    return 0;
}

Build the RISC-V binary and then copy it to the ZCU102 along with the simple_sum host example.

Building Piccolo (make sure Vivado is on your path and you sourced the TaPaSCo workspace setup script):

make piccolo32_pe
tapasco compose [piccolo32_pe x 1] @ 50 MHz -p zcu102

Host software

Make sure to set the BRAM_SIZE and the PE_ID macros in simple_sum_host.cpp correctly. The PE_ID for Piccolo is 1747. The default BRAM_SIZE as built in the previous commands is 0x8000.

#define PE_ID 1747

#define BRAM_SIZE 0x8000

Build the host software on the ZCU102 using CMake and execute it:

/path/to/simple_sum_host /path/to/simple_sum.bin

The output should look roughly like this:

$ ./simple_sum_host ../../simple_sum.bin 
Finished reading binary file. Received 16796 bytes.
Waiting for RISC-V 
RISC-V return value: 1379
First program bytes: 0
RiscV STDOUT: Finished writing result to host.

Please let me know if that helped.

Best, Yannick

gulmezmerve commented 3 months ago

Hi @yannickl96, thanks for the reply. I did everything you said; I am familiar with those steps. But the problem is that the PE side isn't able to write stdoutBuf; it traps: https://github.com/esa-tu-darmstadt/tapasco-riscv/blob/5d1a235511fe37de134cc6e0e4e210387ea92955/programming/examples/PE/rv_pe.h#L113. This memory is not accessible to the PE side. That's why I was trying to allocate the buffer in PE-local memory directly and read it from there on the host side.

gulmezmerve commented 3 months ago

flute32_pe and flute64_pe work for me, but I have been trying to get print working with this core: https://github.com/CTSRD-CHERI/Flute/tree/CHERI/src_SSITH_P2. The PE side is not able to access stdoutBuf.

yannickl96 commented 3 months ago

Are you able to kill the program execution from the ZCU102, or is everything freezing completely, i.e., requiring a complete cold restart of the board?

Another possibility to debug the issue is to add an ILA to your PE design, attach it to the RISC-V core's data memory port and to the input and output of the dmaOffset core, and check whether the stdout buffer address is used correctly along the entire memory path. If you add an ILA, you need to build the bitstream using:

tapasco compose [your_pe x 1] @ Freq MHz -p zcu102 --features 'Debug {enabled: true}'

gulmezmerve commented 3 months ago

I can kill the execution without doing a cold start. I am not familiar with the ILA at the moment :/

Except for the print function, I can run everything, read the args, or get the return value from https://github.com/CTSRD-CHERI/Flute/tree/CHERI/src_SSITH_P2. The only problem I have now is accessing the memory provided by the host side. That's why I was asking whether I can put the stdout buffer in PE-local memory and read it from the host side, if that is feasible.

yannickl96 commented 3 months ago

In theory, you can do that. You just have to add a makeLocal around the makeOutOnly for the STDOUT buffer, and make sure that the size of the buffer fits into your remaining data memory and that the start address of the buffer is aligned to a 64-bit boundary on ZynqMP platforms. You then have to remove the + RAM_OFFSET in rv_pe.h, since that will typically route to PE-external memory. The new job launch then looks like this:

auto job = tapasco.launch(
        peID,                                                  // Processing Element ID
        retval,                                                // return value
        program_buffer_in,                                     // Program is passed as Arg 0
        a,                                                     // Arg 1
        b,                                                     // Arg 2
        addOffset(0x6000, makeLocal(makeOutOnly(makeWrappedPointer(stdoutBuf, STDOUT_BUF)))) // Arg 3
    );

With the addOffset, we explicitly put the buffer at address 0x6000 in the PE-local memory.
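
On the PE side, the counterpart would then look roughly like this sketch; it is an assumption derived from the description above (ARG3 holds the buffer address, and the + RAM_OFFSET is removed), not code taken from rv_pe.h:

#include "../rv_pe.h"

int main() {
    // Hedged assumption: with makeLocal + addOffset(0x6000), the ARG3 CSR
    // holds the PE-local address 0x6000 directly, so no RAM_OFFSET is added.
    volatile char *buf = (volatile char *)readFromCtrl(ARG3);
    buf[0] = 'H'; // writes land in PE-local data memory
    setIntr();
    return 0;
}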

Best, Yannick

gulmezmerve commented 3 months ago

Thanks for your guidance @yannickl96. It turns out that the PE is not able to write to its own local memory either. I defined a local array on the stack, and that code doesn't work. It is strange: the code is able to start, and the simple_sum examples work, but the PE isn't able to write to any memory, including its own.

int main() {
    initInterrupts();
    initPrint();

    int a = readFromCtrl(ARG1);
    int d = readFromCtrl(ARG2);
    volatile char b[10];
    const char *str = "H!\n";
    int out_idx = 0;

    for (const char *c = str; *c; ++c, ++out_idx)
        b[out_idx] = *c;

    writeToCtrl(RETL, a + d);

    setIntr();
    return 0;
}

I compiled with these commands:

make BRAM_SIZE=0x100000 flute64cheri_pe
tapasco compose [flute64cheri_pe x 1] @ 100 MHz -p zcu102

and later built the application with make SIZE=0x100000 PROGRAM=read_dm.

yannickl96 commented 3 months ago

Okay, is the inability to write to its own local memory again indicated by a trap? I overlooked that you are working with a 64-bit version. Please try to recompile your program with make SIZE=0x100000 PROGRAM=read_dm RV64=1 so your compiler uses the correct march and mabi flags.

Apart from that, it would be really interesting to see whether memory requests actually get forwarded to the data memory bus. We have two routes we can go from here: debugging the hardware with the ILA, or trying to get simulation up and running. Simulation may be helpful for seeing the $display statements inside the core. If the core traps, we may be able to tell whether the problem was an error response from the memory bus or the internal address map.

On a different note: Why do you have fn_is_mem_addr twice in your SoC_Map patch? Once w.r.t. RVFI and once w.r.t. boot_rom, TCM and mem0_controller. Is the Bluespec compiler not complaining about that?

gulmezmerve commented 3 months ago

Okay, is the inability to write to its own local memory again indicated by a trap? I overlooked that you are working with a 64-bit version. Please try to recompile your program with make SIZE=0x100000 PROGRAM=read_dm RV64=1 so your compiler uses the correct march and mabi flags.

Yes, it indicates the trap, unfortunately. I am compiling with clang because it supports the CHERI core. I tested my clang environment with vanilla 64-bit RISC-V, and it works, so I don't think compiling with clang is the problem. As a side note, I just realized that if you compile with -O2, the print function is completely optimized away. It would be good to have __attribute__((optnone)) on the TaPaSCo print function!
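
For illustration, the suggestion amounts to something like the sketch below. The print signature is assumed from its usage in this thread, printChar is a hypothetical per-character helper, and optnone is clang-specific (GCC would need, e.g., __attribute__((optimize("O0")))):

// Hedged sketch: keep clang from optimizing the print body away at -O2.
__attribute__((optnone)) void print(const char *str) {
    for (const char *c = str; *c; ++c)
        printChar(*c); // hypothetical per-character output helper
}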

On a different note: Why do you have fn_is_mem_addr twice in your SoC_Map patch? Once w.r.t. RVFI and once w.r.t. boot_rom, TCM and mem0_controller. Is the Bluespec compiler not complaining about that?

One of the fn_is_mem_addr definitions is guarded by an ifdef/else: https://github.com/CTSRD-CHERI/Flute/blob/3fb6e6677ac92bf87f871038302d0153b3790885/src_Testbench/SoC/SoC_Map.bsv#L209

I will try to get it working with the simulator; I hope I can manage it.

yannickl96 commented 3 months ago

As a side note, I just realized that if you compile with -O2, the print function is completely optimized away. It would be good to have __attribute__((optnone)) on the TaPaSCo print function!

Thank you very much for the hint!

I will try to get it working with the simulator; I hope I can manage it.

Feel free to reach out for further assistance!

gulmezmerve commented 3 months ago

Hi,

Finally, I am able to run the Questa simulator for flute64_pe, but I cannot find any example of how to run software on the simulator; I am probably missing that part. I came across this line: "Make sure to select the correct TaPaSCo kernel-device in your software when instantiating the Tapasco Class/Structure."

As far as I understand, it shouldn't use the tlkm driver when we run it on the simulator.

Do you have any example code for the host side that I could try?

Best, Merve

yannickl96 commented 3 months ago

Hi again! The simulator interacts with the entire TaPaSCo software stack, including the TLKM. Thus, you have to load the driver on the machine running the host software (not necessarily the machine running Questa). The line you quoted is only relevant if you have several TaPaSCo devices connected to the machine running your host software. If you are running on a machine without any FPGA cards connected via PCIe, your host software itself does not change for the simulation.

gulmezmerve commented 3 months ago

Thanks for the reply!

My host machine actually has FPGA cards. How do I select the simulator as the device to connect to? I am confused about how my host application can know that it should connect to the simulator and not to the FPGA itself.

yannickl96 commented 3 months ago

When you do ls -l /dev/ | grep tlkm, you should get several results of the form tlkm_XX, where XX is some number. The simulation device is the one with the highest number. You need to pass this number to the Tapasco constructor in your host application. Another possibility is to use:

libtapasco_tests status

This command will print information for all devices, such as the PEs in your design, where you can again check for the highest device ID.
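
A minimal sketch of that device selection is below; the constructor parameters (access mode first, then device ID) are an assumption here, so please verify them against the tapasco.hpp of your TaPaSCo version:

#include <tapasco.hpp>

int main() {
    // Hypothetical: device 2 is the simulation device, i.e. the highest
    // tlkm_XX number reported by ls -l /dev/ | grep tlkm.
    const int sim_device_id = 2;
    // Assumed constructor signature; check tapasco.hpp for the real API.
    tapasco::Tapasco tapasco(tapasco::tlkm_access::TlkmAccessExclusive,
                             sim_device_id);
    // ... allocate buffers and launch jobs as usual ...
    return 0;
}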

gulmezmerve commented 2 months ago

I couldn't get it working. I decided not to pursue TaPaSCo for now.

Thanks for all the replies!