SpinalHDL / VexRiscv

A FPGA friendly 32 bit RISC-V CPU implementation
MIT License

Data Stream in/out SoC <-> FPGA #391

Open lk-davidegironi opened 9 months ago

lk-davidegironi commented 9 months ago

Hello,

I'm building a softcore based on Briey. I have the AxiCrossbar without SDRAM, and on the APB3 bridge I have 1 Timer, 1 UartCtrl and 1 Gpio. On the APB3 bus I would like to add a custom Ctrl to stream data in and out. I have to transfer 32x24-bit data in and 32x16-bit data out, and do some simple (for now) math inside the SoC on that stream.

I've tried using a Vec of Bits, and now I'm trying Axi4Stream, based on the code here: https://github.com/SpinalHDL/VexRiscv/pull/53 With both options (Vec and Axi4Stream) I have a couple of issues. Before moving on to them, I would like to ask which one would be the better option.

As far as I understand, using Vec means using more LUTs because all the signals have to be synthesized, but that way I can read and drive the signals in a few clock cycles. Using the AXI Stream, on the other hand, is a bit slower but uses fewer LUTs. Am I right?

Depending on your suggestion I'll ask questions, thanks!

Dolu1990 commented 9 months ago

Hi,

As far as I understand, using Vec means using more LUTs because all the signals have to be synthesized

Yes right, that's not the way to go.

Using the AXI Stream, on the other hand, is a bit slower but uses fewer LUTs. Am I right?

Yes, serializing things is a better tradeoff I think, then eventually buffering things into a Mem (RAM) for later reuse.

lk-davidegironi commented 9 months ago

Thank you @Dolu1990. So, going with streaming, I'm using code taken from here: https://github.com/SpinalHDL/VexRiscv/pull/53 I'm trying, without success so far, to make it work with readStreamNonBlocking. I have to investigate the valid and ready signals further. I'll keep you updated. Any further help is appreciated.

Apb3Axis is connected in the main SoC Scala code. The Apb3Axis class looks like this:

package lk.lib

import spinal.core._
import spinal.lib._
import spinal.lib.bus.amba3.apb.{Apb3, Apb3Config, Apb3SlaveFactory}

case class Apb3Axis(apb3Config: Apb3Config) extends Component {
  val io = new Bundle {
    val apb = slave(Apb3(apb3Config))
    val input = slave(Stream(Bits(32 bits)))
    val output = master(Stream(Bits(32 bits)))
  }

  val ctrl = Apb3SlaveFactory(io.apb)

  // input stream via readStreamNonBlocking is not working; the commented-out StreamFifo code below does work
  ctrl.readStreamNonBlocking(io.input.queue(128), address = 0)
  //val ififo = StreamFifo(dataType = Bits(32 bits), depth = 128)
  //ififo.io.push << io.input
  //ctrl.read(ififo.io.pop.payload, address = 0);
  //val ififoPopReady = ctrl.drive(ififo.io.pop.ready, address = 4)
  //ctrl.read(ififo.io.pop.valid, address = 8);
  //when(ififo.io.pop.valid) { ififoPopReady := False }

  val wordCount = (1 + widthOf(io.input.payload) - 1) / 32 + 1
  val wordAddressInc = 32 / 8
  val addressHigh = 0 + (2 - 1) * wordAddressInc
  SpinalInfo("Wordcount: " + wordCount)
  SpinalInfo("addressHigh: " + addressHigh)

  // output stream uses a StreamFifo for now, but should be converted to createAndDriveFlow
  val ofifo = StreamFifo(dataType = Bits(32 bits), depth = 128)
  ofifo.io.pop >> io.output
  ctrl.drive(ofifo.io.push.payload, address = 12)
  val ofifoPushValid = ctrl.drive(ofifo.io.push.valid, address = 16)
  ctrl.read(ofifo.io.push.ready, address = 20)
  when(ofifo.io.push.ready) { ofifoPushValid := False }
  //val writeFlow = ctrl.createAndDriveFlow(Bits(32 bits), address = 0)
  //writeFlow.toStream.stage() >> ofifo.io.push

}

The main C sample code is below:


typedef struct
{
  volatile uint32_t IN_DATA;
  volatile uint32_t IN_READY;
  volatile uint32_t IN_VALID;
  volatile uint32_t OUT_DATA;
  volatile uint32_t OUT_VALID;
  volatile uint32_t OUT_READY;
} AXIS_Reg;
#define AXIS ((AXIS_Reg *)(0xF0060000))

// inside the main function
while (1)
    {
        while (AXIS->IN_VALID == 0)
        {
            asm volatile("");
        }
        AXIS->OUT_DATA = 3 + AXIS->IN_DATA;
        AXIS->OUT_VALID = 0xFFFF;
        while (AXIS->OUT_VALID != 0)
        {
            asm volatile("");
        }
        AXIS->IN_READY = 0xFFFF;
        while (AXIS->IN_READY != 0)
        {
            asm volatile("");
        }
    }

Then, in the synthesized Verilog top module, I'm looking with the analyzer at axis_input_payload and axis_output_payload. They work (output is input + 3 at each tick) if the input stream is implemented using StreamFifo (the commented-out code in Apb3Axis), but not when using readStreamNonBlocking.


 //tick_1s_tick ticks every 1 second

  reg axis_input_valid;
  wire axis_input_ready;
  reg [31:0] axis_input_payload;
  wire axis_output_valid;
  reg axis_output_ready;
  wire [31:0] axis_output_payload;

    always @(posedge clk)
    begin
        if(tick_1s_tick)
        begin
            axis_input_valid <= 1'b1;
            axis_input_payload <= samplePayload;
        end
        else
        begin
            axis_input_valid <= 1'b0;
        end

        if(axis_output_valid)
        begin
            axis_output_ready <= 1'b1;
        end
        else
        begin
            axis_output_ready <= 1'b0;
        end

    end

    Soc Soc_inst(
        // all the other signals
        .io_axis_input_valid(axis_input_valid),
        .io_axis_input_ready(axis_input_ready),
        .io_axis_input_payload(axis_input_payload),
        .io_axis_output_valid(axis_output_valid),
        .io_axis_output_ready(axis_output_ready),
        .io_axis_output_payload(axis_output_payload)
    );

lk-davidegironi commented 9 months ago

I was able to make it work, but I have a performance problem.

First, here is how I made it work using readStreamNonBlocking and createAndDriveFlow with a 31-bit payload.

Apb3Axis now looks like this:

case class Apb3Axis(apb3Config: Apb3Config) extends Component {
  val io = new Bundle {
    val apb = slave(Apb3(apb3Config))
    val input = slave(Stream(Bits(31 bits)))
    val output = master(Stream(Bits(31 bits)))
  }

  val busCtrl = Apb3SlaveFactory(io.apb)

  val ioinputqueue = io.input.queueLowLatency(128)
  busCtrl.readStreamNonBlocking(
    ioinputqueue,
    address = 0,
    validBitOffset = 31,
    payloadBitOffset = 0
  )

  val ofifo = StreamFifoLowLatency(dataType = Bits(31 bits), depth = 128)
  ofifo.io.pop >> io.output
  val writeFlow = busCtrl.createAndDriveFlow(Bits(31 bits), address = 4)
  writeFlow.toStream.stage() >> ofifo.io.push
}

Software side (roughly as below). Note that I'm using a Gpio output to check the timing on the analyzer.

typedef struct
{
  volatile uint32_t IN_PAYLOAD;
  volatile uint32_t OUT_PAYLOAD;
} AXIS_Reg;
#define AXIS ((AXIS_Reg *)(0xF0060000))

#define IN_PAYLOAD_VALID_MASK 0x80000000
#define IN_PAYLOAD_VALID_SHIFT 31
#define IN_PAYLOAD_DATA_MASK 0x7FFFFFFF
#define IN_PAYLOAD_DATA_SHIFT 0

//
// in main function, main while loop
//
    while (1)
    {

        uint32_t payload = AXIS->IN_PAYLOAD;
        if (((payload & IN_PAYLOAD_VALID_MASK) >> IN_PAYLOAD_VALID_SHIFT) == 1) {
            uint32_t data = (payload & IN_PAYLOAD_DATA_MASK) >> IN_PAYLOAD_DATA_SHIFT;
            if (data == 10) {
                gpioA_setOutputBit(0);
                gpioA_clearOutputBit(0);
            }
            // AXIS->OUT_PAYLOAD = data;
        }
    }
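
The polling loop above handles one word per iteration. A minimal sketch of a variant that drains several words into a local buffer in one go is shown below. It assumes that each read of the input register which returns the valid bit set also consumes one FIFO entry, which is how readStreamNonBlocking is understood to behave here. The function name `axis_drain` and the buffer handling are hypothetical and not part of the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IN_PAYLOAD_VALID_MASK 0x80000000u
#define IN_PAYLOAD_DATA_MASK  0x7FFFFFFFu

/* Hypothetical helper: read words from the input register until the
 * valid bit is clear or the buffer is full. Returns the number of
 * words stored in buf. */
size_t axis_drain(volatile const uint32_t *in_reg, uint32_t *buf, size_t max_words)
{
    assert(in_reg != NULL && buf != NULL);
    size_t n = 0;
    while (n < max_words) {
        uint32_t word = *in_reg;               /* one bus read per word */
        if ((word & IN_PAYLOAD_VALID_MASK) == 0) {
            break;                             /* FIFO reported no data */
        }
        buf[n++] = word & IN_PAYLOAD_DATA_MASK;
    }
    return n;
}
```

On the target this would be called as `axis_drain(&AXIS->IN_PAYLOAD, buf, 32)`. Whether it helps with the timing discussed in this thread has not been measured.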

Verilog side

reg axis_input_valid;
  wire axis_input_ready;
  reg [30:0] axis_input_payload;
  wire axis_output_valid;
  reg axis_output_ready;
  wire [30:0] axis_output_payload;
  reg [30:0] axis_output_payload_reg;

  reg [30:0] signaldata_reg;
  reg [30:0] signalretdata_reg;
  initial  signaldata_reg = 0; 

    always @(posedge clk)
    begin
        // start a burst every 1 second (it will be at 10 kHz in the future)
        if(tick_1s_tick)
        begin
            signaldata_reg <= 1;
        end

        // send the 32 payload words while the input is ready
        if(signaldata_reg >= 1 && signaldata_reg <= 32 && axis_input_ready)
        begin
            signaldata_reg <= signaldata_reg + 1;
            axis_input_valid <= 1'b1;
            axis_input_payload <= signaldata_reg;
        end
        else
        begin
            axis_input_valid <= 1'b0;
        end

        // receiving output and moving to data
        axis_output_ready <= 1'b1;
        if(axis_output_valid)
        begin
            signalretdata_reg <= axis_output_payload;
        end
    end

   Soc Soc_inst(
        // all the other signals
        .io_axis_input_valid(axis_input_valid),
        .io_axis_input_ready(axis_input_ready),
        .io_axis_input_payload(axis_input_payload),
        .io_axis_output_valid(axis_output_valid),
        .io_axis_output_ready(axis_output_ready),
        .io_axis_output_payload(axis_output_payload)
    );

The code above also works with StreamFifo and .queue; I found no difference between those and the LowLatency variants. I'm running this code on a Briey-based SoC with a 72 MHz main clock on a Tang Primer 20K.

In the future I'm going to keep the 31-bit payload, with a command in the first 7 bits and the other 24 bits for data.
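
A minimal sketch of how the software side could split such a word, assuming the command sits in the upper 7 bits of the payload (bits 30..24), the data in bits 23..0, and the valid flag in bit 31 as set by validBitOffset = 31. All macro and function names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of the 32-bit word seen by the CPU:
 *   bit 31       valid flag
 *   bits 30..24  7-bit command
 *   bits 23..0   24-bit data */
#define AXIS_VALID_MASK 0x80000000u
#define AXIS_CMD_SHIFT  24
#define AXIS_CMD_MASK   0x7Fu
#define AXIS_DATA_MASK  0x00FFFFFFu

static inline int axis_is_valid(uint32_t word)
{
    return (word & AXIS_VALID_MASK) != 0;
}

static inline uint32_t axis_cmd(uint32_t word)
{
    return (word >> AXIS_CMD_SHIFT) & AXIS_CMD_MASK;
}

static inline uint32_t axis_data(uint32_t word)
{
    return word & AXIS_DATA_MASK;
}

/* Build the 31-bit payload to write back (the valid bit is not part of it). */
static inline uint32_t axis_pack(uint32_t cmd, uint32_t data)
{
    assert(cmd <= AXIS_CMD_MASK);
    assert(data <= AXIS_DATA_MASK);
    return (cmd << AXIS_CMD_SHIFT) | data;
}
```

For example, `axis_pack(10, 0x123456)` gives `0x0A123456`, and `axis_cmd` and `axis_data` recover 10 and `0x123456` from it.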

My problem is performance. I tried StreamFifo and .queue instead of queueLowLatency and StreamFifoLowLatency, but it makes no difference. I used the Gpio output to measure how long it takes for a word to be read (or sent), and reading a word seems to take many cycles. In the 1024-cycle capture below, the Gpio toggle happens at about cycle 530, for data number 10. That is roughly 53 cycles per word, so sending a 32-word payload will take almost 1700 cycles. That is too much for my requirements. I hope to send data at 10 kHz, which at 72 MHz leaves 7200 cycles per loop for the (simple) math in my software.
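
The arithmetic above can be written out as a small sketch. It only restates the figures quoted in this comment (cycle 530 for word 10, 32 words, 72 MHz, 10 kHz); the function names are hypothetical, and the per-word figure is an average that may include start-up effects.

```c
#include <assert.h>
#include <stdint.h>

/* Average cycles per word, given that word number word_index
 * was observed at cycle cycle_seen after the trigger. */
uint32_t cycles_per_word(uint32_t cycle_seen, uint32_t word_index)
{
    assert(word_index > 0);
    return cycle_seen / word_index;
}

/* Cycles available in one sample period. */
uint32_t cycles_per_period(uint32_t clk_hz, uint32_t sample_hz)
{
    assert(sample_hz > 0);
    return clk_hz / sample_hz;
}

/* Cycles left for computation after transferring `words` words.
 * A negative result means the transfer alone exceeds the budget. */
int32_t cycles_left(uint32_t budget, uint32_t per_word, uint32_t words)
{
    return (int32_t)budget - (int32_t)(per_word * words);
}
```

With these numbers, `cycles_per_word(530, 10)` is 53, `cycles_per_period(72000000, 10000)` is 7200, and `cycles_left(7200, 53, 32)` is 5504, counting only the input side.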

Using StreamFifo and .queue instead of queueLowLatency and StreamFifoLowLatency makes no difference. Running the SoC at 72 MHz or 12 MHz makes no difference either. The FPGA logic, Briey and the buses (AXI + APB3) all run in the same clock domain.

Am I missing something?

Note on the image: here I'm using a payload that contains a command in the first 7 bits and data in the other 24 bits.

Capture

Writing data back to the output ('AXIS->OUT_PAYLOAD = xxx') makes no difference in timing, which means SoC to FPGA is fast. It's just the input that is a little too slow for me.

Capture2

Thanks for the help.

Dolu1990 commented 9 months ago

Hi, looking at your simulation, it shows things from time 0, right? The thing is, the CPU will need a bit of time to reach the while loop.

lk-davidegironi commented 9 months ago

Thank you @Dolu1990. The analyzer is triggered on the tick_1s_tick edge, and it shows things from time 0. If you look at signalnum_reg you will see the 32 signals loaded into the StreamFifo in 32 cycles (zoomed in below). I'm testing the InterruptCtrl, but this will not make a difference for now, because in the while loop I'm always reading.

The other image zooms in on the timing of a load (uint32_t payload = AXIS->IN_PAYLOAD;) plus an unload (AXIS->OUT_PAYLOAD = data;). It's almost 90 cycles.

Maybe something involving DMA could help? Sorry for my dumbness, but I've only just entered the FPGA SoC world.

I know I'm asking a lot from this core. My plan is to make this work on VexRiscv (even at a slower speed), because I like this project (portable and customizable). Then, when I'm ready, I may move to a hard core (Xilinx ARM), changing the buses of course.

The Verilog below contains the actual payload content (signal number + integer), which should clarify things:

always @(posedge clk)
    begin
        if(tick_1s_tick)
        begin
            signalnum_reg <= 1;
        end
        if(signalnum_reg >= 1 && signalnum_reg <= 32)
        begin

            if(axis_input_ready)
            begin
                signalnum_reg <= signalnum_reg + 1;
                signaldata_reg <= signaldata_reg + 1;
            end

            axis_input_valid <= 1'b1;
            axis_input_payload <= {signalnum_reg, signaldata_reg};
        end
        else
        begin
            axis_input_valid <= 1'b0;
        end

        axis_output_ready <= 1'b1;
        if(axis_output_valid)
        begin
            signalretnum_reg <= axis_output_payload[30-:7];
            signalretdata_reg <= axis_output_payload[23:0];
        end

    end

Capture Capture2

Dolu1990 commented 9 months ago

Hmm, one thing to be careful about as well is that on the first attempt you will hit i$ / d$ refills, so to take measurements you really have to run the code more than once and then measure the last execution.

Are your pictures from the very first execution? Was your code compiled with -O3?
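
A minimal sketch of that measurement approach: run the same code several times and keep only the cycle count of the last run, so that earlier runs can fill the caches. Everything here is hypothetical. `measure_last_run` is an illustrative helper, and the `fake_*` functions are host-side stand-ins with arbitrary numbers, not measurements. On the target, the counter would be replaced by whatever cycle source the SoC provides.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t (*cycle_reader_t)(void); /* returns a free-running cycle count */
typedef void (*workload_t)(void);         /* the code being timed */

/* Run `work` `runs` times and return the cycle cost of the last run only. */
uint32_t measure_last_run(workload_t work, cycle_reader_t now, unsigned runs)
{
    assert(runs > 0);
    uint32_t elapsed = 0;
    for (unsigned i = 0; i < runs; i++) {
        uint32_t start = now();
        work();
        elapsed = now() - start; /* unsigned subtraction tolerates one wrap */
    }
    return elapsed;
}

/* Host-side stand-ins (arbitrary values): the first call is "cold"
 * and costs more than the following "warm" calls. */
static uint32_t fake_cycles = 0;
static unsigned fake_calls = 0;

uint32_t fake_now(void)
{
    return fake_cycles;
}

void fake_work(void)
{
    fake_cycles += (fake_calls++ == 0) ? 1000u : 100u;
}
```

With these stand-ins, a single run reports the cold cost, while later runs report only the warm cost.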