SpinalHDL / VexiiRiscv

A few questions about VexiiRiscv L1 caches and Tilelink #31

Open petrmikheev opened 1 month ago

petrmikheev commented 1 month ago

Trying to migrate my hobby project SoC from VexRiscv to VexiiRiscv. I suspect there is a bug in FetchL1Plugin:

After fetching the first few instructions, FetchL1Plugin produces unknown ('xxxx) output if the cache memory was not zero-initialized.

Simulation in iverilog (waveform screenshot attached).

After a few clock cycles the 'xxxx propagates to registers in PcPlugin and the simulation gets stuck.

But if I either set bootMemClear = true or disable cache (fetchL1Enable = false) in ParamSimple, it works fine.
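
For reference, a minimal sketch of that workaround (the field names are taken from this thread, not checked against the current ParamSimple API):

// Sketch only: clear the memories at boot so a 4-state simulator like iverilog
// never sees uninitialized ('xxxx) cache content.
val param = new ParamSimple()
param.bootMemClear = true
// or, alternatively, disable the fetch L1 cache entirely:
// param.fetchL1Enable = false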

Dolu1990 commented 4 weeks ago

Hi,

bootMemClear is the intended way to avoid those x-prop issues. Overall, as far as I have looked into the x-prop happening in VexiiRiscv (with bootMemClear = false), they all look like false bugs, only induced by x-prop creating impossible cases. In other words, x-prop creates simulation-only bugs.

All the tests I run VexiiRiscv through use Verilator, which, instead of x-prop, uses only 0 and 1 and randomizes all the values at init.

I kinda don't feel great about enabling bootMemClear by default, because it consumes hardware resources and hurts timings. Maybe instead I should put a big warning message at the very end of the VexiiRiscv hardware generation to warn about it?

Trying to migrate my hobby project SoC from VexRiscv to VexiiRiscv

Nice ^^ What SoC / memory bus are you working with?

petrmikheev commented 4 weeks ago

In other words, x-prop creates simulation-only bugs. All the tests I run VexiiRiscv through use Verilator, which, instead of x-prop, uses only 0 and 1 and randomizes all the values at init.

Thanks! Makes perfect sense then.

Maybe instead I should put a big warning message at the very end of the VexiiRiscv hardware generation to warn about it?

It could be enough to mention it in the documentation, somewhere near the "Run a simulation" section. It would have saved me some time.

What SoC / memory bus are you working with?

Previously I used Axi4, now I am trying Tilelink. The SpinalHDL components for Tilelink look very nice, but it is not easy to follow all the magic that happens underneath. For example, I am struggling to grasp the meaning of

// from vexiiriscv/soc/litex/Soc.scala
vexii.lsuL1Bus.setDownConnection(a = withCoherency.mux(StreamPipe.HALF, StreamPipe.FULL), b = StreamPipe.HALF_KEEP, c = StreamPipe.FULL, d = StreamPipe.M2S_KEEP, e = StreamPipe.HALF)
vexii.dBus.setDownConnection(a = StreamPipe.HALF, d = StreamPipe.M2S_KEEP)

and understand why it should be written exactly this way.

By the way, what is the best place to ask questions? Here? Or maybe in SpinalHDL google group?

Dolu1990 commented 4 weeks ago

It could be enough to mention it in the documentation, somewhere near the "Run a simulation" section. It would have saved me some time.

Added :)

but it is not easy to follow all the magic that happens underneath

Yes, I really need to provide a better tutorial on how to use the tilelink API. I got a bit of funding to do it and should start very soon.

vexii.lsuL1Bus.setDownConnection(a = withCoherency.mux(StreamPipe.HALF, StreamPipe.FULL), b = StreamPipe.HALF_KEEP, c = StreamPipe.FULL, d = StreamPipe.M2S_KEEP, e = StreamPipe.HALF)

So up means connections toward the masters, down means connections toward the slaves: cpu -> .. -> up -> lsuL1Bus -> down -> .. -> peripherals

setDownConnection is a way to ask the interconnect to add some pipelining stages between lsuL1Bus and the address decoder which will serve the down connections.

a,b,c,d,e refer to the 5 tilelink channels.

StreamPipe.HALF is a lightweight pipeline stage which fully cuts all combinatorial paths. StreamPipe.FULL is a "heavyweight" one which also cuts all combinatorial paths. StreamPipe.M2S_KEEP is a lightweight one which cuts the valid + data paths, but not the ready path.

So all of this is related to timing optimization / synthesis results.
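
As a rough illustration of what those policies correspond to on a plain SpinalHDL Stream (this is just the underlying Stream operators, not the interconnect code itself):

import spinal.core._
import spinal.lib._

// Illustration only: the StreamPipe policies roughly map onto these Stream operators.
class PipeDemo extends Component {
  val io = new Bundle {
    val up   = slave(Stream(UInt(8 bits)))
    val down = master(Stream(UInt(8 bits)))
  }
  // HALF     ~ halfPipe(): registers valid/ready/payload, at most 1 transfer every 2 cycles
  // FULL     ~ m2sPipe().s2mPipe(): cuts every combinatorial path at full throughput
  // M2S_KEEP ~ m2sPipe(): registers valid + payload, the ready path stays combinatorial
  io.down << io.up.halfPipe()
}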

By the way, what is the best place to ask questions? Here? Or maybe in SpinalHDL google group?

Github issues are good, this repo is good as well ^^

petrmikheev commented 3 weeks ago

I have a few probably stupid questions.

setDownConnection is a way to ask the interconnect to add some pipelining stages between lsuL1Bus and the address decoder which will serve the down connections.

Do I understand right that lsuL1Bus is the bus from the cache to memory, and when the cache is enabled dBus is used only for non-cacheable regions? I.e.

     ->  fetch L1 cache -> iBus -> memory
cpu  ->  LSU L1 cache -> lsuL1Bus -> memory
     ->              dBus       -> IO regions

StreamPipe.HALF is a lightweight pipeline stage which fully cuts all combinatorial paths. StreamPipe.FULL is a "heavyweight" one which also cuts all combinatorial paths. StreamPipe.M2S_KEEP is a lightweight one which cuts the valid + data paths, but not the ready path.

In https://spinalhdl.github.io/SpinalDoc-RTD/master/SpinalHDL/Libraries/stream.html it is written that after halfPipe() the bandwidth is divided by two. A half pipe on channel A should then mean that during write operations it can send a beat of data only every second clock cycle, and the bus works at half the speed it could. It seems I am missing something... Do all writes maybe go through channel C rather than A?

Also I cannot figure out what param.lsuL1Coherency actually changes. Let's take several use cases:

1. A Tilelink bus with 2 masters (a. a VexiiRiscv core with cache; b. a video controller) and one slave (memory). The video controller makes only READ requests, but should trigger writeback if the requested data is cached and modified by the core.
2. Same as before, but add a DMA controller to the same bus. It does both READ and WRITE operations (should trigger writeback & invalidate in the VexiiRiscv cache), but the DMA controller doesn't have its own cache.
3. Add to the same bus a second VexiiRiscv core that also has a cache.

Do all the use cases require lsuL1Coherency to be enabled? Or only the last one?

My current use case is the first one. I have a video controller with an Axi4ReadOnly interface and I am trying to figure out how to connect it to tilelink and make sure that the CPU will write back cached data when needed.

petrmikheev commented 3 weeks ago

By the way, what is the best place to ask questions? Here? Or maybe in SpinalHDL google group?

Github issues are good, this repo is good as well ^^

I renamed the issue to better match the content :-)

Dolu1990 commented 3 weeks ago

Hi,

Do I understand right that lsuL1Bus is the bus from the cache to memory, and when the cache is enabled dBus is used only for non-cacheable regions?

Yes

Do all writes maybe go through channel C rather than A?

Yes, right, for coherent caches. They send "acquire" requests on channel A (single beat), receive "acquire" responses on channel D (multi beat), and "release" dirty/clean cache lines on channel C.

Tilelink bus with 2 masters (a. VexiiRiscv core with cache; b. video controller) and one slave (memory). Video controller makes only READ requests, but should trigger writeback if the requested data is cached and modified by the core.

Right

Same as before, but add DMA controller to the same bus. It does both READ and WRITE operations (should trigger writeback & invalidate in VexiiRiscv cache), but DMA controller doesn't have its own cache.

Right

Do all the use cases require lsuL1Coherency to be enabled? Or only the last one?

All cases. As soon as you have multiple masters (cpu, dma), and a given CPU has a data cache, then that CPU needs lsuL1Coherency (or it will need to do cache flush/invalidate in software).
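
In ParamSimple terms this presumably boils down to something like the following (sketch only, the field name is taken from the question above):

// Sketch: let the LSU L1 participate in coherency, so the hub/L2 can probe it
// (this is what makes the b, c, e channels appear on lsuL1Bus).
param.lsuL1Coherency = true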

My current use case is the first one. I have a video controller with an Axi4ReadOnly interface and I am trying to figure out how to connect it to tilelink and make sure that the CPU will write back cached data when needed.

One reference for that is the litex SoC.scala.

What matters is that the CPU / DMA buses are merged together and then go into either a CacheFiber (L2) or a HubFiber (no L2). The CacheFiber / HubFiber will handle all the coherency channels (a,b,c,d,e) and provide to the memory system (DDR) only the regular channels to do get/put (a,d).
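
A rough sketch of that topology (the node names here are placeholders, not the exact litex Soc.scala code):

import spinal.lib.bus.tilelink

// CPU and DMA masters are merged on the hub's up side; the hub resolves the
// coherency channels (a..e) there and only exposes get/put (a,d) downward.
val cpuMem = tilelink.fabric.Node()   // e.g. fed by vexii.lsuL1Bus / iBus
val dma    = tilelink.fabric.Node()   // e.g. the video controller
val toDdr  = tilelink.fabric.Node()   // toward the memory controller

val hub = new tilelink.coherent.HubFiber()
hub.up << cpuMem
hub.up << dma
toDdr << hub.down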

petrmikheev commented 3 weeks ago

Thanks for the answers!

(or it will need to do cache flush/invalidate in software)

By the way, does VexiiRiscv support any cache flush instructions? I've seen only fence.i and sfence.vma, but they are not about flushing particular cache lines.

And how does the fetch L1 interact with the LSU L1 if lsuL1Coherency is disabled? Does fence.i flush the whole LSU L1 to ensure that the iBus will see correct data?

What matters is that the CPU / DMA buses are merged together and then go into either a CacheFiber (L2) or a HubFiber (no L2).

Ah! That's what I missed! If DMA uses only A,D channels and only simple get/put requests, will CacheFiber/HubFiber automatically generate flush/invalidate requests to CPU?

Dolu1990 commented 3 weeks ago

By the way, does VexiiRiscv support any cache flush instructions? I've seen only fence.i and sfence.vma, but they are not about flushing particular cache lines.

Currently, there is nothing to flush particular cache lines. Ideally, CMO would need to be implemented to allow VexiiRiscv usage without memory coherency. Todo ^^

Does fence.i flush the whole LSU L1 to ensure that the iBus will see correct data?

Yes

If DMA uses only A,D channels and only simple get/put requests, will CacheFiber/HubFiber automatically generate flush/invalidate requests to CPU?

Yes

Dolu1990 commented 3 weeks ago

The HubFiber will generate flush / invalidate for every request, while the CacheFiber has a dictionary which tracks which cache has what, and so will only generate useful flush / invalidate requests.

petrmikheev commented 3 weeks ago

Ideally, CMO would need to be implemented to allow VexiiRiscv usage without memory coherency. Todo ^^

CMO would be great! I hope it will also make it possible to avoid useless read-before-write requests to memory - for example when memcpy is going to fill a whole cache line with new data, but the CPU decides to read this cache line from memory first.

The HubFiber will generate flush / invalidate for every request, while the CacheFiber has a dictionary which tracks which cache has what, and so will only generate useful flush / invalidate requests.

Interesting. Does it mean that if I use HubFiber it can make sense to connect iBus around it to reduce the number of invalidate requests?

                DMA  ------------------> TilelinkHub -\
    / ->  lsu L1 cache   -> lsuL1Bus -/                >--- memory
CPU | ->  fetch L1 cache -> iBus ---------------------/
    \ --------------------> dBus ----> IO

Or does fence.i flush the whole LSU L1 only if coherency is off?

Dolu1990 commented 3 weeks ago

I hope it will also make it possible to avoid useless read-before-write requests to memory - for example when memcpy is going to fill a whole cache line with new data, but the CPU decides to read this cache line from memory first.

Ahhh, that is because currently the cache is implemented as "write-back". So the cache line will always be read, even if the CPU only writes into it. It won't go away.

Write-through caches don't have that issue, but instead they spam the memory system with small write requests.

Interesting. Does it mean that if I use HubFiber it can make sense to connect iBus around it to reduce the number of invalidate requests?

Hmmm, if you have multiple CPUs, you would then need the other CPUs to manually flush themselves if they modify code sections. But otherwise, yes.

fence.i always flushes the whole D$ + I$ in both cases. Fundamentally, when coherency is on, it wouldn't be necessary to flush the D$.

petrmikheev commented 3 weeks ago

So the cache line will always be read, even if the CPU only writes into it. It won't go away.

I mean that there could be a special instruction that makes the L1 cache think that the cache line is already loaded, without an actual read. I've looked through the CMO spec and see that it doesn't have it. But it could be a custom instruction.

char *src = ..., *dst = ...;
for (int i = 0; i < size / 64; ++i) {
  fakeReadCacheLine(dst);  // custom instruction
  for (int j = 0; j < 64; j += 4) *(int*)(dst + j) = *(int*)(src + j);
  src += 64, dst += 64;
}

Dolu1990 commented 3 weeks ago

Ahhhhh lol, I didn't know about it. Yes, there could be a custom instruction for that.

petrmikheev commented 3 weeks ago

I am stuck. Could you please help me figure out why it doesn't work?

I want to connect a blackbox to a tilelink bus. It will use only 256-byte read requests.

class VideoController extends BlackBox {
  val io = new Bundle {
    ...
    val tl_bus = master(tilelink.Bus(tilelink.BusParameter(
      addressWidth = 32,
      dataWidth    = 64,
      sizeBytes    = 256,
      sourceWidth  = 0,
      sinkWidth    = 0,
      withBCE      = false,
      withDataA    = false,
      withDataB    = false,
      withDataC    = false,
      withDataD    = true,
      node         = null
    )))
  }
  noIoPrefix()
  mapClockDomain(clock=io.clk, reset=io.reset)
}

  val coherent_bus = tilelink.fabric.Node().forceDataWidth(64)
  val mem_bus = tilelink.fabric.Node().forceDataWidth(64)

  val tilelink_hub = new tilelink.coherent.HubFiber()
  tilelink_hub.up << coherent_bus
  mem_bus << tilelink_hub.down

  val video_ctrl = new VideoController()
  val video_bus = tilelink.fabric.Node.down()
  val video_bus_fiber = fiber.Fiber build new Area {
    video_bus.m2s forceParameters tilelink.M2sParameters(
      addressWidth = 32,
      dataWidth = 64,
      masters = List(
        tilelink.M2sAgent(
          name = video_ctrl,
          mapping = List(
            tilelink.M2sSource(
              id = SizeMapping(0, 1),
              emits = tilelink.M2sTransfers(get = tilelink.SizeRange(256, 256))
            )
          )
        )
      )
    )
    video_bus.s2m.supported load tilelink.S2mSupport.none()
    video_bus.bus << video_ctrl.io.tl_bus
  }
  coherent_bus << video_bus  // this line causes error

The error (the first one, there are many of them) is

[error]  Error detected in phase PhaseCreateComponent
[error] ********************************************************************************
[error] ********************************************************************************
[error] 
[error] Bundle assignment is not complete. toplevel/video_bus_noDecoder_toDown_d_payload : ChannelD need 'data' but toplevel/video_bus_to_coherent_bus_up_bus_d_payload : ChannelD doesn't provide it.
[error]     spinal.lib.Stream.connectFrom(Stream.scala:317)
[error]     spinal.lib.Stream.$less$less(Stream.scala:117)
[error]     spinal.lib.Stream.$greater$greater(Stream.scala:122)
[error]     spinal.lib.bus.tilelink.Bus.$less$less(Bus.scala:418)
[error]     spinal.lib.bus.tilelink.fabric.Node$$anon$2$$anon$6.<init>(Node.scala:213)
[error]     spinal.lib.bus.tilelink.fabric.Node$$anon$2.$anonfun$noDecoder$1(Node.scala:207)
[error]     spinal.lib.bus.tilelink.fabric.Node$$anon$2.<init>(Node.scala:207)
[error]     spinal.lib.bus.tilelink.fabric.Node.$anonfun$thread$1(Node.scala:77)

I used https://spinalhdl.github.io/SpinalDoc-RTD/master/SpinalHDL/Libraries/Bus/tilelink/tilelink_fabric.html#example-cpufiber as an example.

Dolu1990 commented 3 weeks ago

Ahhhh, I think the reason is that the memory coherency hub / L2 can't handle memory requests bigger than one cache line. So no memory request of more than 64 bytes is allowed.

So either :

petrmikheev commented 3 weeks ago

Yes, I changed it to 64 and now it successfully generates Verilog! Thanks!
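
Concretely, the change was roughly this (the two places where 256 appeared in the snippet above):

// In the BlackBox bus parameters:
sizeBytes = 64   // was 256: at most one cache line per request
// and in the M2sTransfers the DMA declares:
emits = tilelink.M2sTransfers(get = tilelink.SizeRange(64, 64))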

By the way, is it possible to generate a more explanatory error message in such a situation? I have no idea how I would have debugged it without your help.

Dolu1990 commented 3 weeks ago

I will take a look tomorrow ^^

The situation right now is that the Hub filters memory accesses to only allow the compatible ones, which creates a video DMA with no memory access at all => no read / no write => withDataD = false.

petrmikheev commented 3 weeks ago

A bridge would be needed to split 256-byte memory accesses into 64-byte ones, but overall, I think the best is really to stick to "no more than 64 bytes". That allows streamlining a lot of things.

The only reason to use big requests was that in this case memory throughput is a bit higher. But it is not that significant.

Dolu1990 commented 3 weeks ago

One thing: I recently updated the documentation, in particular around the MicroSoc. You may find interesting things there, in particular: https://spinalhdl.github.io/VexiiRiscv-RTD/master/VexiiRiscv/Soc/microsoc.html#adding-a-custom-peripheral

I recently added a tilelink video DMA, see: https://github.com/SpinalHDL/VexiiRiscv/blob/dev/src/main/scala/vexiiriscv/soc/litex/Soc.scala#L152

Good luck :D

petrmikheev commented 3 weeks ago

Thanks!

I used both MicroSoc and litex SoC as examples.

Things like demo.node at 0x10003000 of bus32 look like magic. It is really nice and convenient. But when I do something wrong and the magic breaks, it is sometimes hard to dig into the details. So for now my preferred approach is to write custom modules in Verilog and connect them together using SpinalHDL.

Dolu1990 commented 3 weeks ago

But when I do something wrong and the magic breaks, it is sometimes hard to dig into the details

Yes, I agree. Error reporting isn't great when it goes into the Fiber / Tilelink stuff.

For the raw SpinalHDL things using components and so on, it should be good :D

petrmikheev commented 3 weeks ago

fence.i always flushes the whole D$ + I$ in both cases.

Actually fence.i doesn't flush the D$, at least if coherency is enabled. I checked experimentally. So I have to connect the iBus to the tilelink hub rather than to memory directly.

Dolu1990 commented 3 weeks ago

Ahhh right, I just checked the code again: when coherency is enabled, the LsuPlugin will instead just wait until the store buffer is drained.

    val coherentFlusher = l1.coherency.get generate new Area{
      // with coherency on, fence.i only waits for the store buffer to drain
      invalidationPorts.foreach(_.cmd.ready := withStoreBuffer.mux(storeBuffer.empty, True))
    }

petrmikheev commented 5 days ago

Hi! Sometimes I have deadlocks somewhere in my SoC. I suppose they are caused by attempts to read or write invalid memory regions that are not connected to any slave interface. The problem is that I don't have a way to debug it: in my setup there is no integration with standard tools and no easy way to do things like a logic analyzer over JTAG. And the scenarios are too complicated for simulation (e.g. booting Linux and running gcc on the device).

Is there a way to add to the tilelink hub a default slave interface that will be used if the address is not mapped to anything else? I want to add a module that catches all invalid requests and sends debug information via UART.

Dolu1990 commented 4 days ago

Hi,

Yes, it is by design. The memory masters have ways to figure out the exact memory mapping at hardware elaboration. So the idea is that the masters check it in hardware, and only issue legal memory requests.

So, what is done for DMAs which don't integrate those checks is to directly add a spinal.lib.bus.tilelink.fabric.TransferFilter on their DMA bus, as in: https://github.com/SpinalHDL/VexiiRiscv/blob/9f067d3befb86c33c31c02635aa0f7f11b0bf26a/src/main/scala/vexiiriscv/soc/litex/Soc.scala#L233

Is there a way to add to the tilelink hub a default slave interface that will be used if the address is not mapped to anything else?

Hmm, that TransferFilter behaves like a default slave. The difference is that it intercepts the traffic, like a bridge (instead of behaving like a regular peripheral).
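
A rough sketch of how it could be dropped onto the DMA path (assuming TransferFilter exposes up/down nodes like the other fabric bridges; the linked Soc.scala shows the exact usage):

import spinal.lib.bus.tilelink

// Requests that fall outside the decoded mapping (or outside the supported
// transfer kinds) get an error response instead of hanging the interconnect;
// legal requests pass through untouched.
val dmaFilter = new tilelink.fabric.TransferFilter()
dmaFilter.up << video_bus          // the DMA master node from earlier
coherent_bus << dmaFilter.down     // then onward to the hub as before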