SpinalHDL / NaxRiscv


Nax not trap on store fault #116

Status: Open · atkarim opened this issue 1 month ago

atkarim commented 1 month ago

Hi Charles,

To follow up on the discussion in this pull request #115

As I understand it, when the processor is in bare-metal mode and makes a store to a non-peripheral address, the request will pass through the cache.

If this address is not mapped in memory, the TileLink bus will return an error to the cache during the refill. The cache will then store the loaded line with its fault tag set to true.

But during the refill, the LSU does not trigger a trap, and the store instruction is committed. Moreover, the store will be written into the cache, even though the line is faulty.

In your opinion, isn't this behaviour a functional bug? I understand that the store instruction won't have any architectural effect, because of the faulty cache line. But committing a store instruction to a non-existent address troubles me, and it impacts cache performance by loading a line for nothing.

Would it be possible to solve this by adding a memRegion argument to the MMU?

This argument could be used to trigger a trap when the address is outside the range while translation is not active, and also to abort the cache request via the translated.abort signal.

We could reuse this line, which already exists in the code: https://github.com/SpinalHDL/NaxRiscv/blob/ba63ee6dfb063e0e6b1c8da51071a85dea9f934b/src/main/scala/naxriscv/misc/MmuPlugin.scala#L327
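
To make the intent concrete, here is a minimal SpinalHDL sketch of the check I have in mind: compare the physical address against a memRegion when translation is inactive, and use the result to abort the cache request and raise an access fault. The component and signal names (BareMetalRegionCheck, translationActive, accessFault) are made up for illustration; this is not the actual MmuPlugin interface.

```scala
import spinal.core._
import spinal.lib.bus.misc.SizeMapping

// memRegion: the cacheable RAM region; anything outside it is rejected
// when the MMU is bypassed (bare-metal / no translation).
case class BareMetalRegionCheck(memRegion: SizeMapping) extends Component {
  val io = new Bundle {
    val physicalAddress   = in  UInt (32 bits)
    val translationActive = in  Bool()  // requireMmuLockup-like condition
    val abort             = out Bool()  // would drive something like translated.abort
    val accessFault       = out Bool()  // would feed a store/load access fault trap
  }

  val outsideMem = !memRegion.hit(io.physicalAddress)

  // Only check when translation is not active
  io.abort       := !io.translationActive && outsideMem
  io.accessFault := io.abort
}
```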

Moreover, would this change make store-to-load forwarding possible? (#43)

Perhaps I missed a corner case that makes this modification impossible; thank you in advance for your feedback!

Dolu1990 commented 1 month ago

> As I understand it, when the processor is in bare-metal mode and makes a store to a non-peripheral address, the request will pass through the cache.

It depends what you mean by "pass through the cache".

In any privilege mode, all non-IO memory stores go to the cache. It's just that if the cache line had a fault when it was loaded into the L1 cache, the fault is ignored (it will not raise a CPU store access fault exception).

> If this address is not mapped in memory, the TileLink bus will return an error to the cache during the refill. The cache will then store the loaded line with its fault tag set to true.

Yes

> But during the refill, the LSU does not trigger a trap, and the store instruction is committed. Moreover, the store will be written into the cache, even though the line is faulty.

Right. If you look in DataCache.scala: `io.store.rsp.fault := False //TODO`

XD So it will be written. I'm not sure if the L1 d$ will write the faulty line back to memory when it needs to make room.
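
For reference, a toy sketch of what replacing that TODO could look like, assuming a per-way fault tag is readable at store time. The port names (wayHits, wayFaults) are invented for illustration and are not the actual DataCache.scala internals.

```scala
import spinal.core._

// Toy model only: report a store fault when the way that hits is tagged
// as faulty, instead of hardwiring io.store.rsp.fault := False.
case class StoreFaultSketch(wayCount: Int) extends Component {
  val io = new Bundle {
    val wayHits    = in  Bits (wayCount bits) // one-hot hit per way (hypothetical)
    val wayFaults  = in  Bits (wayCount bits) // fault tag per way (hypothetical)
    val storeFault = out Bool()               // would drive io.store.rsp.fault
  }
  io.storeFault := (io.wayHits & io.wayFaults).orR
}
```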

> In your opinion, isn't this behaviour a functional bug?

I don't know. The fact that it doesn't raise an exception on a faulty store => not a bug. The fact that the store still happens => Hmm, not really a bug, more like a lazy implementation which does not create side effects (=> OK).

> But committing a store instruction to a non-existent address troubles me, and it impacts cache performance by loading a line for nothing.

Hmm, yes and no, it depends. If the user defines the ioRange as everything which isn't cached (including non-existent regions), then the L1 cache will always be safe from non-existent regions. But VexiiRiscv handles this more cleanly by having explicitly two regions: ioRegion and l1Regions. So yeah, in NaxRiscv that isn't necessarily great, but still, defining everything non-cached as part of the ioRange would work well I think.

> Would it be possible to solve this by adding a memRegion argument to the MMU?

Ahh right, a bit like Vexii ^^ But the idea is to not integrate it into the MMU, to allow things to stay decoupled: MMU => memory translation; PMP + region check => on the physical address after the MMU. So overall those region checks should be done in the LSU plugin itself.

I know I integrated it into the MMU, but this isn't really right. The reason is that currently there are 3 buses on Nax: i$ / d$ / peripheral. And the thing is that each of those buses can access its own memory interconnect with a different memory mapping, which implies that each of those buses needs its own definition of "what is accessible".
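
Purely as an illustration of that last point (not existing NaxRiscv code), the "what is accessible" definition could be a small per-bus structure along these lines:

```scala
import spinal.lib.bus.misc.SizeMapping

// Illustrative only: one accessibility definition per bus, since i$ / d$ /
// peripheral can each sit behind a different interconnect and memory map.
case class BusAccessibleRegions(
  fetch      : Seq[SizeMapping], // regions the i$ may refill from
  data       : Seq[SizeMapping], // regions the d$ may refill from / write back to
  peripheral : Seq[SizeMapping]  // regions reachable through the IO bus
)
```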

Overall, I would say that as a workaround, just setting the ioRegion to everything which is not cached should be good.
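
As a sketch of that workaround, assuming the ioRange is specified as an address predicate (UInt => Bool) as in similar SpinalHDL cores; the memory map below (256 MiB of cacheable RAM at 0x80000000) is purely illustrative:

```scala
import spinal.core._

object IoRangeWorkaround {
  // Illustrative memory map: 256 MiB of cacheable RAM at 0x80000000.
  val ramBase = BigInt("80000000", 16)
  val ramSize = BigInt("10000000", 16)

  // Everything that is NOT the cacheable RAM (peripherals and non-existent
  // regions alike) is declared as IO, so the L1 never tries to cache it.
  val ioRange: UInt => Bool = addr =>
    !(addr >= U(ramBase,           addr.getWidth bits) &&
      addr <  U(ramBase + ramSize, addr.getWidth bits))
}
```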

> Moreover, would this change make store-to-load forwarding possible?

I'm not sure I understand the relation between the two?

atkarim commented 1 month ago

Thank you for your detailed reply!

> XD So it will be written. I'm not sure if the L1 d$ will write the faulty line back to memory when it needs to make room.

Normally not, the faulty line will not be written to memory. The TileLink bus will respond with rsp.error = true, and since the writeback section of the L1 cache does not take this field into account, nothing will happen.

> I don't know. The fact that it doesn't raise an exception on a faulty store => not a bug. The fact that the store still happens => Hmm, not really a bug, more like a lazy implementation which does not create side effects (=> OK).

I agree that the faulty store will not impact memory, and its value cannot be loaded into registers afterwards (that would trigger a load access fault). But for me, since no exception is raised, the execution flow deviates from the expected one (Spike). If a program has an addressing bug, such as a stack overflow, the user doesn't get an error back like a segfault. Although I agree that this is a special case because we're in bare-metal.

In the cache, I see as a side effect the eviction of legal cache lines to fill the cache with faulty lines that will never be used. So it's more of a performance concern than a functional one. There are also all the added coherence transactions.

> Hmm, yes and no, it depends. If the user defines the ioRange as everything which isn't cached (including non-existent regions), then the L1 cache will always be safe from non-existent regions. But VexiiRiscv handles this more cleanly by having explicitly two regions: ioRegion and l1Regions. So yeah, in NaxRiscv that isn't necessarily great, but still, defining everything non-cached as part of the ioRange would work well I think.

Glad to hear that Vexii already incorporates this distinction!

> Ahh right, a bit like Vexii ^^ But the idea is to not integrate it into the MMU, to allow things to stay decoupled: MMU => memory translation; PMP + region check => on the physical address after the MMU. So overall those region checks should be done in the LSU plugin itself.

> I know I integrated it into the MMU, but this isn't really right. The reason is that currently there are 3 buses on Nax: i$ / d$ / peripheral. And the thing is that each of those buses can access its own memory interconnect with a different memory mapping, which implies that each of those buses needs its own definition of "what is accessible".

I'm not sure I understand why integrating it into the MMU poses a problem. Is it a performance issue (critical path)? Or is it more conceptual?

It is already done in the MMU, even if that isn't really right for you. If we use this memRegion only in the bare-metal section of the MMU (requireMmuLockup = false), does it still seem insufficiently decoupled to you?

> Overall, I would say that as a workaround, just setting the ioRegion to everything which is not cached should be good.

We've tried, but unfortunately there are problems with the riscv-tests (machine), as you saw in the pull request. And the solution of modifying RVLS doesn't suit me: for me it's the golden model, so shouldn't it remain unmodified?

> Moreover, would this change make store-to-load forwarding possible?

> I'm not sure I understand the relation between the two?

By adding this address check, we don't have to wait to be denied by the memory interconnect. So if a load to the same address comes after a store, we can forward the data to the load from the store queue without fear of missing a load access fault, since the store will trigger a store fault, and the load will too. Of course this doesn't work for IO accesses.
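
To make the intent concrete, here is a toy sketch (not the NaxRiscv LSU) of the forwarding condition that the memRegion check would make safe. All ports are illustrative; a real store queue would of course compare against every pending entry and handle sizes and byte masks.

```scala
import spinal.core._

// Toy model: forward data from one pending (non-IO) store to a younger load
// hitting the same address. With the memRegion check, a faulty address would
// already have trapped, so the forwarding cannot hide a load access fault.
case class StoreToLoadForwardSketch() extends Component {
  val io = new Bundle {
    val storeValid   = in  Bool()
    val storeAddress = in  UInt (32 bits)
    val storeData    = in  Bits (32 bits)
    val storeIsIo    = in  Bool()          // IO accesses are never forwarded
    val loadValid    = in  Bool()
    val loadAddress  = in  UInt (32 bits)
    val forward      = out Bool()
    val forwardData  = out Bits (32 bits)
  }
  io.forward     := io.storeValid && io.loadValid &&
                    io.storeAddress === io.loadAddress && !io.storeIsIo
  io.forwardData := io.storeData
}
```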

We tested the addition of memRegion, and it passed all the NaxRiscvRegression tests, as well as the same tests with SocSim. The only problem is that the value of memRegion will be different between SocSim and NaxSoc, because the memory base address is different.

Dolu1990 commented 1 month ago

> But for me, since no exception is raised, the execution flow deviates from the expected one (Spike).

As far as I have seen, other CPUs have the same behaviour (no store access fault trap), but instead they will raise a non-maskable interrupt (imprecise exception).

> I'm not sure I understand why integrating it into the MMU poses a problem. Is it a performance issue (critical path)? Or is it more conceptual?

> We've tried, but unfortunately there are problems with the riscv-tests (machine), as you saw in the pull request.

Ahhh, that is curious. Can you provide a way for me to reproduce it (exactly) via a git branch to clone and your commands to run? (I don't know where I'm swimming / stack overflow)

> So if a load to the same address comes after a store, we can forward the data to the load from the store queue without fear of missing a load access fault.

Why would the CPU need to wait on the store's success before bypassing the data to a future load? Since store faults aren't caught anyway, things are already bad (and when things are bad, more bad things don't really matter).

> memRegion

Nice :D Do you have a PR?

> The only problem is that the value of memRegion will be different between SocSim and NaxSoc, because the memory base address is different.

Hmm, in what way is it a problem?

Dolu1990 commented 1 month ago

(Note: I will be traveling / on holiday from tomorrow until the end of next week.)

atkarim commented 1 month ago

> As far as I have seen, other CPUs have the same behaviour (no store access fault trap), but instead they will raise a non-maskable interrupt (imprecise exception).

Oh right, thank you for that information; it's true that I didn't look at how it works on other cores. And perhaps it's also caught by the PMP when they use it.

> We've tried, but unfortunately there are problems with the riscv-tests (machine), as you saw in the pull request.

> Ahhh, that is curious. Can you provide a way for me to reproduce it (exactly) via a git branch to clone and your commands to run? (I don't know where I'm swimming / stack overflow)

The problem is here: #115. The machine test fails because it wants to test a store access fault on memory, so Nax and RVLS trigger a trap for this access, as intended.

However, as we modify the ioRange, the tested address 0x1230 is pushed by Nax into the RVLS ioQueue, but it is never popped by RVLS, since for RVLS it is not an IO address. Then, when the machine test stores to 0x1fff_ffff, which is an IO address for both, an error appears because the ioQueue is misaligned: Nax pushed the correct address 0x1fff_ffff, but RVLS popped the stale address 0x1230.

The RVLS IO mapping is defined from the SocDemo mapping and not from the ioRange, and it's difficult to change this.

With memRegion enabled, we don't have this problem: storing to 0x1230 triggers a trap as expected, and the address is not pushed into the ioQueue. The queue misalignment therefore disappears.
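
To show what I mean by the misalignment, here is a tiny software-only toy model (plain Scala, not the actual Nax/RVLS trace-checking code); the ioRange predicates and addresses are just the ones from the test above:

```scala
import scala.collection.mutable

object IoQueueMismatch extends App {
  val ioQueue = mutable.Queue[Long]() // addresses pushed by Nax, popped by RVLS

  def naxStore(addr: Long, naxIsIo: Long => Boolean): Unit =
    if (naxIsIo(addr)) ioQueue.enqueue(addr)

  def rvlsStore(addr: Long, rvlsIsIo: Long => Boolean): Unit =
    if (rvlsIsIo(addr)) {
      val front = ioQueue.dequeue()
      if (front != addr)
        println(f"ioQueue misaligned: front is 0x$front%x but RVLS sees an IO store to 0x$addr%x")
    }

  // ioRange widened on the Nax side only: 0x1230 counts as IO for Nax,
  // but not for RVLS (whose mapping comes from SocDemo).
  val naxIsIo:  Long => Boolean = a => a < 0x80000000L
  val rvlsIsIo: Long => Boolean = a => a == 0x1fffffffL

  naxStore(0x1230L, naxIsIo);     rvlsStore(0x1230L, rvlsIsIo)     // Nax pushes, RVLS skips
  naxStore(0x1fffffffL, naxIsIo); rvlsStore(0x1fffffffL, rvlsIsIo) // RVLS pops stale 0x1230
}
```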

> Why would the CPU need to wait on the store's success before bypassing the data to a future load? Since store faults aren't caught anyway, things are already bad (and when things are bad, more bad things don't really matter).

That's what I understood from your previous answer https://github.com/SpinalHDL/NaxRiscv/issues/43#issuecomment-1709653234.

> The only problem is that the value of memRegion will be different between SocSim and NaxSoc, because the memory base address is different.

> Hmm, in what way is it a problem?

It's not really a problem; I think we just need a different memRegion in NaxSoc. What I wanted to say is that we haven't tested the change with NaxSoc yet, so we'll test it before making a PR, to make sure we don't break everything.

Enjoy your holidays!