SERV liveness check - Githubissues

olofk commented 1 year ago

Hi there,

I was just about to roll up my sleeves and fix the SERV support when I noticed that @KrystalDelusion just had done pulled in my fixes the other week. Awesome! Thanks for that. I also noticed that the liveness check was disabled, which matches my memory that I never managed to fix that. So my question is, how is the liveness check supposed to work in the first place? If I understand correctly (which might very well not be the case) it looks like it always fails because the second instruction fails to complete, which in turn is caused by the testbench just delays the response from the memory long enough to always make this fail.

There seems to be a safeguar against this by defining MEMIO_FAIRNESS to set a maximum rersponse time for the memory, but if I define that, I get Assumptions are unsatisfiable! instead and no trace at all. Could someone with a bit more insight into this help me pin down what the actual issue is to pass this final check?

KrystalDelusion commented 1 year ago

My understanding of the liveness check is that it is verifying that the core never hangs indefinitely, or more accurately that a single instruction will never take more than <check> - <trigger> cycles to complete where <check> is the third depth value and <trigger> is the second (i.e. liveness 1 50 80 says check=80, trigger=50). Any external signals that could cause the core to hang (such as waiting for memory) must therefore be emulated by the wrapper.

While testing with your changes when I was merging them here, I was able to fix the unsat problem by extending the timeout_ibus and timeout_dbus registers to 4 or 5 bits. This was able to provide a counterexample trace, but I didn't have enough knowledge of SERV to track down the issue. I think last time I was testing I tested up to a check depth of 150 and then went down a rabbit hole of looking into the MDU to see if that was causing issues before giving up.

I had another quick look at this just now trying to replicate what I did previously and I managed to get it to pass the test with check=160 (or trigger=40, check=150), so I guess it just needed a few extra cycles ? Feel free to have a look at the pr #23 ! There may be a better way to handle it so that if it does fail in the future it doesn't have to run quite so many cycles of tracing.

olofk commented 1 year ago

Aha! I see. That makes sense. So in the case of SERV, I guess the longest possible instruction would be a right shift that has a worst-case execution time of ~96 cycles + some memory access overhead. This could explain why we need a 100+ cycle window between the trigger and check.

Still haven't got a clue why the length of the bus timeout counters has anything to do with this, but I tried your changes at it works for me as well, and the test doesn't take all that long to complete, so I'm all happy with this

KrystalDelusion commented 1 year ago

Because it needs to complete an instruction in the trigger cycle, I think what’s happening is that it is relying on the timeout to make the instruction take long enough. My guess is that making the trigger depth longer would allow for more than one instruction to complete and relieve pressure on the timeout and allow it to be smaller, but I haven’t played around with it enough to be sure.

YosysHQ / riscv-formal

SERV liveness check #20