mupen64plus: memory loads and branch delay slot exceptions not optimized

Original report from Nebuleon

I believe I have a proposal to optimise branch delay slot exception checking in 
mupen64plus, please see this paste and tell me if what I said on how to proceed 
is sane:
I know there's a Dynarec (which I call Old Dynarec in contrast with Ari64's New 
Dynarec, so yeah, Old Dynarec) and also the New Dynarec to deal with, but I 
believe the Old Dynarec would not be affected by the double setting of 
[delay_slot = 0;], and the New Dynarec already has its own exception handler
Any resemblance between the addresses and instructions discussed in the paste 
and real games is purely fortuitous. DISCLAIMER'D.

Let's imagine this instruction:

  LW    $7, 0($7)   ; char* line = * (char**) next_msg_line;

and, because this instruction is in Conker or some game like that, which uses 
the MMU and TLB, it's in a virtual page. Let's say it's at 0x1400C408:

  1400C408:  LW    $7, 0($7)   ; char* line = * (char**) next_msg_line;

Let's also say that the initial value of $7 is 0x15214000, and that memory at 
that address is unmapped.

Run alone, without being in the delay slot of a branch, LW is implemented like 
this:

[mupen64plus-core/src/r4300/interpreter_r4300.def]
  const unsigned int lsaddr = (unsigned int)(iimmediate + irs32);
  long long int *lsrtp = PC->f.i.rt;
  ADD_TO_PC(1);
  address = (unsigned int) lsaddr;
  rdword = (unsigned long long *) lsrtp;
  read_word_in_memory();
  if (address)
    sign_extended(*lsrtp);

It first resolves the address contained in the base register, adds the offset, 
then resolves the target register, before adding 1 instruction to the PC 
(putting it beyond the LW, at 1400C40C), reading from native memory at the 
appropriate place for the N64 address, then, if the address is non-zero, 
sign-extend the high part of the target register.

Wait... why 'if the address is non-zero'? We set it to something, and it was 
clearly not zero already!

The answer is in the memory reader functions. The read_nomem function, which is 
called when reading from virtual memory, is implemented like this:

[mupen64plus-core/src/memory/memory.c]
  address = virtual_to_physical_address(address,0);
  if (address == 0x00000000) return;
  read_word_in_memory();

If the address is unmapped, virtual_to_physical_address will call 
TLB_refill_exception, which will set the PC to an exception handler, and will 
set EPC back to the start of the LW because it's a memory read and not an 
instruction fetch. (That's the role of the second parameter to 
virtual_to_physical_address, which is 0 here.)

In the case of reading from $7 and writing the value to $7, the write will not 
have occurred, instead going straight to the exception handler. On return, the 
exception handler will retry the LW, so the value of $7 MUST stay as it was.

[mupen64plus-core/src/r4300/exception.c]
  if(w != 2) EPC-=4;

~ IN JIT CODE ~, the PC has to be set correctly if an exception could occur 
while loading memory, but it doesn't have to be set if the read is occurring 
from permanently-mapped memory like RDRAM.

In the possibly-exception-causing case, all temporary registers on the host 
must be discarded to allow C functions to be called, then the memory reader 
function is called, and then the 'address' variable must be checked. If the 
'address' variable contains 0, the code must save all N64 registers whose value 
is contained in native registers, pop callee-saved registers expected to be 
used by the interpreter loop, and return. PC will have been set to an exception 
handler.

Otherwise, no exception occurs, and the write of the register and the rest of 
the block can continue as usual.

~ NOW IMAGINE THIS LW ~ as a delay slot of a branch:

  1400C404:  BGEZ  $0, -7024   ; } -> while (true) {, in a loop with a 'break;'
  1400C408:  LW    $7, 0($7)   ; char* line = * (char**) next_msg_line;

and further imagine that the instruction at 1400A898 will be a null check for 
$7, but that's not important.

What's important is how an exception in the delay slot is dealt with. All 
branches are implemented like this when done within a 4 KiB page in the cached 
interpreter:

[mupen64plus-core/src/r4300/r4300.c]
  const int take_jump = (condition);
  const unsigned int jump_target = (destination);
  long long int *link_register = (link);
  if (cop1 && check_cop1_unusable()) return;
  if (!likely || take_jump)
  {
    PC++;
    delay_slot=1;
    UPDATE_DEBUGGER();
    PC->ops();
    update_count();
    delay_slot=0;
    if (take_jump && !skip_jump)
    {
      if (link_register != &reg[0])
      {
        *link_register=PC->addr;
        sign_extended(*link_register);
      }
      PC=actual->block+((jump_target-actual->start)>>2);
    }
  }
  else
  {
    PC += 2;
    update_count();
  }
  last_addr = PC->addr;
  if (next_interupt <= Count) gen_interupt();

So first, it determines whether the branch is taken (which it is, because $0 is 
always >= 0), but then it does this:
a) Set the PC to one instruction ahead, with the delay_slot flag set to 1.
b) Call the function responsible for implementing the opcode at the delay slot. 
This is our LW.
c) When it returns, unset the delay_slot flag and update the Count register 
with the range of instructions that was being executed (starting with last_addr 
and ending at PC->addr).
d) If the instruction in the delay slot took an exception, skip_jump is set to 
non-zero. So, only if NO exception was taken, the PC is set to the target of 
the branch. Otherwise, it was set by exception_general or TLB_refill_exception.
e) last_addr is updated to point to the new PC, either the branch target or the 
start of the exception handler. This is for the next Count update; see 'c)'.
f) If an interrupt needs to occur, gen_interupt [sic] is called.

The delay_slot flag is required for exception_general and TLB_refill_exception 
to properly set EPC or ErrorEPC and the Cause register. So it's important to 
set it to 1 or 0 when appropriate.

Then our LW runs, sets its PC ahead, calls its memory accessor, which decides 
that an exception occurred. The exception handling path is like this:
a) read_nomem has called TLB_refill_exception, using up the delay_slot 
variable, setting EPC or ErrorEPC as well as Cause, setting the PC to the 
cached interpreter's information structure for the exception handler, setting 
skip_jump to non-zero, setting next_interupt to 0, updating Count and setting 
last_addr.
b) read_nomem returns 0, indicating that the register must not be written.
c) LW catches this and does not do the write, then returns to BGEZ at 'c)' 
above.
d) BGEZ updates the Count register with the range of instructions at last_addr 
.. PC->addr, which is 0 instructions long. (This is wasteful, so it can be 
eliminated in JIT code.)
e) delay_slot is unset, because we are not executing a delay slot anymore (and 
the exception handler does not start with a delay slot!)
f) skip_jump has been set to non-zero, so the PC update for the branch target 
is skipped.
g) last_addr is set to the address of PC, to start the next range of 
instructions to update Count with. (In the case of the exception handler, this 
is wasteful because exception_general has already filled last_addr, so it can 
be eliminated in JIT code.)
h) Because next_interupt was set to 0 by 'a)' of the exception handling path, 
an interrupt is made to occur unconditionally.
i) gen_interupt sees that skip_jump is non-zero, and then skips the interrupt 
it was about to generate. It actually works to restore the next_interupt cycle 
count from the interrupt queue, unsets skip_jump, and restores PC to the 
address of the exception handler that was actually stored in skip_jump all 
along:

[mupen64plus-core/src/r4300/interupt.c]
  if (skip_jump)
  {
    unsigned int dest = skip_jump;
    skip_jump = 0;

    if (q->count > Count || (Count - q->count) < 0x80000000)
      next_interupt = q->count;
    else
      next_interupt = 0;

    last_addr = dest;
    generic_jump_to(dest);
    return;
  }

This is highly wasteful, and could be avoided entirely by instead skipping the 
interrupt if an exception has occurred in the delay slot:

    if (take_jump && !skip_jump) ...;
    else if (skip_jump) { skip_jump = 0; return; }

skip_jump would still be set, but would only be an exception indicator; 
gen_interupt would not need to check it anymore.

~ IN JIT CODE, THIS BRANCH ~ would be made much more efficient if the interrupt 
check processing can be skipped if an exception occurred, and if LW (and other 
possibly-exception-causing opcodes) could just return.

The branch would be emitted like this:
a) Emit the delay slot, LW. LW is a load-perform-store opcode emitted like this:
  a) Emit code to read the value of the base address.
  b) Emit code to add the offset, optionally. (In our case it's 0.)
  c) Emit code to check whether the resulting address is in N64 RDRAM, and jump to 'e)' if not.
  d) Emit code to load from RDRAM without checking for exceptions then jump to the end of LW at 'i)'.
  e) Emit code to calculate the address of the memory reader function, then call it.
  f) Emit code to check, on return, that 'address' is non-zero, and if not, jump ahead to a special exception handling area at 'h)'. It is expected that 'address' is non-zero. Falling through in that case allows speculative execution to work on some processors, as well as to avoid branch penalties.
  g) Emit code to store the result to the target register, then skip past the exception handling area that follows at 'h)'.
  h) Emit code to return to the cached interpreter loop. Its next iteration will get to the exception handler.
  i) End of LW.
b) From here on, if an exception occurred in the delay slot, the delay slot 
will have returned, with delay_slot still set to 1. Emit code to save N64 
registers used by previous instructions in the same block, if appropriate, then 
pop callee-saved registers expected to be used by the interpreter loop.
c) Emit code to update Count.
d) Emit code to update the PC.
e) Emit code to return to the interpreter loop.

A modification is required to exception_general due to delay_slot being still 
set to 1: if the delay_slot is 1, set it to 0 unconditionally. This would also 
allow the delay_slot = 0; in the cached interpreter functions for branches to 
be skipped in the case of an exception occurring in the delay slot as well.
Original issue reported on code.google.com by ursula.a...@web.de on 18 May 2014 at 6:44
Galiro / mupen64plus

mupen64plus: memory loads and branch delay slot exceptions not optimized #613