Original report from Nebuleon
I believe I have a proposal to optimise branch delay slot exception checking in
mupen64plus, please see this paste and tell me if what I said on how to proceed
is sane:
I know there's a Dynarec (which I call Old Dynarec in contrast with Ari64's New
Dynarec, so yeah, Old Dynarec) and also the New Dynarec to deal with, but I
believe the Old Dynarec would not be affected by the double setting of
[delay_slot = 0;], and the New Dynarec already has its own exception handler
Any resemblance between the addresses and instructions discussed in the paste
and real games is purely fortuitous. DISCLAIMER'D.
Let's imagine this instruction:
LW $7, 0($7) ; char* line = * (char**) next_msg_line;
and, because this instruction is in Conker or some game like that, which uses
the MMU and TLB, it's in a virtual page. Let's say it's at 0x1400C408:
1400C408: LW $7, 0($7) ; char* line = * (char**) next_msg_line;
Let's also say that the initial value of $7 is 0x15214000, and that memory at
that address is unmapped.
Run alone, without being in the delay slot of a branch, LW is implemented like
this:
[mupen64plus-core/src/r4300/interpreter_r4300.def]
const unsigned int lsaddr = (unsigned int)(iimmediate + irs32);
long long int *lsrtp = PC->f.i.rt;
ADD_TO_PC(1);
address = (unsigned int) lsaddr;
rdword = (unsigned long long *) lsrtp;
read_word_in_memory();
if (address)
sign_extended(*lsrtp);
It first resolves the address contained in the base register, adds the offset,
then resolves the target register, before adding 1 instruction to the PC
(putting it beyond the LW, at 1400C40C), reading from native memory at the
appropriate place for the N64 address, then, if the address is non-zero,
sign-extend the high part of the target register.
Wait... why 'if the address is non-zero'? We set it to something, and it was
clearly not zero already!
The answer is in the memory reader functions. The read_nomem function, which is
called when reading from virtual memory, is implemented like this:
[mupen64plus-core/src/memory/memory.c]
address = virtual_to_physical_address(address,0);
if (address == 0x00000000) return;
read_word_in_memory();
If the address is unmapped, virtual_to_physical_address will call
TLB_refill_exception, which will set the PC to an exception handler, and will
set EPC back to the start of the LW because it's a memory read and not an
instruction fetch. (That's the role of the second parameter to
virtual_to_physical_address, which is 0 here.)
In the case of reading from $7 and writing the value to $7, the write will not
have occurred, instead going straight to the exception handler. On return, the
exception handler will retry the LW, so the value of $7 MUST stay as it was.
[mupen64plus-core/src/r4300/exception.c]
if(w != 2) EPC-=4;
~ IN JIT CODE ~, the PC has to be set correctly if an exception could occur
while loading memory, but it doesn't have to be set if the read is occurring
from permanently-mapped memory like RDRAM.
In the possibly-exception-causing case, all temporary registers on the host
must be discarded to allow C functions to be called, then the memory reader
function is called, and then the 'address' variable must be checked. If the
'address' variable contains 0, the code must save all N64 registers whose value
is contained in native registers, pop callee-saved registers expected to be
used by the interpreter loop, and return. PC will have been set to an exception
handler.
Otherwise, no exception occurs, and the write of the register and the rest of
the block can continue as usual.
~ NOW IMAGINE THIS LW ~ as a delay slot of a branch:
1400C404: BGEZ $0, -7024 ; } -> while (true) {, in a loop with a 'break;'
1400C408: LW $7, 0($7) ; char* line = * (char**) next_msg_line;
and further imagine that the instruction at 1400A898 will be a null check for
$7, but that's not important.
What's important is how an exception in the delay slot is dealt with. All
branches are implemented like this when done within a 4 KiB page in the cached
interpreter:
[mupen64plus-core/src/r4300/r4300.c]
const int take_jump = (condition);
const unsigned int jump_target = (destination);
long long int *link_register = (link);
if (cop1 && check_cop1_unusable()) return;
if (!likely || take_jump)
{
PC++;
delay_slot=1;
UPDATE_DEBUGGER();
PC->ops();
update_count();
delay_slot=0;
if (take_jump && !skip_jump)
{
if (link_register != ®[0])
{
*link_register=PC->addr;
sign_extended(*link_register);
}
PC=actual->block+((jump_target-actual->start)>>2);
}
}
else
{
PC += 2;
update_count();
}
last_addr = PC->addr;
if (next_interupt <= Count) gen_interupt();
So first, it determines whether the branch is taken (which it is, because $0 is
always >= 0), but then it does this:
a) Set the PC to one instruction ahead, with the delay_slot flag set to 1.
b) Call the function responsible for implementing the opcode at the delay slot.
This is our LW.
c) When it returns, unset the delay_slot flag and update the Count register
with the range of instructions that was being executed (starting with last_addr
and ending at PC->addr).
d) If the instruction in the delay slot took an exception, skip_jump is set to
non-zero. So, only if NO exception was taken, the PC is set to the target of
the branch. Otherwise, it was set by exception_general or TLB_refill_exception.
e) last_addr is updated to point to the new PC, either the branch target or the
start of the exception handler. This is for the next Count update; see 'c)'.
f) If an interrupt needs to occur, gen_interupt [sic] is called.
The delay_slot flag is required for exception_general and TLB_refill_exception
to properly set EPC or ErrorEPC and the Cause register. So it's important to
set it to 1 or 0 when appropriate.
Then our LW runs, sets its PC ahead, calls its memory accessor, which decides
that an exception occurred. The exception handling path is like this:
a) read_nomem has called TLB_refill_exception, using up the delay_slot
variable, setting EPC or ErrorEPC as well as Cause, setting the PC to the
cached interpreter's information structure for the exception handler, setting
skip_jump to non-zero, setting next_interupt to 0, updating Count and setting
last_addr.
b) read_nomem returns 0, indicating that the register must not be written.
c) LW catches this and does not do the write, then returns to BGEZ at 'c)'
above.
d) BGEZ updates the Count register with the range of instructions at last_addr
.. PC->addr, which is 0 instructions long. (This is wasteful, so it can be
eliminated in JIT code.)
e) delay_slot is unset, because we are not executing a delay slot anymore (and
the exception handler does not start with a delay slot!)
f) skip_jump has been set to non-zero, so the PC update for the branch target
is skipped.
g) last_addr is set to the address of PC, to start the next range of
instructions to update Count with. (In the case of the exception handler, this
is wasteful because exception_general has already filled last_addr, so it can
be eliminated in JIT code.)
h) Because next_interupt was set to 0 by 'a)' of the exception handling path,
an interrupt is made to occur unconditionally.
i) gen_interupt sees that skip_jump is non-zero, and then skips the interrupt
it was about to generate. It actually works to restore the next_interupt cycle
count from the interrupt queue, unsets skip_jump, and restores PC to the
address of the exception handler that was actually stored in skip_jump all
along:
[mupen64plus-core/src/r4300/interupt.c]
if (skip_jump)
{
unsigned int dest = skip_jump;
skip_jump = 0;
if (q->count > Count || (Count - q->count) < 0x80000000)
next_interupt = q->count;
else
next_interupt = 0;
last_addr = dest;
generic_jump_to(dest);
return;
}
This is highly wasteful, and could be avoided entirely by instead skipping the
interrupt if an exception has occurred in the delay slot:
if (take_jump && !skip_jump) ...;
else if (skip_jump) { skip_jump = 0; return; }
skip_jump would still be set, but would only be an exception indicator;
gen_interupt would not need to check it anymore.
~ IN JIT CODE, THIS BRANCH ~ would be made much more efficient if the interrupt
check processing can be skipped if an exception occurred, and if LW (and other
possibly-exception-causing opcodes) could just return.
The branch would be emitted like this:
a) Emit the delay slot, LW. LW is a load-perform-store opcode emitted like this:
a) Emit code to read the value of the base address.
b) Emit code to add the offset, optionally. (In our case it's 0.)
c) Emit code to check whether the resulting address is in N64 RDRAM, and jump to 'e)' if not.
d) Emit code to load from RDRAM without checking for exceptions then jump to the end of LW at 'i)'.
e) Emit code to calculate the address of the memory reader function, then call it.
f) Emit code to check, on return, that 'address' is non-zero, and if not, jump ahead to a special exception handling area at 'h)'. It is expected that 'address' is non-zero. Falling through in that case allows speculative execution to work on some processors, as well as to avoid branch penalties.
g) Emit code to store the result to the target register, then skip past the exception handling area that follows at 'h)'.
h) Emit code to return to the cached interpreter loop. Its next iteration will get to the exception handler.
i) End of LW.
b) From here on, if an exception occurred in the delay slot, the delay slot
will have returned, with delay_slot still set to 1. Emit code to save N64
registers used by previous instructions in the same block, if appropriate, then
pop callee-saved registers expected to be used by the interpreter loop.
c) Emit code to update Count.
d) Emit code to update the PC.
e) Emit code to return to the interpreter loop.
A modification is required to exception_general due to delay_slot being still
set to 1: if the delay_slot is 1, set it to 0 unconditionally. This would also
allow the delay_slot = 0; in the cached interpreter functions for branches to
be skipped in the case of an exception occurring in the delay slot as well.
Original issue reported on code.google.com by ursula.a...@web.de on 18 May 2014 at 6:44
Original issue reported on code.google.com by
ursula.a...@web.de
on 18 May 2014 at 6:44