Open ghost opened 1 year ago
So, this can be worked around by hand: use the Add Reference dialog, create a function typedef, and assign code as the segment instead of mem. What is not clear is which reference type should be used: probably not DATA, but COMPUTED_CALL? UNCONDITIONAL_CALL?
Of course, doing all of this by hand is extremely suboptimal, so I might write a script for this purpose.
How do you tell Ghidra "this manual reference is a pointer used to call a function at the address it points to", so that xrefs can be populated properly? (otherwise all you get is a "XREF: mem:...." which does not contribute any call flow information).
E.g., this should give the analyzer a point of reference to 1) detect functions and 2) adjust lowbyte/highbyte EIND-style calls obtained from the data structure with function pointers. Ultimately, the decompiler could accurately show you the actual function being called, while the disassembly shows the load/call steps (no way around displaying asm as it is... although some nice decorative or clickable elements could be neat to have).
Thanks!
So, a quick update:
I was able to get around this issue (but not the limitation/problem Ghidra has in adequately processing word/16bit integers and pointers for some instructions, like sts/ldi in Xmega, as far as I have confirmed) through scripting:
# NOTE: af, fm, listing and the helpers check_registers, str_rmprefix,
# comment_instruction and handle_fptr live in the part of the script not shown here.
def check_instructions(func, instructions):
    pointer_count = 0
    instr_count = 0
    first_fptr_addr = None
    prev_target_addr = None
    prev_reg_1 = None
    prev_reg_2 = None
    for instruction in instructions:
        addr = instruction.getAddress()
        oper = instruction.getMnemonicString()
        if oper == 'ldi':
            # find second ldi
            ldi_2 = instructions.next()
            # a second ldi ins indicates a potential ptr to code
            if ldi_2 and ldi_2.getMnemonicString() == 'ldi':
                # we are only interested in R24/R25 for now
                reg_1 = instruction.getRegister(0).getName()
                reg_2 = ldi_2.getRegister(0).getName()
                if check_registers(reg_1, reg_2):
                    # the next two instructions will be sts/sts
                    sts_1 = instructions.next()
                    if sts_1.getMnemonicString() == 'sts':
                        sts_2 = instructions.next()
                        if sts_2.getMnemonicString() == 'sts':
                            ldi_1 = instruction
                            # we got it. now we have the full set ldi/sts
                            ldi_1_input = str_rmprefix(str(ldi_1.getInputObjects()[0]), '0x')
                            ldi_2_input = str_rmprefix(str(ldi_2.getInputObjects()[0]), '0x')
                            sts_1_output = sts_1.getOpObjects(0)[0]
                            sts_2_output = sts_2.getOpObjects(0)[0]
                            # convert to code addr via AddressFactory
                            addr_str = "0x{}{}".format(ldi_2_input, ldi_1_input)
                            target_addr = af.getAddress(addr_str)
                            # the target address for the ptr is ldi2/ldi1 (HI/LOW)
                            #print("Pointer to: {}".format(target_addr))
                            #print(" STS: {} => {}".format(ldi_1_input, sts_1_output))
                            #print(" STS: {} => {}".format(ldi_2_input, sts_2_output))
                            desc = "Function pointer to: {}".format(target_addr)
                            if fm.getFunctionContaining(target_addr):
                                desc += " ({})".format(fm.getFunctionContaining(target_addr).getName())
                            comment_instruction(sts_1, desc)
                            handle_fptr(sts_1_output, target_addr)
                            if pointer_count == 0:
                                first_fptr_addr = sts_1_output
                            prev_target_addr = target_addr
                            prev_reg_1 = reg_1
                            prev_reg_2 = reg_2
                            pointer_count += 1
        elif oper == 'sts' and prev_target_addr is not None:
            # find second sts (repeat call, reuses previous regs)
            sts_2 = instructions.next()
            if sts_2 and sts_2.getMnemonicString() == 'sts':
                if check_registers(prev_reg_1, prev_reg_2):
                    sts_1_output = instruction.getOpObjects(0)[0]
                    sts_2_output = sts_2.getOpObjects(0)[0]
                    print(sts_2_output)
                    # the repeat store reuses the target recorded for the last ldi/ldi pair
                    print("Repeat STS pointer to: {} => {}".format(sts_2_output, prev_target_addr))
                    desc = "Repeat function pointer to: {}".format(prev_target_addr)
                    comment_instruction(instruction, desc)
                    handle_fptr(sts_1_output, prev_target_addr)
        instr_count += 1
    if pointer_count > 0:
        desc = "Function pointers at {}".format(first_fptr_addr)
        cu = listing.getCodeUnitAt(func.getEntryPoint())
        cu.setComment(CodeUnit.EOL_COMMENT, desc)
    return pointer_count
def main():
    state = getState()
    addr = state.getCurrentAddress()
    if fm.isInFunction(addr):
        # Retrieve the function object for processing
        func = fm.getFunctionContaining(addr)
        addrSet = func.getBody()
        instructions = listing.getCodeUnits(addrSet, True)
        print("Processing function at {}".format(addr))
        check_instructions(func, instructions)

main()
Intentionally incomplete so that third-parties interested in seeing this fixed report issues to you instead of scooping up freebies from other folks.
It's ghetto royale, since I did not feel particularly inclined towards writing a state machine (the proper approach) to handle the successive LDI/STS calls in a manner which keeps record of what is going on, and I just needed something quick.
Instead I just handle simple ldi/sts and ldi/sts/.../sts repeat sequences. The missing code handles data references and other things and is irrelevant in this context.
This is very sample-specific, but Ghidra has some trouble understanding these primitives:
Assembly is worth a thousand words:
ldi R24,0xBB // BB is LSB
ldi R25,0xAA // AA is MSB, 0xaabb is within the code segment
sts BLA_mem_512d,R24
sts BLA_mem_512d[0]+1,R25 // 512d will contain 0xaabb
And:
ldi R18,0xBB
ldi R19,0xAA
sts DAT_mem_xx92,R18
sts DAT_mem_xx93,R19
sts DAT_mem_xx96,R24
sts DAT_mem_xx97,R25
Compiler optimizations will also interleave other sts primitives that use other registers, so a state machine is the only way to handle these properly (store the last value written to each register in a dictionary or other key-value structure, retrieve it when the STS primitive is processed, rinse and repeat).
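For what it's worth, here is a minimal sketch of that register-tracking idea in plain Python, with no Ghidra API involved. Everything in it is an assumption for illustration: the `(mnemonic, dest, src)` tuple shape, the `track_pointer_stores` name, and the rule that two stores to consecutive addresses form one little-endian 16-bit value. A real script would have to build these tuples from the listing and model far more instructions.

```python
def track_pointer_stores(instructions):
    """Pair byte-wide `sts` stores into 16-bit values.

    `instructions` is an iterable of hypothetical (mnemonic, dest, src) tuples:
      ("ldi", "R24", 0xBB)    -> register R24 now holds 0xBB
      ("sts", 0x2096, "R24")  -> store R24 at data address 0x2096
    Returns a list of (low_address, value) tuples.
    """
    regs = {}        # register name -> last immediate loaded with ldi
    pending = None   # (address, byte) of a store still waiting for its high byte
    found = []
    for mnemonic, dest, src in instructions:
        if mnemonic == "ldi":
            regs[dest] = src & 0xFF
        elif mnemonic == "sts":
            if src not in regs:
                continue  # register contents unknown, nothing to pair
            value = regs[src]
            if pending is not None and dest == pending[0] + 1:
                low_addr, low_byte = pending
                found.append((low_addr, (value << 8) | low_byte))
                pending = None
            else:
                pending = (dest, value)
        else:
            # Any other instruction is treated as clobbering its destination.
            regs.pop(dest, None)
    return found


# Hypothetical addresses and values, loosely following the second listing above.
example = [
    ("ldi", "R24", 0xBB), ("ldi", "R25", 0xAA),
    ("ldi", "R18", 0x34), ("ldi", "R19", 0x12),
    ("sts", 0x2092, "R18"), ("sts", 0x2093, "R19"),
    ("sts", 0x2096, "R24"), ("sts", 0x2097, "R25"),
]
for address, value in track_pointer_stores(example):
    print("0x{:04x} <- 0x{:04x}".format(address, value))
```

Because the register contents are kept in a dictionary, the repeat sts/sts pair that reuses R24/R25 is resolved the same way as the first one, without a special case.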
(last but not least) Thank you for working on the AVR8 issues!
A test case:
Given a PTR_struct_mgee data variable that is a runtime-configured pointer to a struct mgee structure containing several function pointers, each with a proper typedef set in the DTM for Ghidra, the following comes up in the decompiler:
uVar2 = EIND;
(*(code *)CONCAT12(uVar2,*(undefined2 *)
(CONCAT11(PTR_struct_mgee._1_1_,
PTR_struct_mgee._0_1_) + 6)))
For:
lds Zlo,PTR_struct_mgee = mem:(set by hand/via ref)
lds Zhi,PTR_struct_mgee
ldd R0,Z+0x6
ldd Zhi,Z+0x7
...
eicall
The ideal output should be:
somevar = PTR_struct_mgee->third_func(blah);
Assuming sizeof(pointer)=2 (16-bit, as is the case here), the +6 offset would point at function pointer n = 6/sizeof(pointer) = 3, AFAIK.
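The offset arithmetic above, as a tiny plain-Python sketch. The `member_index` name is made up for illustration, and it assumes the structure holds nothing but tightly packed 2-byte pointers starting at offset 0.

```python
POINTER_SIZE = 2  # bytes; assumption taken from the post (16-bit pointers)


def member_index(offset, pointer_size=POINTER_SIZE):
    """Return the zero-based slot of the pointer stored at `offset`."""
    if offset < 0 or offset % pointer_size:
        raise ValueError("offset does not land on a pointer slot")
    return offset // pointer_size


print(member_index(6))
```

An offset that is not a multiple of the pointer size raises an error, which would be a hint that the structure has a leading field or padding that this sketch does not model.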
AVR8 programs might employ structures that are function pointer tables with a one-byte boolean field at the beginning, followed by a fixed number of function pointers pointing into the code segment.
These are not adequately detected. I will stick to using decompiler output for simplification purposes:
Looking back at the data (sram in this case) segment, we will observe that these are loaded MSByte/LSByte.
This is the fun part: I created function typedefs after manually navigating towards the actual function inside the code segment. The good news is that this could be used to aid the analyzer in finding previously unknown functions in AVR8, as right now it struggles to find those that do not begin with cookie cutter prologues.
Now the problem: it fails to differentiate code and mem segments. It defaults to mem. Navigating by clicking takes me to the mem segment... and obviously the actual references to the functions are not taken into account.
I initialized the values manually, as I had manually initialized the entire sram to 0 and set the initialized flag for it. I know this is a potential headache in the making, but it also helps to figure things out whenever a function or executable code attempts to access it and then writes data to the segment. And you can set/overwrite the values later on as needed.
Python console:
I see a great opportunity to improve auto-analysis for finding these. Most of these are EIND calls later on, by the way. But the primitive of "function pointer to executable code segment with data that disassembles into a known function or new one" seems reasonably easy to detect and leverage both for references and function detection.
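A rough sketch of that primitive in plain Python, outside of Ghidra, just to show the shape of the check. All names here (`read_fptr_table`, `classify_pointers`, `known_entries`, `code_size`) are hypothetical, the table layout (one flag byte followed by packed 16-bit pointers) is taken from the description above, and the byte order is a parameter because I am only assuming little-endian storage. Whether the stored values are byte or word addresses is not handled.

```python
import struct


def read_fptr_table(sram, base, count, little_endian=True):
    """Read one flag byte followed by `count` packed 16-bit pointers."""
    fmt = ("<" if little_endian else ">") + "B" + "H" * count
    values = struct.unpack_from(fmt, sram, base)
    return bool(values[0]), list(values[1:])


def classify_pointers(pointers, known_entries, code_size):
    """Split pointer values into known function entries and new candidates."""
    known = [p for p in pointers if p in known_entries]
    candidates = [p for p in pointers
                  if p not in known_entries and 0 < p < code_size]
    return known, candidates


# Hypothetical SRAM image with a table at offset 4.
sram = bytearray(16)
struct.pack_into("<BHHH", sram, 4, 1, 0x0120, 0xAABB, 0x0000)
flag, pointers = read_fptr_table(sram, 4, 3)
known, candidates = classify_pointers(pointers, {0x0120}, 0x10000)
print(flag, [hex(p) for p in known], [hex(p) for p in candidates])
```

Values that match known entry points could get a call-type reference into the code segment, while the remaining in-range values would be candidates for disassembly and function creation.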