NationalSecurityAgency / ghidra

Ghidra is a software reverse engineering (SRE) framework
https://www.nsa.gov/ghidra
Apache License 2.0

SLEIGH reference documentation on cpool / CPOOLREF is rather anaemic #3540

Open gsuberland opened 2 years ago

gsuberland commented 2 years ago

I'm really struggling to understand what cpool does and how it works. The existing documentation on cpool can be summarised as follows:

From sleigh_ref.html

CPOOLREF | cpool(v0,...) | Access value from the constant pool.

From sleigh_constructors.html

The constant pool operator, cpool, returns sizes, offsets, addresses, and other structural constants. It behaves like a query to the architecture about these constants. The first parameter is generally an object reference, and additional parameters are constants describing the particular query. The operator returns the requested value. In the following example, an object reference regParamC and the encoded constant METHOD_INDEX are sent as part of a query to obtain the final destination address of an object method.

:invoke_direct METHOD_INDEX,regParamC
               is inst0=0x70 ; N_PARAMS=1 & METHOD_INDEX & regParamC
{
  iv0 = regParamC;
  destination:4 = cpool( regParamC, METHOD_INDEX, $(CPOOL_METHOD));
  call [ destination ];
}

From pseudo-ops.html

This operator returns specific run-time dependent values from the constant pool. This is a concept for object-oriented instruction sets and other managed code environments, where some details about how instructions behave can be deferred until run-time and are not directly encoded in the instruction. The CPOOLREF operator acts as a query to the system to recover this type of information. The first parameter is a pointer to a specific object, and subsequent parameters are IDs or other special constants describing exactly what value is requested, relative to the object. The canonical example is requesting a method address given just an ID describing the method and a specific object, but CPOOLREF can be used as a placeholder for recovering any important value the system knows about. Details about this instruction, in terms of emulation and analysis, are necessarily architecture dependent.

I was unable to form an understanding of how cpool and the constant pool work based on the existing documentation, even when referring to usages of cpool in existing implementations.

The questions I have are:

It'd be helpful if the SLEIGH documentation around constant pools and cpool could be expanded to help answer these questions.

ghidracadabra commented 2 years ago

The cpool op is not a general-purpose operation intended for use by any architecture. Instead, it is intended to handle the "constant pool" in Java .class files and similar constructs (hence the name). In a .class file, the constant pool is a data structure that stores (among many other things) information needed to model JVM bytecode operations correctly.

A simple example is the getfield instruction, which pushes a field of an object onto the operand stack. That field can be either 4 or 8 bytes, depending on whether it's an int/long/object reference, etc. In order to determine whether it's 4 or 8 bytes, you need to look at data in the constant pool - it's not directly encoded in the bytes of the instruction. When the decompiler is processing a method in a .class file and encounters a getfield instruction, it has to pause and ask the rest of Ghidra to examine the data in the constant pool and report back how many bytes are being pushed onto the operand stack.
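To make the getfield case concrete, here is a minimal Python sketch of the lookup described above. It is a toy model, not Ghidra's implementation: `FieldRef`, `pushed_size`, and the sample pool contents are hypothetical names invented for illustration, and the sketch assumes that only long ("J") and double ("D") field descriptors are 8 bytes while everything else, object references included, is 4 bytes.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FieldRef:
    """Hypothetical, simplified stand-in for a field entry in a constant pool."""
    class_name: str
    field_name: str
    descriptor: str  # JVM field descriptor, e.g. "I" (int) or "J" (long)


def pushed_size(constant_pool: Dict[int, FieldRef], index: int) -> int:
    """Return how many bytes a getfield with this pool index would push.

    The instruction bytes carry only the index; the size has to be
    derived from the descriptor stored in the constant pool.
    """
    descriptor = constant_pool[index].descriptor
    # Assumption of this sketch: long and double are the 8-byte cases,
    # every other descriptor (int, object reference, array, ...) is 4 bytes.
    return 8 if descriptor in ("J", "D") else 4


# Made-up pool contents, keyed by constant pool index.
pool = {
    1: FieldRef("Point", "x", "I"),
    2: FieldRef("Clock", "millis", "J"),
    3: FieldRef("Node", "next", "LNode;"),
}

sizes = {index: pushed_size(pool, index) for index in pool}
```

The point of the sketch is that the same opcode yields different stack effects for indices 1 and 2, and nothing in the instruction encoding itself distinguishes them.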

Essentially, properly modelling this instruction requires information located in another part of the binary being analyzed. That other part of the binary is a data structure that itself needs to be parsed and analyzed. The cpool op is the way that the decompiler can get information from this data structure. Note that this isn't the same as handling something like an indirect call whose destination is only known at runtime.
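The query-and-answer relationship can be sketched in Python as a callback that the instruction model invokes whenever it needs pool data. Everything here is hypothetical and only illustrates the shape of the interaction: `make_resolver`, `model_instruction`, the tag constants, and the pool contents are invented for this example and do not correspond to Ghidra APIs.

```python
from typing import Callable, Dict, Tuple

# Hypothetical query tags, loosely analogous to constants such as
# CPOOL_METHOD in the SLEIGH example quoted earlier in this thread.
CPOOL_FIELD = "field"
CPOOL_METHOD = "method"

# A resolver answers "what value does pool entry `index` give for this tag?"
Resolver = Callable[[int, str], int]


def make_resolver(parsed_pool: Dict[Tuple[int, str], int]) -> Resolver:
    """Wrap an already-parsed pool so the instruction model can query it."""
    def resolve(index: int, tag: str) -> int:
        return parsed_pool[(index, tag)]
    return resolve


def model_instruction(opcode: str, index: int, resolve: Resolver) -> str:
    """Model one instruction, deferring to the resolver for missing details."""
    if opcode == "getfield":
        size = resolve(index, CPOOL_FIELD)  # answer: bytes pushed
        return f"push {size} bytes"
    if opcode == "invoke":
        target = resolve(index, CPOOL_METHOD)  # answer: method address
        return f"call 0x{target:x}"
    raise ValueError(f"unsupported opcode: {opcode}")


# Made-up parsed pool: entry 7 is an 8-byte field, entry 9 a method at 0x1000.
resolve = make_resolver({(7, CPOOL_FIELD): 8, (9, CPOOL_METHOD): 0x1000})
results = [
    model_instruction("getfield", 7, resolve),
    model_instruction("invoke", 9, resolve),
]
```

Note that the resolver answers entirely from data parsed out of the file, which is what separates this from an indirect call whose target exists only at runtime.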