Background

x86 is a segmented architecture. Every x86 instruction that reads or writes memory, except for two (lgdt and lidt), uses "logical addresses" that take the form of segment:offset. The processor turns a logical address into a linear address by adding the offset to the segment's base address. If paging is enabled, the linear addresses are then mapped via the page tables to physical addresses.

The consequence of this architecture is that an instruction like mov eax, [esi] is actually loading four bytes from ds:esi, where the ds register holds a segment selector that indexes a segment descriptor in the global descriptor table (GDT), which the global descriptor table register (GDTR) points to. The segment's base address, which is used to compute the linear address, is stored in that descriptor.
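
To make the translation concrete, here is a minimal Python sketch of how a selector:offset pair could be resolved against a raw GDT image. The names Segment, decode_descriptor, and linear_address are hypothetical and exist only for illustration. The sketch assumes GDT-only selectors and ordinary (expand-up) code or data descriptors, and it skips privilege and type checks.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    base: int   # linear address of the segment's first byte
    limit: int  # highest valid offset, in bytes


def decode_descriptor(raw: bytes) -> Segment:
    """Decode the base and limit of an 8-byte code/data segment descriptor."""
    limit_lo, base_lo, base_mid, _access, flags_limit, base_hi = struct.unpack(
        "<HHBBBB", raw
    )
    base = base_lo | (base_mid << 16) | (base_hi << 24)
    limit = limit_lo | ((flags_limit & 0x0F) << 16)
    if flags_limit & 0x80:  # G bit set: the limit counts 4 KiB pages
        limit = (limit << 12) | 0xFFF
    return Segment(base, limit)


def linear_address(gdt: bytes, selector: int, offset: int) -> int:
    """Translate selector:offset to a linear address (assumption: GDT only)."""
    if selector & 0x4:  # table indicator bit: the selector refers to the LDT
        raise NotImplementedError("LDT selectors are not handled in this sketch")
    index = selector >> 3  # bits 0-1 hold the RPL, bit 2 the table indicator
    seg = decode_descriptor(gdt[index * 8:index * 8 + 8])
    if offset > seg.limit:
        raise ValueError("offset exceeds the segment limit")
    return (seg.base + offset) & 0xFFFFFFFF


# Example GDT: null descriptor, a flat segment, and a segment based at 0x12345.
gdt = (
    bytes(8)
    + struct.pack("<Q", 0x00CF92000000FFFF)
    + struct.pack("<HHBBBB", 0xFFFF, 0x2345, 0x01, 0x92, 0x40, 0x00)
)
flat = linear_address(gdt, 0x08, 0x1000)     # flat segment: linear == offset
shifted = linear_address(gdt, 0x10, 0x0010)  # non-zero base: linear != offset
```

With the flat descriptor the offset and the linear address coincide, which is why segmentation is invisible on modern operating systems. With the second descriptor the same kind of operand lands at base + offset instead.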

Modern operating systems fill the GDT with segments that cover the entire 32-bit linear address space. That is, they have a base of 0 and a limit of 0xFFFFFFFF.

However, not everything one would like to reverse is an application running on a modern operating system.

Feature request

It would be fantastic if Binary Ninja supported segments (or, more generically, different address spaces, which appear to be common in GPU architectures).

A single BinaryView could be loaded with all of the segments and individual functions or perhaps instructions could maintain the current value of segment registers. This would enable references to data to actually produce a reference to the correct data segment rather than to code at the same address.

It would also be great if segments could be marked 16, 32, or 64-bit. (The 64-bit segments loaded into cs, ds, es, and ss must have a base of 0 but fs and gs are allowed to have arbitrary bases, although the limit isn't checked on access.)

Workarounds that don't work (well)

Neither of the two workarounds that I have tried or that have been suggested works:

Create multiple BinaryViews. This is the approach taken by the NES plugin. It should be obvious from examining the code that this approach is really suboptimal for the NES and totally unworkable for x86.

(In particular, it requires creating multiple essentially identical BinaryView subclasses, one per bank in the NES ROM, each of which has to be registered and has to match against the file. Doing the same for x86 means creating a bunch of identical BinaryView subclasses in order to match the possibly dozens of segments. A better approach would be for a single loader to create the appropriate BinaryViews for a file, but this isn't supported.)

For the x86, this approach doesn't work because there's no way for instructions which live in one of the code segments to reference data that live in one of the data segments.

Load all of the code at the appropriate linear address. This approach doesn't work because both code and data addresses are relative to the appropriate segments.

Most references to code are eip-relative (e.g., most jmp and call instructions), so it mostly doesn't matter where code is loaded. But it's not uncommon for the address of a function to be loaded into a register, say ebx, so that multiple calls to the function become call ebx. Since this would be a segment-relative address rather than an eip-relative one, the reference would be wrong.

All data references in this case would be relative to address 0. Loading the data segment at address 0 would work for a single data segment; however, multiple data segments are common and thus this doesn't work.
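
The problem with the second workaround can be illustrated with a small Python sketch. The segment names and base addresses below are made up for the example. The point is only that one offset denotes a different linear address in each segment, and none of them matches what a flat view assumes.

```python
# Hypothetical layout: one code segment and two data segments.
SEGMENT_BASES = {
    "code": 0x0001_0000,
    "data_a": 0x0004_2000,
    "data_b": 0x0007_8010,
}


def resolve(segment: str, offset: int) -> int:
    """Linear address of segment:offset."""
    return (SEGMENT_BASES[segment] + offset) & 0xFFFFFFFF


def resolve_flat(offset: int) -> int:
    """What a flat view assumes: the offset already is the linear address."""
    return offset


offset = 0x0000_0100  # the operand of, say, mov eax, [0x100]
targets = {name: resolve(name, offset) for name in SEGMENT_BASES}
flat_target = resolve_flat(offset)
```

Here targets holds three distinct linear addresses for the same operand, and flat_target matches none of them, so a reference created from the raw operand points at the wrong bytes no matter which single segment is loaded at address 0.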

Comparison to IDA

IDA has limited support for segments. Its support for real-mode segments is significantly better than its support for protected-mode segments. In particular, IDA doesn't allow segments with a base address that's not a multiple of 16. Protected-mode segments have no alignment requirements and real code uses arbitrary segment base addresses.
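
The alignment restriction follows from real-mode arithmetic, where a segment value is shifted left by four bits to form the base. The short Python sketch below (with hypothetical helper names) shows why a protected-mode base such as 0x12345 has no real-mode equivalent.

```python
def real_mode_base(segment_value: int) -> int:
    """Base address implied by a 16-bit real-mode segment value."""
    return (segment_value & 0xFFFF) << 4


def expressible_in_real_mode(base: int) -> bool:
    """True if some real-mode segment value produces exactly this base."""
    return base % 16 == 0 and (base >> 4) <= 0xFFFF


aligned = expressible_in_real_mode(real_mode_base(0x1234))  # base 0x12340
unaligned = expressible_in_real_mode(0x12345)               # arbitrary base
```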

IDA allows the user to specify the value of the segment registers at any point in the program and, if you've correctly set up segment selectors and segments (and this is nontrivial to get right), references point to the appropriate location.

IDA also allows the user to specify that a pointer in data is relative to a particular segment.

Finally, IDA has a data-flow analysis that allows it to notice how the segment registers change. E.g., the x86 string instructions use es as the destination segment selector so it's common for es to be pushed to the stack and then set to the value of ds at the beginning of a function using string instructions. The old value of es is popped at the end. IDA handles that pretty well.
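
As a rough illustration of the kind of tracking described above, here is a toy Python sketch that follows segment register values through a straight-line instruction sequence. It is not how IDA implements its analysis. The tuple-based instruction format and the function name are assumptions made for the example, and only push, pop, and mov into a segment register are modeled.

```python
from typing import Dict, List, Optional, Sequence, Tuple

SEGMENT_REGS = ("cs", "ds", "es", "fs", "gs", "ss")
SegState = Dict[str, Optional[int]]  # None means "value unknown"


def track_segment_registers(
    instrs: Sequence[Tuple[str, ...]], initial: Dict[str, int]
) -> List[SegState]:
    """Return the segment register state after each instruction."""
    regs: SegState = {r: initial.get(r) for r in SEGMENT_REGS}
    stack: List[Optional[int]] = []
    states: List[SegState] = []
    for ins in instrs:
        op, operands = ins[0], ins[1:]
        if op == "push":
            # Values of non-segment registers are not tracked, so they push None.
            stack.append(regs.get(operands[0]))
        elif op == "pop":
            value = stack.pop() if stack else None
            if operands[0] in regs:
                regs[operands[0]] = value
        elif op == "mov" and operands[0] in regs:
            # A general-purpose source is not tracked: the result is unknown.
            regs[operands[0]] = None
        states.append(dict(regs))
    return states


# The common pattern: save es, copy ds into es, use a string instruction,
# then restore es.
function_body = [
    ("push", "es"),
    ("push", "ds"),
    ("pop", "es"),
    ("rep movsd",),
    ("pop", "es"),
]
states = track_segment_registers(function_body, {"ds": 0x10, "es": 0x18})
```

At the rep movsd instruction the sketch reports es equal to the initial ds selector, and after the final pop es is back to its original value. A real analysis would also need to handle branches and unknown stack contents.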

Generality

A slightly more general approach than doing exactly what x86 is doing is to support the notion of an address space and allow references to different address spaces. Harvard architectures (e.g., AVR) have separate code and data spaces which are accessible via different instructions. Supporting them in Binary Ninja would need something along these lines.

I would be inclined to model this as a single unified address space as well as a collection of (possibly overlapping) address spaces that refer to a contiguous region of the unified address space.

For x86, most programs would only ever need the unified address space. But for those that use segments, the linear address space would be the unified one and each of the code and data segments could correspond to the other address spaces referencing their portion of the linear address space.
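
A minimal Python sketch of this model might look like the following. The class names AddressSpace and UnifiedMemory are hypothetical and are not part of the Binary Ninja API. Each address space is simply a named, possibly overlapping window onto the unified space.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AddressSpace:
    name: str
    base: int  # start of the window in the unified address space
    size: int  # length of the window in bytes

    def to_unified(self, offset: int) -> int:
        """Map an offset within this space to a unified address."""
        if not 0 <= offset < self.size:
            raise ValueError(f"offset {offset:#x} is outside space {self.name!r}")
        return self.base + offset


class UnifiedMemory:
    def __init__(self, data: bytes) -> None:
        self.data = data
        self.spaces: Dict[str, AddressSpace] = {}

    def add_space(self, space: AddressSpace) -> None:
        if space.base < 0 or space.base + space.size > len(self.data):
            raise ValueError(f"space {space.name!r} does not fit in unified memory")
        self.spaces[space.name] = space

    def read(self, space_name: str, offset: int, length: int) -> bytes:
        """Read length bytes (at least one) at space_name:offset."""
        space = self.spaces[space_name]
        start = space.to_unified(offset)
        space.to_unified(offset + length - 1)  # the last byte must be inside too
        return self.data[start:start + length]


# Two overlapping windows onto a 256-byte unified space.
memory = UnifiedMemory(bytes(range(256)))
memory.add_space(AddressSpace("code", base=0x00, size=0x80))
memory.add_space(AddressSpace("data", base=0x40, size=0x80))
from_code = memory.read("code", 0x40, 2)
from_data = memory.read("data", 0x00, 2)
```

In this sketch code:0x40 and data:0x00 name the same unified bytes, while data:0x40 names different ones. That distinction is exactly what a reference needs to carry for it to point at the right location.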

Vector35 / binaryninja-api

Feature request: Support x86 protected-mode segments #936
