Closed andreafioraldi closed 3 years ago
Hi, PE executables are definitely something we'd like to support. The main requirements would be:
We figured out how to do this for ELF/Linux, but have not studied Windows yet.
With LIEF you can link dynamic libraries to a PE, and IIRC you can also set that its initializer (DllMain) has to be the first to run. Always with LIEF you can easily add sections to binaries, so maybe you can just have the trampolines in memory and relocate to the right location before start.
Thanks, I will need to study it.
Here's an overview on Windows that may be a helpful reference.
The PE header has a field called AddressOfEntryPoint which specifies the relative virtual address (RVA) of the entry point. RVAs are virtual addresses relative to the base address of the module. As such, RVAs are not affected by ASLR, since you can load the module to any address you like and the RVA will be relative to that. The section table of the PE maps file data (locations specified as linear offsets into the file, referred to as "raw address") into virtual memory (positions specified as RVAs).
Generally speaking, the PE starts with structures that are accessed as linear offsets from the beginning of the file (e.g. DOS header, NT/PE headers, data directories, section headers), and those structures use RVAs to point to other (usually optional) structures that are inside sections and therefore get mapped into memory, e.g. directories (import, export, exceptions, etc.) and other metadata.
Statically adding a single additional section to the PE header usually does not require a complete rebuild of the PE header structure, or any re-computation of offsets. This is because there is almost always a gap between the end of the section table and the start of the first section's data, with more than enough space to inject another section. The gap occurs because raw addresses for sections are aligned to 0x200 (this is technically arbitrary and specified by the FileAlignment header field, but 0x200 is the smallest possible value these days), and the section table is the last part of the PE header that is accessed as a linear offset in the file, rather than through a section. In most executables the first section's data will be at 0x400, and the end of the section table is somewhere around 0x2E0 to 0x300, leaving upwards of 200 bytes to play with. Each section table entry is 40 bytes long, and the end of the section table is indicated by an all-zero entry, so as long as the last entry in the PE's section table ends at least 80 bytes before the first section's contents, you can do an in-place section injection without needing to change anything else.
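As a rough illustration of the gap check described above, here is a minimal Python sketch that measures the space between the end of the section table and the first section's raw data. The function names `section_table_slack` and `can_inject_in_place` are made up for this example, and it assumes a well-formed PE held entirely in memory.

```python
import struct

SECTION_HEADER_SIZE = 40  # bytes per section table entry


def section_table_slack(pe: bytes) -> int:
    """Bytes free between the end of the section table and the first section's raw data."""
    e_lfanew = struct.unpack_from("<I", pe, 0x3C)[0]
    if pe[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    file_header = e_lfanew + 4
    number_of_sections = struct.unpack_from("<H", pe, file_header + 2)[0]
    size_of_optional_header = struct.unpack_from("<H", pe, file_header + 16)[0]
    # The section table follows the 20-byte file header and the optional header.
    table = file_header + 20 + size_of_optional_header
    table_end = table + number_of_sections * SECTION_HEADER_SIZE
    raw_starts = []
    for i in range(number_of_sections):
        entry = table + i * SECTION_HEADER_SIZE
        raw_size, raw_address = struct.unpack_from("<II", pe, entry + 16)
        if raw_size:  # sections without file data have no meaningful raw address
            raw_starts.append(raw_address)
    first_data = min(raw_starts) if raw_starts else len(pe)
    return first_data - table_end


def can_inject_in_place(pe: bytes, needed: int = 2 * SECTION_HEADER_SIZE) -> bool:
    """True if there is room for one new entry plus the 80-byte margin mentioned above."""
    return section_table_slack(pe) >= needed
```

If `can_inject_in_place` returns False, the section data would have to be shifted along before a new entry is written.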
If you do need to move the sections to make space for an additional entry in the section table, all you should need to do is shift all the data in the file, after the section table ends, along by the amount specified in the FileAlignment field (usually 0x200) and then increment all the raw addresses in the section table by that amount. Everything that is contained inside section data is referred to via its RVA, so moving the raw address does not affect anything. Even the relocation tables are done via RVAs.
The new section for the loader stub should be marked as initialised (i.e. it contains data that should be mapped into virtual memory), readable, and executable. You do not have to set the IMAGE_SCN_CNT_CODE flag (i.e. "contains code") on the section - while the .text section usually has this flag set, it is a legacy x86 thing from when segments were important, and the only thing that matters is that you set the section as executable.
Your loader stub can find the base address of the module using the Process Environment Block (PEB). For 64-bit processes this is accessible via the GS segment register. The segment register points to the Thread Environment Block (TEB), which has a pointer to the PEB at offset 0x60, which you can read as mov rax, gs:[0x60]. Offset 0x10 of the PEB contains the image base address for the main executable module. This points to the PE header in memory. You can carry on from there much in the same manner as you would with an ELF executable.
If you need to call APIs from your loader stub, you might think of walking the import directory of the PE to find the import you want. However, it is not guaranteed that the PE you're injecting into already imports the API that you want. There are two reliable options here: statically inject the imports you need, or walk the loaded modules list at runtime. Statically injecting requires modifying the import table, which may require re-building the entire table. The import table is inside a section, so if you need to resize it then it might be better to inject a new section, re-build the import table into that new section, and finally update the data directory entry to point to that new import table. It's possible to do this without breaking things, but it's certainly a bit more involved. A runtime approach that avoids this problem is to use the PEB_LDR_DATA pointer from the PEB to get at one of the doubly-linked lists of modules that have been loaded into memory - InLoadOrderLinks or InMemoryOrderLinks. Each entry in the list is a LDR_DATA_TABLE_ENTRY structure which specifies the base address of the module, its size, and its name. This allows you to find the module you want (e.g. kernel32.dll) and the address of its PE header in memory, then walk that module's export table to find the address of the API you want. Although the PEB and LDR structures are only partially documented, the fields used here have kept the same layout across many Windows releases. There are plenty of shellcode examples out there that use this approach, which may be helpful to you.
Now let's talk about mappings.
The Windows API for dynamically mapping files into virtual memory is largely based around three core APIs: CreateFile, CreateFileMapping, and MapViewOfFile. The first is the standard API for opening handles to files on the filesystem. The CreateFileMapping API creates a mapping object, optionally tied to a file handle if you want to have the contents of the mapping backed by a file. Mapping objects are created at the system level and can be shared across processes, allowing for shared pages / CoW as well as IPC via shared memory. Mapping objects can be named for convenient access across processes. Finally, MapViewOfFile allows you to take a mapping object and load it into virtual memory so it can be accessed.
In short, to map the contents of a file into virtual memory, you call CreateFile to open a handle to the file, then CreateFileMapping to create a mapping object associated with that file, and finally MapViewOfFile to map the contents of that mapping object into virtual memory. Each file you map requires two handles: one to the file, one to the mapping. You've got a hard limit of 16M handles per process.
However, the handle limit should not practically apply here. The MapViewOfFileEx API has parameters for the file offset that you want to map, the number of bytes you wish to map, the virtual base address you would like to map it at, and the page access flags you want to set for the region. This means you can load all of your trampolines into a single file (e.g. append them to the PE itself), open a handle to the file, create a mapping from that handle, and then create as many views of that mapping as you like without adding additional handles to the process handle table.
There are some restrictions here. The file offset you provide to MapViewOfFile must be a multiple of the memory allocation granularity (64 KiB, i.e. 0x10000, on current Windows systems; the exact value is reported in the dwAllocationGranularity field returned by GetSystemInfo), and the same applies to the base address of the view. It isn't clear to me what the exact requirements are for the approach described in the paper, but beyond this coarser alignment I don't see why MapViewOfFile should be any different from mmap here.
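To make the alignment restriction concrete, here is a small Python sketch using the standard `mmap` module, which on Windows is built on the same file-mapping APIs and exposes the granularity as `mmap.ALLOCATIONGRANULARITY`. The helper name `read_via_view` is invented for this example, and Python creates one mapping object per `mmap` call, so this only illustrates the offset arithmetic, not the single-mapping, many-views layout described above.

```python
import mmap


def read_via_view(path: str, offset: int, length: int) -> bytes:
    """Read `length` bytes at `offset` through a view that starts on an aligned offset."""
    granularity = mmap.ALLOCATIONGRANULARITY
    # Round the requested offset down to the allocation granularity...
    aligned = offset - (offset % granularity)
    # ...and remember how far into the view the wanted bytes start.
    delta = offset - aligned
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), delta + length,
                       access=mmap.ACCESS_READ, offset=aligned) as view:
            return view[delta:delta + length]
```

The same rounding applies to the requested base address of a view, which is why data that must land at an arbitrary address has to be laid out in the file with the granularity in mind.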
If you've got more questions I'd be happy to help where I can. I don't know enough about e9patch's internals to write a PE/Windows port myself, but I should know enough about Windows internals to help guide you through snags and implementation details.
@gsuberland Thank you for the very detailed response, I think it will be very helpful. Most of the steps are similar to what is already implemented for ELF/Linux, basically:
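Inject an extra segment into the binary which contains a loader. For ELF/Linux this currently uses the well-known PT_NOTE trick. For PE/Windows, it seems there would be enough space to extend the section table.
Change the entry point so that the loader is executed first. For ELF/Linux this is tricky since there are exe/dso cases to consider. Also, there are some dynamic linker extensions that can call the code before the entry point. However, this should all be figured out and the current implementation seems stable. For PE/Windows, I assume the entry point can easily be changed to point to the loader?
The loader must find the currently executing binary file, so additional data (such as trampolines) can be mapped in. For ELF/Linux it currently uses the /proc/ directory to do this. I think it may also be possible to use the DYNAMIC segment for this, provided the binary is dynamically linked, but this is unimplemented. For Windows, maybe it is possible to use the PEB to find the binary path?
The loader must also make several system calls. E.g., for ELF/Linux, open, mmap, etc., are used. For PE/Windows, the Windows equivalents will need to be used (e.g., CreateFile, MapViewOfFile, etc.). However, it is not clear to me how to make these system calls. Under Linux, the syscall interface is static, so it is not a problem. Under Windows, it is not static, and it seems the "correct" way to do it is via kernel32.dll calls (or even the ntdll.dll equivalents). So how can these dlls and corresponding functions be found by the loader?
The loader maps in all the necessary data, then jumps to the original entry point. For ELF/Linux the main problem is the vm.max_map_count limit. I am not sure if there are similar limits for PE/Windows. The HANDLE limit of 16M should not be a problem.
To implement this, the e9elf.cpp file would need to be generalized or reimplemented for PE executables. The E9Tool frontend would also need to be extended.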
Grab a drink, this is a long one.
Inject an extra segment into the binary which contains a loader. For ELF/Linux this currently uses the well-known PT_NOTE trick. For PE/Windows, it seems there would be enough space to extend the section table.
Yup, or even if you do need to move the sections, you just patch the raw addresses on the section table, since everything inside the sections is referenced by RVA, which is not affected by the raw address (offset in the file).
Change the entry point so that the loader is executed first. For ELF/Linux this is tricky since there are exe/dso cases to consider. Also, there are some dynamic linker extensions that can call the code before the entry point. However, this should all be figured out and the current implementation seems stable. For PE/Windows, I assume the entry point can easily be changed to point to the loader?
All you have to do is change the AddressOfEntryPoint field in the PE header to point to the RVA of the entry point, which is just the RVA of the new section plus whatever offset the entry point is in that section. So, for example, if you added a new section and gave it an RVA of 0x280000, and the entry point was 0x40 bytes into that section, your AddressOfEntryPoint field just gets set to 0x280040.
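A minimal Python sketch of that patch, assuming the whole PE is held in a mutable buffer. The function name `redirect_entry_point` is made up for this example; the field offset of 16 bytes into the optional header follows from the standard header layout. It also returns the previous value, since a loader stub will presumably need the original entry point to jump back to.

```python
import struct


def redirect_entry_point(pe: bytearray, section_rva: int, offset_in_section: int):
    """Point AddressOfEntryPoint at the stub and return (old_rva, new_rva)."""
    e_lfanew = struct.unpack_from("<I", pe, 0x3C)[0]
    # Skip the 4-byte signature and the 20-byte file header to reach the optional header.
    optional_header = e_lfanew + 4 + 20
    # AddressOfEntryPoint sits 16 bytes into the optional header (PE32 and PE32+ alike).
    field = optional_header + 16
    old_rva = struct.unpack_from("<I", pe, field)[0]
    new_rva = section_rva + offset_in_section
    struct.pack_into("<I", pe, field, new_rva)
    return old_rva, new_rva
```

With the numbers from the example above, `redirect_entry_point(pe, 0x280000, 0x40)` writes 0x280040 into the field.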
The loader must find the currently executing binary file, so additional data (such as trampolines) can be mapped in. For ELF/Linux it currently uses the /proc/ directory to do this. I think it may also be possible to use the DYNAMIC segment for this, provided the binary is dynamically linked, but this is unimplemented. For Windows, maybe it is possible to use the PEB to find the binary path?
You can get the binary path via the PEB. What you do is you search through PEB->Ldr->InLoadOrderModuleList for an LDR_DATA_TABLE_ENTRY structure that has a DllBase field matching PEB->ImageBaseAddress. Once you've found that, you can read FullDllName. Ignore the fact that these fields mention DLLs - it just means "modules" which can be anything executable, including the main executable.
Another (possibly easier) approach is to just call GetModuleFileNameW, which will give you the path of the main module for the process if you pass NULL to the hModule parameter.
The loader must also make several system calls. E.g., for ELF/Linux, open, mmap, etc., are used. For PE/Windows, the Windows equivalents will need to be used (e.g., CreateFile, MapViewOfFile, etc.). However, it is not clear to me how to make these system calls. Under Linux, the syscall interface is static, so it is not a problem. Under Windows, it is not static, and it seems the "correct" way to do it is via kernel32.dll calls (or even the ntdll.dll equivalents). So how can these dlls and corresponding functions be found by the loader?
Probably makes sense for me to explain the whole approach here 'cos it requires some knowledge of PEs. All screenshots here are from CFF Explorer, which is a PE editor tool. It's kinda old but it gets the job done. I'm also just looking at a 64-bit executable since 32-bit structures are slightly different.
PE files start with an old 16-bit DOS header. This header is almost entirely ignored on modern Windows, so the only fields that typically matter are e_magic (which must be 'MZ') and e_lfanew, which points to the offset of the NT header.
The e_lfanew field is always at 0x3C. It tells you the offset of the NT headers. Here's a tree view of the overall structure just to keep the overall layout in your head.
The NT header only actually has one field of its own, which is Signature. This is a 32-bit field set to PE\0\0, i.e. 0x00004550.
You can see that its offset is at 0x108, which is where e_lfanew said it was. You might notice that there's a bit of a gap between the end of the DOS header at 0x40 and the start of the NT header at 0x108.
What sits in that space is the DOS stub. You know the old "This program cannot be run in DOS mode"? That's actually a 16-bit x86 DOS program, stored in the file immediately after the e_lfanew field, but before the NT header. If you try to run a modern Windows PE under DOS, it runs that program instead of the PE. Since the e_lfanew field is 32-bit, you can actually embed a complete 16-bit DOS program in there for cross-compatibility!
For fun, here's the stub disassembled:
If you're really eagle-eyed, you might have noticed that the DOS stub plus the string it references still doesn't make up the full size of the gap between 0x40 and 0x108. And you'd be right. This executable also happens to contain something called a RICH header, which is a kinda weird self-contained debug blob. Has its uses, but irrelevant here.
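The walk from the DOS header to the NT signature described so far can be sketched in a few lines of Python. The function name `find_nt_headers` is just for illustration, and the sketch assumes the file has been read fully into memory.

```python
import struct


def find_nt_headers(pe: bytes) -> int:
    """Return the file offset of the NT headers after checking both magic values."""
    if pe[:2] != b"MZ":  # e_magic at offset 0
        raise ValueError("not an MZ executable")
    # e_lfanew is a 32-bit little-endian field at offset 0x3C.
    e_lfanew = struct.unpack_from("<I", pe, 0x3C)[0]
    # The signature bytes 'P', 'E', 0, 0 read as the little-endian value 0x00004550.
    signature = struct.unpack_from("<I", pe, e_lfanew)[0]
    if signature != 0x00004550:
        raise ValueError("missing PE signature")
    return e_lfanew
```

Everything between offset 0x40 and the returned offset (the DOS stub and any RICH header) is skipped rather than parsed.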
Anyway, back to the main topic. After the NT header signature comes the File Header and the Optional Header. These are sequential.
The File Header contains some important fields.
The first is the Machine field, which tells you what machine this was built for. 0x8664 means x86_64, and 0x14C means x86_32. There are a bunch more defined values but unless you're planning on working with Itanium or ARM PEs I wouldn't worry about it.
Next is the NumberOfSections field. This tells you how many sections there are in the section table. We'll come to that later.
TimeDateStamp and the symbol table fields can be ignored. SizeOfOptionalHeader is the next of importance - it tells us how big the next structure is going to be. It should always be 0xF0 on a 64-bit executable.
Finally there's Characteristics. This is a bitfield that specifies various flags. The flags in here should be irrelevant for your use-case, but flag 0x20 is "image can handle >2GB address space" which, if you ever do 32-bit stuff, will be important because it marks the image as large-address-aware, i.e. able to have a virtual address space up to 3GB (or sometimes 4GB) in size per 32-bit process. If you're just doing 64-bit, ignore this.
The optional header is where most of the magic happens. It's different between 32-bit and 64-bit programs. I'll focus only on 64-bit.
The first field here tells you which structure to use. 0x020B is PE64, 0x010B is PE32.
The SizeOfCode and SizeOfInitializedData fields are the sum of the sizes of the sections that have the "Contains code" and "Contains initialised data" flags respectively. You shouldn't need to update these since if you inject a new section you don't actually have to apply these flags to make it work. I'll get into that later.
AddressOfEntryPoint is the RVA of the entry point. Notice that CFF has marked it as ".text" next to it, indicating that this RVA points to something in the .text section. This field is what you change to make the PE start executing your own stub.
BaseOfCode is the RVA of the code section. You shouldn't need to touch this, but it basically just points to the virtual address of the .text section.
ImageBase is the "preferred" base address of the executable, i.e. if ASLR was disabled the image would be loaded at that virtual address assuming that nothing else was already loaded there. This is largely irrelevant these days, although it does have one weird implementation detail - in order to support full "high entropy" ASLR the specified image base must be in the upper side of the 64-bit virtual address space, i.e. 0x100000000 or higher. You can generally ignore this.
SectionAlignment is the alignment of the virtual address space for sections. This is usually set to 0x1000, i.e. one page. You are not allowed to specify section start addresses or sizes with smaller granularities than the alignment. So all sections' virtual addresses must start at a multiple of 0x1000.
FileAlignment is the alignment of the section data in the file. Each section has a "Raw Address" and "Raw Size" field that specifies where its contents are in the PE file. This is how the loader takes code and data from the PE file and puts it in memory ready to be used. The FileAlignment field must be at least 0x200 on Win10, and it is usually 0x200 anyway. This means that the data for each section in the file must start at an offset that is a multiple of the file alignment, and its size must also be a multiple of the file alignment. Don't worry if this is a bit confusing, you'll see this more clearly later.
SizeOfImage is supposed to be the size of the image, but it's calculated in a weird way. Basically take the highest virtual address of a section, add the virtual size, and round up to a multiple of SectionAlignment. Here it's 0x2BF000 + 0xA7D7C = 0x366D7C, which gets you 0x367000. You'll need to update this if you inject a new section.
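The SizeOfImage calculation is easy to get wrong by one page, so here is a small Python sketch of it. The helper names and the `(virtual_address, virtual_size)` tuple representation are assumptions made for this example.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def size_of_image(sections, section_alignment: int = 0x1000) -> int:
    """Compute SizeOfImage from (virtual_address, virtual_size) pairs."""
    # The image ends where the highest-placed section ends in virtual memory.
    image_end = max(va + vsize for va, vsize in sections)
    return align_up(image_end, section_alignment)


# The last section from the example above: 0x2BF000 + 0xA7D7C rounds up to 0x367000.
assert size_of_image([(0x1000, 0x500), (0x2BF000, 0xA7D7C)]) == 0x367000
```

After injecting a section, appending its `(virtual_address, virtual_size)` pair to the list and recomputing gives the value to write back into the header.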
SizeOfHeaders is the size of all image headers rounded up to FileAlignment. Almost always 0x400 in normal executables, but may be 0x200 in packed executables that omit some data directories and have only one section.
CheckSum is generally ignored. Technically if it is set to a non-zero value it should be correct, but in practice it doesn't matter. Good practice to zero it or set it correctly if you're modifying a PE, but not strictly necessary.
Subsystem tells you which subsystem loads the PE. The two you'll run into are 0x0002, which is Windows GUI, and 0x0003, which is Windows Console. You might also see 0x0001, which is Native, i.e. a kernel driver.
DllCharacteristics tells you a bunch of flags about the executable. 0x40 is dynamic base (ASLR supported), 0x80 is the force integrity flag (related to signing policies), 0x100 is NX compatible (DEP), the rest are irrelevant.
Finally you've got NumberOfRvaAndSizes. This tells you how many entries there are in the directory table. The directory table tells the loader where certain optional structures are in memory. The directory table covers things like imports, exports, exception tables, digital signatures, relocations, debug info, thread-local storage config, IATs, and .NET metadata for CLR executables.
The data directories table is an array of up to 15 entries, each containing an RVA and size field. The NumberOfRvaAndSizes field is the number of valid entries in the table, plus one null entry on the end. So for a full table (the norm) it's 16, or 0x10. Normally you won't see any other value than 0x10 in a non-packed executable.
The meaning of each directory is hard-coded by its index, i.e. export = 0, import = 1, resource = 2, exception = 3, etc.
The RVA is the virtual address, relative to the image base, of the location of the data for that directory. These match up with sections, i.e. every directory points to some address in a section, rather than to an offset in the file.
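As a rough sketch of how a directory entry is looked up by its hard-coded index, the Python below reads one (RVA, size) pair from a 64-bit PE. The function name `read_data_directory` is invented for this example; the offsets 108 and 112 are where NumberOfRvaAndSizes and the directory array sit inside the 64-bit optional header.

```python
import struct

EXPORT, IMPORT, RESOURCE, EXCEPTION = 0, 1, 2, 3  # directory indices named in the text


def read_data_directory(pe: bytes, index: int):
    """Return (rva, size) for one data directory of a 64-bit PE."""
    e_lfanew = struct.unpack_from("<I", pe, 0x3C)[0]
    optional_header = e_lfanew + 4 + 20
    magic = struct.unpack_from("<H", pe, optional_header)[0]
    if magic != 0x020B:
        raise ValueError("only the 64-bit optional header layout is handled here")
    count = struct.unpack_from("<I", pe, optional_header + 108)[0]
    if index >= count:
        raise IndexError("directory index beyond NumberOfRvaAndSizes")
    # Each entry is two 32-bit fields: the RVA followed by the size.
    return struct.unpack_from("<II", pe, optional_header + 112 + 8 * index)
```

The returned RVA still has to be translated through the section table before it can be used as a file offset.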
The ones you care about are the Export Directory, Import Directory and the Import Address Table (IAT) Directory. I'll describe these later since it makes more sense to look at sections first.
Immediately after the data directories you have the sections table.
Each entry in the table is called a section header and it has ten fields. A section lays out parts of program memory in virtual address space, telling the loader what page access flags to apply and what content to load.
The first field is the section name, which is an 8 byte string padded with nulls. These are effectively meaningless but the convention is to call the code section .text, the read-only data section .rdata, the read-write data section (for storing globals and setting their initial values) .data, any resources in .rsrc, and there are others for various other purposes. The names don't do anything, they're just for identification purposes.
Next are the Virtual Size and Virtual Address. These specify where the section should be mapped in memory, relative to the base address of the module, often referred to as an RVA. The virtual address must be aligned to SectionAlignment. The virtual size can be any number (unaligned), specifying exactly how many bytes from the file must be copied into the memory region. The allocated memory region itself will be rounded up to the nearest page boundary, so if you specify a size of 0x3E4C the section will be 0x4000 bytes in memory, but only the first 0x3E4C bytes will have data written into them.
After that you've got the Raw Size and Raw Address. These specify where the section's data is stored in the file. The raw address and raw size must be aligned to FileAlignment.
The other fields are unimportant, other than the Characteristics field which has flags about how the section works. The top nibble specifies the page protection flags: 0x2 for executable, 0x4 for readable, 0x8 for writable. There's also the "contains code" flag (0x20) and "contains initialised data" (0x40) flags at the bottom end of the field, which you shouldn't need to care about since they don't actually affect the functionality. The .text section usually has Characteristics value of 0x60000020, i.e. read+exec, contains code. The .rdata is readable, contains initialised data. The .data is read+write, contains initialised data.
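To make the flag values concrete, here is a short Python sketch that decodes a section's Characteristics value. The flag table only covers the bits discussed above, and the names `SECTION_FLAGS` and `describe_section` are made up for this example.

```python
SECTION_FLAGS = {
    0x00000020: "contains code",
    0x00000040: "contains initialised data",
    0x20000000: "execute",
    0x40000000: "read",
    0x80000000: "write",
}


def describe_section(characteristics: int):
    """List the names of the known flags set in a section's Characteristics field."""
    return [name for bit, name in SECTION_FLAGS.items() if characteristics & bit]


# A typical .text section is read+exec and contains code.
assert describe_section(0x60000020) == ["contains code", "execute", "read"]
# A loader stub section as suggested earlier: initialised, readable, executable.
assert describe_section(0x60000040) == ["contains initialised data", "execute", "read"]
```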
You can translate between an RVA and a file offset using the table. Given an RVA, you scan through the section table, find the section that contains that RVA (i.e. RVA >= Virtual Address && RVA < Virtual Address + Virtual Size). You then subtract the Virtual Address of the section from the RVA to get the offset of that address in the section, and add it to the Raw Address to get the file offset.
Here's a code example in Python:
    def rva_to_offset(rva):
        # `sections` is the parsed section table
        for section in sections:
            if section.VirtualAddress <= rva < section.VirtualAddress + section.VirtualSize:
                section_offset = rva - section.VirtualAddress
                return section.RawAddress + section_offset
        # the RVA is not backed by any section
        return None
You can do the inverse, too, i.e. go from an offset to RVA:
    def offset_to_rva(offset):
        # `sections` is the parsed section table
        for section in sections:
            if section.RawAddress <= offset < section.RawAddress + section.RawSize:
                section_offset = offset - section.RawAddress
                return section.VirtualAddress + section_offset
        # the offset is not inside any section's file data
        return None
You can also convert from a relative virtual address (RVA) to a virtual address (VA) by adding the PE's ImageBase field value to the RVA. So, for example, an RVA of 0x2000 would have a VA of 0x140002000, given the ImageBase of 0x140000000 shown in this particular PE. The VA is useful because it specifies the address that would be used in the real running program if the module was loaded at its preferred base address.
That's pretty much it for sections.
The export directory is the really important one for what you want to do here, i.e. make API calls from shellcode.
Normally on Windows your application imports the APIs for you so you don't have to worry about this. This is done via the import table and the IAT. The import table has entries for each DLL you want to import APIs from, then a set of imports for each of those DLLs. When the program starts, the Windows image loader loads the libraries into the program memory space, then finds the required imports and writes the addresses of the API functions into the import address table (IAT).
You can also dynamically load libraries with LoadLibrary, and then find APIs by name in a loaded library with GetProcAddress. But the thing is, you need to know where those APIs are in the first place to make those API calls to find other API addresses - a catch 22. However, if you can find those two APIs through some other method, you can use them to easily and reliably load any API you like. Both of those two APIs are in kernel32.dll, to make things a bit easier.
Instead of trying to find the APIs you want via the import directory of the program executable, you can instead find them via the export directory of the DLL that contains the exports you want. As long as you know where that module is in memory, you can find the export directory, and find the APIs!
The export directory is a little simpler than the import directory. It starts with a header:
Name is the RVA of a null-terminated string that specifies the name of the DLL. If you convert the RVA to an offset you'll find the string in the file there. In this case I'm using kernel32.dll as an example:
NumberOfFunctions is the number of exported functions, unsurprisingly.
NumberOfNames is the number of names in the export name table. This can be different to the number of exported functions, because some functions can be exported by ordinal (index in the table) rather than by name.
The export table is effectively three arrays. One for the RVAs of the functions (i.e. RVAs that point to the first instruction of the functions), one for the RVAs of the function names, and one for the name ordinals, which are 16-bit indices that link each name to its entry in the function array.
AddressOfFunctions is the address of the function RVA list, which is basically an array of RVAs (kinda like pointers, here) to the function implementations.
AddressOfNames is the same but for RVAs to the name strings. Each is null-terminated.
AddressOfNameOrdinals is another array that specifies, for each name, the index of the matching entry in the function array. Each is a 16-bit value.
So basically you've got:
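    uint32_t Functions[NumberOfFunctions];  // RVAs of the exported functions
    uint32_t Names[NumberOfNames];          // RVAs of the null-terminated name strings
    uint16_t NameOrdinals[NumberOfNames];   // index into Functions for each name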
The Names and NameOrdinals arrays are parallel: the name at index i belongs to the function at Functions[NameOrdinals[i]]. Functions with no name are importable by their ordinal (index into the array) - not to be confused with a name ordinal, which is different. Don't worry about ordinals too much, they don't come up very often and you don't really care about them here.
So to find a function by its name in the export table, you look at the AddressOfNames field to get the RVA of the names array, then use that to loop through each of the name RVAs to find the one that matches the name of the API you want. That gives you the index into the name ordinals array, and the value stored there is the index into the functions array.
For example:
    void* getFunction(const char* functionName)
    {
        for (uint32_t n = 0; n < exportDirectory->NumberOfNames; n++)
        {
            if (strncmp(functionName, exportDirectory->Names[n], peHeader->SizeOfImage) == 0)
            {
                // translate the name index into a function index via the name ordinals
                return exportDirectory->Functions[exportDirectory->NameOrdinals[n]];
            }
        }
        return NULL;
    }
Keep in mind that this gives you the RVA, so if you want the virtual address you need to add the base address of the module.
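The name lookup can be modelled with plain Python lists to show how the three arrays relate. This is only a toy model under the assumption that the arrays have already been read out of the export directory; `resolve_export` and its parameters are illustrative names. The name index is translated through the name-ordinal array before indexing the function RVAs, and the module base is added at the end to turn the RVA into a virtual address.

```python
def resolve_export(names, name_ordinals, function_rvas, wanted, image_base=0):
    """Return the address of the export called `wanted`, or None if it is absent."""
    for i, name in enumerate(names):
        if name == wanted:
            # names[i] pairs with name_ordinals[i], which indexes function_rvas.
            rva = function_rvas[name_ordinals[i]]
            return image_base + rva
    return None


# Toy export table where name order and function order differ.
names = ["Alpha", "Beta", "Gamma"]
name_ordinals = [2, 0, 1]
function_rvas = [0x1100, 0x1200, 0x1300]
assert resolve_export(names, name_ordinals, function_rvas, "Alpha") == 0x1300
```

Indexing `function_rvas` directly with the name index would return 0x1100 for "Alpha" here, which is the entry for a different function.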
Remember that you can find the address of the module you want using the PEB. So let's say you want to find LoadLibrary and GetProcAddress from kernel32.dll at runtime - here's the steps in pseudocode:
    // Assumes the full PEB, PEB_LDR_DATA and LDR_DATA_TABLE_ENTRY layouts are declared; the trimmed
    // definitions in winternl.h do not expose InLoadOrderModuleList or BaseDllName.
    PEB* peb = (PEB*)__readgsqword(0x60); // read GS:[0x60] to get PEB pointer
    PEB_LDR_DATA* ldr = peb->Ldr;
    // the list head lives inside PEB_LDR_DATA and is not itself a module entry
    LIST_ENTRY* head = &ldr->InLoadOrderModuleList;
    IMAGE_DOS_HEADER* kernel32_dos = NULL;
    for (LIST_ENTRY* node = head->Flink; node != NULL && node != head; node = node->Flink)
    {
        // InLoadOrderLinks is the first member, so the node pointer is also the entry pointer
        LDR_DATA_TABLE_ENTRY* entry = (LDR_DATA_TABLE_ENTRY*)node;
        // UNICODE_STRING.Length is a byte count, so convert it to characters
        size_t length = entry->BaseDllName.Length / sizeof(wchar_t);
        wchar_t* dllNameStr = entry->BaseDllName.Buffer;
        // case-insensitive wide string comparison; the length check rejects shorter prefixes
        if (length == 12 && _wcsnicmp(L"kernel32.dll", dllNameStr, length) == 0)
        {
            // this is kernel32
            kernel32_dos = (IMAGE_DOS_HEADER*)entry->DllBase;
            break;
        }
    }
    // did we find kernel32?
    if (!kernel32_dos)
        return -1;
    uint8_t* kernel32_base = (uint8_t*)kernel32_dos;
    // find the NT header at the offset specified by e_lfanew
    IMAGE_NT_HEADERS64* ntHeader = (IMAGE_NT_HEADERS64*)(
        kernel32_base + kernel32_dos->e_lfanew
    );
    // get the file & PE (optional) headers
    IMAGE_FILE_HEADER* fileHeader = &ntHeader->FileHeader;
    IMAGE_OPTIONAL_HEADER64* peHeader = &ntHeader->OptionalHeader;
    // in the Windows SDK structs the data directories are the last member of the optional header
    IMAGE_DATA_DIRECTORY* directories = peHeader->DataDirectory;
    // the section table starts right after the optional header (not needed for the export lookup)
    IMAGE_SECTION_HEADER* sections = (IMAGE_SECTION_HEADER*)(
        (uint8_t*)peHeader + fileHeader->SizeOfOptionalHeader
    );
    // get the virtual address of the export directory (directory index 0)
    IMAGE_EXPORT_DIRECTORY* exportDir = (IMAGE_EXPORT_DIRECTORY*)(
        kernel32_base + directories[IMAGE_DIRECTORY_ENTRY_EXPORT].VirtualAddress
    );
    // get the export arrays
    DWORD* nameRVAs = (DWORD*)(kernel32_base + exportDir->AddressOfNames);
    WORD* nameOrdinals = (WORD*)(kernel32_base + exportDir->AddressOfNameOrdinals);
    DWORD* functionRVAs = (DWORD*)(kernel32_base + exportDir->AddressOfFunctions);
    void* fnLoadLibrary = NULL;
    void* fnGetProcAddress = NULL;
    for (DWORD n = 0; n < exportDir->NumberOfNames; n++)
    {
        char* name = (char*)(kernel32_base + nameRVAs[n]);
        // the name index maps to a function index through the name ordinals array
        void* func = (void*)(kernel32_base + functionRVAs[nameOrdinals[n]]);
        // kernel32 exports LoadLibraryA (ANSI) and LoadLibraryW (wide), not a bare LoadLibrary
        if (strcmp("LoadLibraryA", name) == 0)
            fnLoadLibrary = func;
        if (strcmp("GetProcAddress", name) == 0)
            fnGetProcAddress = func;
        if (fnLoadLibrary != NULL && fnGetProcAddress != NULL)
            break;
    }
    // did we find the APIs?
    if (fnLoadLibrary == NULL || fnGetProcAddress == NULL)
        return -2;
    // ok, now you've got the address of LoadLibraryA and GetProcAddress and you can call them!
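Once you've got LoadLibrary and GetProcAddress you can just get any API you like, or load any DLL:
    // LoadLibraryA and GetProcAddress here stand for calls through the pointers resolved above
    HMODULE hKernel32 = LoadLibraryA("kernel32.dll");
    SOME_FUNCTION_TYPE fnOpenProcess = (SOME_FUNCTION_TYPE)GetProcAddress(hKernel32, "OpenProcess");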
So that's pretty much it!
The loader maps in all the necessary data, then jumps to the original entry point. For ELF/Linux the main problem is the vm.max_map_count limit. I am not sure if there are similar limits for PE/Windows. The HANDLE limit of 16M should not be a problem.
As far as I know there is no limit to the number of mapped sections, if you're doing the mapping dynamically, short of the number of pages you can have in virtual memory at least.
I know this was a long one but hopefully this gets you completely up to speed with PE headers. I recommend having a look through some of the guides online that explain the full format in more detail if you need reference. The MSDN docs are good too.
Let me know if anything was unclear! :)
I ended up turning the above into a blog post because I write far too much 😅
It's mostly what I said here but might be more easily readable as reference if you're working on stuff and want to look back through: https://codeinsecurity.wordpress.com/2021/08/18/a-primer-on-windows-pe-files-and-doing-api-calls-without-knowledge-of-memory-layout/
Thanks again for the detailed write up & I think you have answered most of my questions. I now think a Windows/PE port is feasible and this is something that the project needs to do sooner or later, so I think I will start looking into implementing this. I think however it would still be a lot of work so may take a while, especially since I need to split my time.
No problem at all. If you run into any problems or have questions, just let me know :)
I've built a prototype that injects code into existing PE executables by adding a new section. Some things I (re?)discovered:
[...] notepad.exe, tar.exe, etc.), so this trick works at least.
[...] MapViewOfFile. Apparently modern Windows supports the full 65535 sections (not yet tested).
[...] GetProcAddress from kernel32.dll seems to work fine, although I've only tested something simple (WriteConsole).
[...] STATUS_STACK_BUFFER_OVERRUN. This seems to stem from a __fastfail in ntdll.dll with error code 0xa, at least according to x64dbg. However, the error code 0xa seems undocumented.
I believe the STATUS_STACK_BUFFER_OVERRUN behaviour on ASLR binaries may be related to an export address table access filtering (EAF) exploit mitigation. This mitigation works by marking the export address table of kernel32, kernelbase, and ntdll as a guard page. When you try to access the page, it raises an exception. The EAF handler then handles that exception and checks if the address attempting to access these pages is part of a legitimate module. I would have thought that your code would pass the EAF test, but perhaps it is checking if the section being executed has the "Contains Code" flag set - I am unsure.
This only occurs if EAF is enabled for the process. You can check if a mitigation policy is set for a process (via IFEO) using the Get-ProcessMitigation -Name process.exe command in PowerShell, which will print nothing if there's no MitigationOptions field set in that process' IFEO, or a full policy readout if one is. The process can also opt in to mitigation options at startup by calling the SetProcessMitigationPolicy API. This might be set up by a TLS callback, which would execute before your entry point. If you can get the process to stay open a while (e.g. suspend it) you can use Process Hacker to check the mitigation options, too.
One way to see if it is related to exploit mitigations is to open Event Viewer and check the Operational and WHC logs under Applications and Services Logs -> Microsoft -> Windows -> Windows Defender. For EAF, the event IDs you're looking for are 13 through 18. There's a handy reference for all the exploit guard event IDs here.
If you can share your PoC for injection I can debug it and see if I can figure out what's going wrong.
Thanks for the additional info. I checked and EAF seems to already be disabled. I think the issue is not a big problem for the time being, since disabling ASLR is a workaround. In the long term ASLR should be supported if possible.
Aside from that, there is some good news and bad news.
The good news is that the Windows PE loader supports the full 65535 sections. This is ideal for E9Patch, which needs to create large numbers of sections for the trampoline code.
The bad news is that it is really, really slow. For a large number of sections (near the max), the PE can take over a minute to load under my Windows 10 virtual machine. It seems Windows is doing something inefficient, like explicitly copying these sections into memory ready for use, or something like that. I've tried tweaking things, like making the section's offset and address align with the 64KB virtual memory allocation granularity, but this does not seem to help.
An alternative is to load the trampoline code manually in the loader, which is what the Linux version does (since there is no easy way to add new PHDRs to an ELF file). In my tests, this seems to be fast. However, the virtual allocations need to be "close" to the image .text section, but it seems that some parts of the virtual address space are reserved for various purposes (some are just marked as "reserved" in x64dbg). Is there any detailed information about the virtual address layout of a process, and specifically, what virtual address ranges are guaranteed (or very likely) to be usable by MapViewOfFileEx?
Sections are mapped as copy-on-write rather than loaded into physical memory. When the PE loader modifies a page (e.g. due to applying base relocations) it causes those pages to be loaded into memory, but that shouldn't happen unless you've got relocations defined for your sections, which I doubt.
I actually wonder whether your crashes with ASLR are related to base relocations. Base relocations are used to modify instructions and data when the module is loaded at a base address other than its preferred address, which usually means ASLR. If you're statically applying trampolines the relocations might be clobbering your addresses. I recently discovered PPEE as a modern alternative to CFF Explorer, and it shows relocations properly - CFF's relocations view just shows you the raw values rather than the proper extracted offsets. It might help you here.
The base relocations table consists of a number of blocks, each with a base RVA that specifies the start of where relocations should be applied for that block. Each entry in the block has an offset from that base RVA, and a type of relocation. Mostly you'll find DIR64 (aka. 'A'), which is a 64-bit base relocation. For each relocation in the block, the loader adds the offset to the base RVA, reads the 64-bit address at that position, then offsets it by the difference between the preferred base address of the module and the actual address it was loaded at.
I'll go through a quick worked example. In the screenshot above you can see that there's a block with base RVA 0x0007F000, which contains a bunch of relocations. The PE's preferred image base is 0x180000000. Let's say it instead gets loaded at 0x390000000. The loader computes the delta between the preferred address and the actual address, i.e. 0x390000000 - 0x180000000 = 0x210000000. The first relocation in the block is at offset 0x008. So the loader adds 0x008 to 0x0007F000 (the base RVA of the relocation block) and gets 0x0007F008. The loader then takes the 64-bit value at RVA 0x0007F008 and adds the base address delta, i.e. 0x210000000.
Here's a hexdump at RVA 0x0007F000:
We can see that 0x0007F008 holds the 64-bit value 0x0000000180086CC8. The loader adds the base address delta to that value, to get the relocated address: 0x180086CC8 + 0x210000000 = 0x390086CC8. The loader then overwrites the original 64-bit value in memory with the new relocated one. This causes the copy-on-write to load the modified page into memory, containing the relocation.
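The worked example above can be sketched in portable C. This is only an illustration of the arithmetic, not the Windows loader's actual code: the function name apply_dir64 is made up for this sketch, and it assumes the image is held in a flat buffer indexed by RVA.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: apply one DIR64 base relocation.
 * `image` is assumed to be a flat buffer indexed by RVA.
 * block_rva is the base RVA of the relocation block and offset is the
 * 12-bit offset stored in the relocation entry. Returns the new value. */
uint64_t apply_dir64(uint8_t *image, uint32_t block_rva, uint16_t offset,
                     uint64_t preferred_base, uint64_t actual_base)
{
    assert(offset < 0x1000); /* entry offsets are 12 bits wide */

    /* Unsigned subtraction wraps, so this also works when the image is
     * loaded below its preferred base. */
    uint64_t delta = actual_base - preferred_base;

    uint64_t value;
    uint8_t *target = image + block_rva + offset;
    memcpy(&value, target, sizeof value); /* unaligned-safe read */
    value += delta;
    memcpy(target, &value, sizeof value); /* write the relocated value back */
    return value;
}
```

With the numbers from the example (block RVA 0x0007F000, offset 0x008, preferred base 0x180000000, actual base 0x390000000), a stored value of 0x180086CC8 becomes 0x390086CC8.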
If your trampolines are statically modifying code that is targeted by a base relocation, the code may get turned into garbage. While rip-relative addressing has lessened the need for base relocations, it's still common to find relocations that target the address portion of an instruction, e.g. mov rax, [0x180002d08] being relocated to some other address. This means that if you overwrite an instruction with some other instruction, it might still have a relocation applied that messes up the resulting assembly.
In terms of address layout, all addresses are up for grabs by default. By convention DLLs are loaded in memory at higher addresses than the main executable module, and memory addresses below the main module are used for the stack and process heap. The reserved pages just mean that memory was allocated with MEM_RESERVE and not yet committed with MEM_COMMIT, i.e. some block of addresses in virtual memory has been reserved for future use (e.g. stack or heap expansion) but no actual pages have been committed to back those addresses and nothing can write to that area until the region is committed. Process Hacker's memory tab on process properties should give you some insight here, e.g.:
Another option you have is to inject a single read-write-execute section into the PE, then treat it like your own heap and load whatever you like into it, including trampolines. You can guarantee that it'll be near the PE, because it gets loaded sequentially with the other sections.
As an aside, is control flow guard (CFG) enabled on the executable? You can check to see if the IMAGE_DLLCHARACTERISTICS_GUARD_CF bit (0x4000) is set in the DllCharacteristics field on the PE header. CFG can make loading PEs quite slow, since they generate a control-flow bitmap for each section with executable flags, and the performance characteristics can be as bad as O(n^2) just per section. Microsoft fixed some of the startup performance issues but I wouldn't be surprised if they're still affecting you if you have 65k sections.
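As a rough sketch of that check, the following portable C reads DllCharacteristics straight out of a PE file held in memory and tests the 0x4000 bit. The function name pe_has_guard_cf is hypothetical, and the sketch does not validate SizeOfOptionalHeader or anything beyond the two signatures. It relies on DllCharacteristics sitting 70 bytes into the optional header, which holds for both PE32 and PE32+.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GUARD_CF_FLAG 0x4000u /* IMAGE_DLLCHARACTERISTICS_GUARD_CF */

/* Little-endian readers, so the sketch does not depend on host byte order. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: returns 1 if the CFG bit is set, 0 if it is clear,
 * and -1 if the buffer does not look like a PE file. */
int pe_has_guard_cf(const uint8_t *file, size_t size)
{
    assert(file != NULL);
    if (size < 0x40 || file[0] != 'M' || file[1] != 'Z')
        return -1;

    uint32_t e_lfanew = rd32(file + 0x3C); /* file offset of "PE\0\0" */

    /* 4-byte signature + 20-byte COFF header + 72 bytes of optional
     * header (up to and including DllCharacteristics). */
    if (e_lfanew > size || size - e_lfanew < 4 + 20 + 72)
        return -1;

    const uint8_t *nt = file + e_lfanew;
    if (nt[0] != 'P' || nt[1] != 'E' || nt[2] != 0 || nt[3] != 0)
        return -1;

    uint16_t dll_characteristics = rd16(nt + 4 + 20 + 70);
    return (dll_characteristics & GUARD_CF_FLAG) ? 1 : 0;
}
```

Reading the bytes by hand rather than through windows.h structs keeps the check usable from a Linux-hosted rewriting tool.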
For my last post, I was testing a PE file I had generated with MinGW. However, it seems the resulting executable has a low base address 0x400000, and the lower end of the virtual address space seems to be quite polluted. For the Windows native exes I've seen, the base address is 0x140000000, which is in a more "pristine" part of the address space. So I think the loader+MapViewOfFile approach is feasible.
Also, I assume it is possible to statically change the image base to some other value? This probably requires using the relocations.
If your trampolines are statically modifying code that is targeted by a base relocation, the code may get turned into garbage.
I have not tried patching any code yet, but this is an important point I had overlooked. I think such code cannot be statically rewritten, or else the base address would need to be fixed (i.e., ASLR disabled). So I think disabling ASLR is the way to go for now (in addition to fixing the other unsolved bug).
CFG can make loading PEs quite slow, since they generate a control-flow bitmap for each section with executable flags
I also tried creating 1000s of sections without IMAGE_SCN_MEM_EXECUTE, and even with IMAGE_SCN_CNT_UNINITIALIZED_DATA. I also tried different programs. The result is the same: the PE loader is really slow for some reason. In fact, cmd.exe/explorer becomes unresponsive until the new process is created, so this is not really a usable solution as it currently stands.
As you say, defining a single section and copying the trampoline code is another idea. However, this will also be slow, and use a lot of physical memory (since there would be no sharing).
So I think the injected loader+MapViewOfFile is the most feasible approach moving forward. It is fast, and it works provided the virtual address space around the object is "clean". If the base address is a high value, like 0x140000000, then trampolines can also be placed at negative offsets. This makes patching coverage a lot higher (this is also true for PIE executables under ELF/Linux).
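To make "close" concrete, here is a small sketch that assumes each trampoline is reached with a 5-byte jmp rel32. That encoding is an assumption for illustration and is not stated in this thread. The function name jmp_rel32_reaches is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: can a 5-byte `jmp rel32` placed at address `from`
 * reach `target`? The displacement is measured from the end of the
 * instruction and must fit in a signed 32-bit integer. */
int jmp_rel32_reaches(uint64_t from, uint64_t target)
{
    assert(from <= UINT64_MAX - 5); /* keep from + 5 from wrapping */

    /* Unsigned subtraction wraps; the cast reinterprets the result as a
     * signed distance on two's-complement targets. */
    int64_t disp = (int64_t)(target - (from + 5));
    return disp >= INT32_MIN && disp <= INT32_MAX;
}
```

Under that assumption, an instruction at 0x140001000 can reach a mapping at 0x130001000 (a negative offset), whereas an image based at 0x400000 has only about 4MB of address space below it to use.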
Also, I assume it is possible to statically change the image base to some other value? This probably requires using the relocations.
I'm not actually sure how problematic this would be. Technically speaking everything in the PE structure should be referenced by RVAs, so moving the preferred base address (ImageBase) doesn't affect any of that. Anything in the code or data that's referenced directly by VA should be covered by a base relocation entry, because if it wasn't then the VA would be invalid when the image gets relocated by ASLR.
The base relocations would become invalid if you just changed ImageBase and did nothing else, because the compiler generates instructions & data based on the ImageBase at compile time. The Windows PE loader calculates the address delta based on the difference between ImageBase and the actual loaded address, so if you statically modify ImageBase then it will be "out of sync" with the actual opcodes and pointers in data segments. So, for example, if ImageBase was 0x140000000 and you changed it to 0x150000000, virtual addresses in the code and data would still be referencing 0x140000000. With ASLR disabled, the image would be loaded at 0x150000000, and no relocations would be applied (because the actual base address is equal to ImageBase) but everything would still point at 0x140000000. To fix this you need to walk the base relocations table and manually apply your own fixups at the RVA of each address. So in this example, where you've moved ImageBase from 0x140000000 to 0x150000000, you'd need to add 0x10000000 to the value at the RVA of each base relocation.
The main thing you'd need to do is walk the base relocations table and manually apply a relocation to the target of each relocation by however much you shifted ImageBase by. So with the example numbers above you'd add 0x10000000 to the 64-bit value at the RVA of each base relocation entry, except entries where the type field is 0x00 (ABSOLUTE) which you just ignore.
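A minimal sketch of that walk, in portable C, might look like the following. The function name rebase_dir64 is hypothetical. It assumes the image is a flat buffer indexed by RVA on a little-endian host, and it only handles DIR64 entries (type 10, i.e. 0xA) plus ABSOLUTE padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { REL_ABSOLUTE = 0, REL_DIR64 = 10 };

/* Hypothetical helper: walk the base relocation table located at
 * image[reloc_rva .. reloc_rva + reloc_size) and add `delta` to every
 * DIR64 target. Returns the number of fixups applied, or -1 if the table
 * is malformed or contains a type this sketch does not handle (the image
 * may be partially patched in that case). */
long rebase_dir64(uint8_t *image, size_t image_size,
                  uint32_t reloc_rva, uint32_t reloc_size, uint64_t delta)
{
    assert(image != NULL);
    long applied = 0;
    size_t pos = reloc_rva;
    size_t end = (size_t)reloc_rva + reloc_size;
    if (end > image_size)
        return -1;

    while (pos + 8 <= end) {
        /* Block header: base RVA, then total block size in bytes. */
        uint32_t block_rva, block_size;
        memcpy(&block_rva, image + pos, 4);
        memcpy(&block_size, image + pos + 4, 4);
        if (block_size < 8 || block_size > end - pos)
            return -1;

        /* 16-bit entries: type in the top 4 bits, offset in the low 12. */
        for (size_t i = pos + 8; i + 2 <= pos + block_size; i += 2) {
            uint16_t entry;
            memcpy(&entry, image + i, 2);
            unsigned type = entry >> 12;
            unsigned offset = entry & 0x0FFFu;

            if (type == REL_ABSOLUTE)
                continue; /* padding entry, nothing to patch */
            if (type != REL_DIR64)
                return -1;

            size_t target = (size_t)block_rva + offset;
            if (target + 8 > image_size)
                return -1;

            uint64_t value;
            memcpy(&value, image + target, 8);
            value += delta;
            memcpy(image + target, &value, 8);
            applied++;
        }
        pos += block_size;
    }
    return applied;
}
```

For the example above, delta would be 0x10000000 (the new ImageBase 0x150000000 minus the old ImageBase 0x140000000).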
I can't think of any other cases where VAs would be in the PE, instead of RVAs, without a base relocation applied. So it should just be a case of patching the relocations.
As an aside, the load config directory does use VAs in its structure, but I've confirmed that base relocations are applied to those fields.
RVA of the load config directory is 0x0016F740:
The load config struct has fields that use VAs as values, rather than RVAs:
(note: Microsoft has added new fields to this struct, but PPEE doesn't print the full up-to-date set of fields, so it stops at DynamicValueRelocTable even though there are more fields after that)
The base relocations table has a block for that page (0x0016F000):
The offset of the load config directory structure in that page is 0x0016F740 - 0x0016F000 = 0x740.
To help figure out which fields of the load config structure have relocations applied, I dumped the struct offsets for IMAGE_LOAD_CONFIG_DIRECTORY64 and added 0x740 to them:
struct _IMAGE_LOAD_CONFIG_DIRECTORY64 {
/* type field reloc offset, field size */
DWORD Size; /* 0x740 0x4 */
DWORD TimeDateStamp; /* 0x744 0x4 */
WORD MajorVersion; /* 0x748 0x2 */
WORD MinorVersion; /* 0x74a 0x2 */
DWORD GlobalFlagsClear; /* 0x74c 0x4 */
DWORD GlobalFlagsSet; /* 0x750 0x4 */
DWORD CriticalSectionDefaultTimeout; /* 0x754 0x4 */
ULONGLONG DeCommitFreeBlockThreshold; /* 0x758 0x8 */
ULONGLONG DeCommitTotalFreeThreshold; /* 0x760 0x8 */
ULONGLONG LockPrefixTable; /* 0x768 0x8 */
ULONGLONG MaximumAllocationSize; /* 0x770 0x8 */
ULONGLONG VirtualMemoryThreshold; /* 0x778 0x8 */
ULONGLONG ProcessAffinityMask; /* 0x780 0x8 */
DWORD ProcessHeapFlags; /* 0x788 0x4 */
WORD CSDVersion; /* 0x78c 0x2 */
WORD DependentLoadFlags; /* 0x78e 0x2 */
ULONGLONG EditList; /* 0x790 0x8 */
ULONGLONG SecurityCookie; /* 0x798 0x8 */
ULONGLONG SEHandlerTable; /* 0x7a0 0x8 */
ULONGLONG SEHandlerCount; /* 0x7a8 0x8 */
ULONGLONG GuardCFCheckFunctionPointer; /* 0x7b0 0x8 */
ULONGLONG GuardCFDispatchFunctionPointer; /* 0x7b8 0x8 */
ULONGLONG GuardCFFunctionTable; /* 0x7c0 0x8 */
ULONGLONG GuardCFFunctionCount; /* 0x7c8 0x8 */
DWORD GuardFlags; /* 0x7d0 0x4 */
IMAGE_LOAD_CONFIG_CODE_INTEGRITY CodeIntegrity; /* 0x7d4 0xc */
ULONGLONG GuardAddressTakenIatEntryTable; /* 0x7e0 0x8 */
ULONGLONG GuardAddressTakenIatEntryCount; /* 0x7e8 0x8 */
ULONGLONG GuardLongJumpTargetTable; /* 0x7f0 0x8 */
ULONGLONG GuardLongJumpTargetCount; /* 0x7f8 0x8 */
ULONGLONG DynamicValueRelocTable; /* 0x800 0x8 */
ULONGLONG CHPEMetadataPointer; /* 0x808 0x8 */
ULONGLONG GuardRFFailureRoutine; /* 0x810 0x8 */
ULONGLONG GuardRFFailureRoutineFunctionPointer; /* 0x818 0x8 */
DWORD DynamicValueRelocTableOffset; /* 0x820 0x4 */
WORD DynamicValueRelocTableSection; /* 0x824 0x2 */
WORD Reserved2; /* 0x826 0x2 */
ULONGLONG GuardRFVerifyStackPointerFunctionPointer; /* 0x828 0x8 */
DWORD HotPatchTableOffset; /* 0x830 0x4 */
DWORD Reserved3; /* 0x834 0x4 */
ULONGLONG EnclaveConfigurationPointer; /* 0x838 0x8 */
ULONGLONG VolatileMetadataPointer; /* 0x840 0x8 */
ULONGLONG GuardEHContinuationTable; /* 0x848 0x8 */
ULONGLONG GuardEHContinuationCount; /* 0x850 0x8 */
/* in the example PE I'm using (kernelbase.dll) the struct ends here (size = 0x118)
due to it being built with an older version of this struct */
ULONGLONG GuardXFGCheckFunctionPointer; /* 0x858 0x8 */
ULONGLONG GuardXFGDispatchFunctionPointer; /* 0x860 0x8 */
ULONGLONG GuardXFGTableDispatchFunctionPointer; /* 0x868 0x8 */
ULONGLONG CastGuardOsDeterminedFailureMode; /* 0x870 0x8 */
};
So the offsets we care about for the 0x16F000 relocation block are 0x740 to 0x850. Here are the base relocation entries around that range of offsets:
Matching those up with the struct listing above, the relocated fields are:
SecurityCookie (0x798)
GuardCFCheckFunctionPointer (0x7B0)
GuardCFDispatchFunctionPointer (0x7B8)
GuardCFFunctionTable (0x7C0)
GuardAddressTakenIatEntryTable (0x7E0)
GuardEHContinuationTable (0x848)
Notice that this matches all the fields in the load config structure that have values set (so SEHandlerTable and GuardLongJumpTargetTable are skipped because they are set to zero), plus GuardEHContinuationTable, which is present but which PPEE doesn't support displaying yet.
The control flow guard tables (checked function pointers, dispatchers, guarded IAT entries, etc.) also have base relocations applied. This is good to know because it means that none of this stuff will be broken as long as you fix up the base relocations after changing ImageBase in the PE header.
Thanks again for taking the time to write this up. I think changing the ImageBase should be quite feasible, although it will not be a priority initially.
I think the final obstacle is how to handle TLS callbacks. These are a problem since they are called before the entry point, meaning that the .text section may be executed before the trampolines have been put in place (resulting in a crash). I think the solution will be to replace the first TLS callback with the injected loader code, rather than the entry point.
If that works, the roadmap to Windows PE support is as follows:
Step 1. should be relatively easy now the design is settled. Step 2. is somewhat harder, since E9Tool is currently very ELF specific. Step 3. is similarly hard, since the E9Tool/E9Patch code base is Linux specific. This step could be made optional if porting these tools is too much work (i.e., you can still rewrite Windows binaries, but only using Linux or WSL, similar to "cross compiling"). I think step 4. should be relatively easy, but will probably need ImageBase adjustment to be implemented first.
An experimental version of E9Patch for Windows is available here:
To try it out (e.g., on calc.exe), do the following (on Linux):
(copy calc.exe from Windows)
$ ./build.sh
$ ./e9tool -M true -A passthru ./calc.exe -o calc1.exe
Then (on Windows):
(copy calc1.exe from Linux to c:\)
> c:\calc1.exe
It should print some debug information but otherwise run normally.
The current version is for testing only, and is a long way from a usable tool. Some caveats:
I think the final obstacle is how to handle TLS callbacks. These are a problem since they are called before the entry point, meaning that the .text section may be executed before the trampolines have been put in place (resulting in a crash). I think the solution will be to replace the first TLS callback with the injected loader code, rather than the entry point.
The TLS callbacks are executed by the loader in the same order as they are listed in the PE structure, and execute synchronously.
When the process starts, each callback is executed synchronously, in order, in the context of the main thread of the process. The dwReason parameter to the TLS callback is DLL_PROCESS_ATTACH. This occurs before the main entry point executes, as you know.
When a thread is created, each callback is executed synchronously, in order, in the context of the new thread. The dwReason parameter to the TLS callback is DLL_THREAD_ATTACH. This occurs before the entry point of the thread executes. One exception is that DLL_THREAD_ATTACH is not raised for the main thread, since DLL_PROCESS_ATTACH is literally just a special-case DLL_THREAD_ATTACH event for the main thread anyway.
When a thread exits normally (i.e. the thread procedure exits), each callback is executed synchronously, in order, in the context of the exiting thread. The dwReason parameter to the TLS callback is DLL_THREAD_DETACH. This occurs after the thread procedure returns, but before the thread is actually disposed of. A key detail, though, is that terminating a thread (i.e. with TerminateThread) causes the thread to be immediately terminated, skipping the TLS callback.
All of these operations block the thread that the callback relates to, but no other threads, so TLS callbacks can still execute in parallel with code executing in other threads (including other TLS callbacks). As such, you need to be careful about cross-thread accesses. The only exception is DLL_PROCESS_ATTACH, which is guaranteed to be called when the process only has one running thread, so you don't need to worry about parallel execution.
I think a good approach here is to actually have the entire patcher's setup stub (the code you'd usually have run first by patching the entry point) run from a TLS callback if the executable already has one. Your injected TLS callback would look something like this:
// spinwait lock for patching completion
// (LONG, not unsigned long, to match the Interlocked* parameter type)
volatile LONG _e9patch_complete = 0;
// statically write the address of the original TLS entry here, so the code knows where to call
volatile PIMAGE_TLS_CALLBACK _e9patch_original_tls = NULL;
void NTAPI e9patch_tls_callback(PVOID hModule, DWORD dwReason, PVOID pContext)
{
    // first check if the e9patch stub has already completed (i.e. _e9patch_complete is nonzero)
    // note: InterlockedCompareExchange is an intrinsic, not a Windows API, so it doesn't need resolving.
    if (InterlockedCompareExchange(&_e9patch_complete, 0, 0))
    {
        // e9patch init already completed, so just pass this back to the original callback
        _e9patch_original_tls(hModule, dwReason, pContext);
        return;
    }
    if (dwReason == DLL_PROCESS_ATTACH)
    {
        // set up e9patch, passing a flag to say "don't automatically jump to OEP"
        e9patch_init(E9PATCH_FROM_TLS_CALLBACK);
        // set the complete flag
        InterlockedIncrement(&_e9patch_complete);
        // patching is done, so let the original callback see its DLL_PROCESS_ATTACH event too
        _e9patch_original_tls(hModule, dwReason, pContext);
    }
    else
    {
        // DLL_THREAD_ATTACH or DLL_THREAD_DETACH was somehow hit before the patcher finished
        // this shouldn't be possible, so raise an assertion failure (or something like that)
        assert(false);
        // if it turns out there is some legit reason that this might happen, here's a graceful way to deal with it:
        // spinwait for _e9patch_complete before dispatching the original handler.
        while (!InterlockedCompareExchange(&_e9patch_complete, 0, 0)) { }
        _e9patch_original_tls(hModule, dwReason, pContext);
    }
}
This should guarantee that your loader stub executes before anything else, without running into other TLS callbacks, and without breaking the callback you're replacing.
The case where DLL_THREAD_ATTACH is hit before the DLL_PROCESS_ATTACH event completes might happen if an external process calls CreateRemoteThread on your process before e9patch_init completes. I'm not sure if this is possible - it might stall the CreateRemoteThread call until the TLS callback on the main thread completes - but you can at least handle this with a spinwait as shown.
Random aside, for debugging: if you want to get the current process ID and thread ID at runtime, before you resolve APIs, you can read them from the TEB.
On x64, offset 0x40 has a CLIENT_ID struct, which contains a pair of HANDLE type values (which is just typedef void*) for the process ID and thread ID. You can read them with __readgsqword in VC++, e.g.:
uint64_t pid = __readgsqword(0x40);
uint64_t tid = __readgsqword(0x48);
The Microsoft C++ compiler will usually inline GetCurrentProcessId() and GetCurrentThreadId() to just read the TEB like this anyway, but this saves you having to worry about resolving exports.
Just a quick update: The current E9Tool/E9Patch tools are stable enough to successfully rewrite programs such as cmd.exe. The E9Tool passthru, print and call actions have been ported to Windows:
The overall design appears to be validated. The patched binaries can do 1000s of mappings during program initialization with a reasonable delay, and the patched binaries run as expected.
The patched binaries will occasionally trigger a false positive from Windows Defender. This is probably not very surprising. It can be fixed by whitelisting the binary.
Some remaining TODOs:
[...] stdlib.c to Windows (requires a bit of work). Although call actions have been implemented, there is currently no useful programming environment, which makes building applications difficult. This can be added later.
I've merged the initial cut of Windows PE support, so this issue can be closed. If there are any specific problems, then these can be put in separate issues.
The support is still experimental and undocumented, but hopefully this will improve over time.
Many thanks to @gsuberland for providing the initial impetus and technical information. This was really helpful.
Thanks for putting the work in and getting this implemented!
Hi, instrumenting binaries for coverage on Windows is challenging, but your technique is promising, so have you planned to add support for it? You can use https://github.com/lief-project/LIEF for instance and have the same code to handle ELF, PE, and Mach-O with small modifications.
Regards, Andrea