boricj / ghidra-delinker-extension

Ghidra extension for exporting relocatable object files
Apache License 2.0
365 stars 14 forks source link

"Relocation table synthesizer" Analysis False Negative, Symbol Name Confusion, and Comments #6

Closed widberg closed 3 months ago

widberg commented 4 months ago

Following our conversation in #1 I have put together a small test case. The archive delinker_issue.zip contains an executable with a switch statement in main compiled and linked with the Visual Studio 2005 toolchain, the source code for the main function, and delinker artifacts. The "Relocation table synthesizer" misidentifies an absolute operand as not needing a relocation. Granted this is as compared to my model of unlinking with functions as atomic units, you may disagree with what should and should not be relocated.

False negative: The MOVZX at 00401029 has an absolute address operand which is not found to need relocation but should be. The delinker's output says 00401029> No relocation emitted for instruction with interesting primary reference. so it considered it. The target of this operand is the switch statement value table starting at 004010d8.

True positives: The indirect JMP at 00401030 and all of the addresses in the jump table starting at 004010a4 are correctly identified as needing a relocation!

Comment on intra-function unlinking: In my opinion, the relative jumps at 00401008 and 00401027 should not need relocations since they are relative jumps landing within the same function. These labels within a function for branch targets will not move relative to the function start or each other, so it is weird to me to treat them like they are transient. However, it is not wrong to treat them as relocatable as long as the operand width is considered when generating the relocations. Also, without reassembly, if the symbol is defined at an address too far away to fit in the operand, there will be problems. Of course, relative jumps between functions still need to be relocated, which I didn't find any problems with. I don't think you need to change any behavior here, I just wanted to comment on it because it stood out to me.

More granular exports: In some use cases you may not want the entire binary to be exported into a single object file. Exporting individual functions or data elements can be useful in eliminating dead code while relinking, which there will be a lot of if you only want a small portion of the program. Also, I have some rough compiler-specific heuristics for identifying original object file boundaries, which can help identify which functions came from the same source file. So it would be nice to export based on the boundaries I have identified.

A note about switch statement jump and value tables: I consider these tables to belong to the function in which the switch statement that refers to them is located. If you allow unlinking individual functions or groups of functions into individual object files, then these tables should appear in the same object file as the function. As opposed to treating them like their own standalone data elements. I say this because both Ghidra and IDA do not consider these tables in the function's range. So if you were to just use the functions range to determine which bytes get put in the object file you would miss the tables.

Sorry for writing so much, unlinking is just really fun. I'd love to hear your thoughts on everything, there aren't many people to talk with about this kind of thing.

boricj commented 4 months ago

Following our conversation in #1 I have put together a small test case. The archive delinker_issue.zip contains an executable with a switch statement in main compiled and linked with the Visual Studio 2005 toolchain, the source code for the main function, and delinker artifacts. The "Relocation table synthesizer" misidentifies an absolute operand as not needing a relocation. Granted this is as compared to my model of unlinking with functions as atomic units, you may disagree with what should and should not be relocated.

I've designed my tooling around two problems to solve:

The rest will make more sense when viewed in that light.

False negative: The MOVZX at 00401029 has an absolute address operand which is not found to need relocation but should be. The delinker's output says 00401029> No relocation emitted for instruction with interesting primary reference. so it considered it. The target of this operand is the switch statement value table starting at 004010d8.

This is because Ghidra's analyzers miscalculated the reference's target address for that instruction:

00401029    0f b6 90 d8 10 40 00    MOVZX EDX,byte ptr [EAX + 0x4010d8]

The reference's destination should be equal to 0x004010d8, but Ghidra set it to 0x004010d3. To fix this, the reference must be modified by hand to point to the correct address.

My analyzers expect that instructions have correct references and absolute addresses within data are typed as pointers. That design choice is deliberate because trying to guess whether an integer is an address or not is borderline impossible in the general case, therefore they don't try to solve that problem. The downside with this approach is that inaccuracies inside the Ghidra database must be fixed one way or another in order for my delinking tooling to work properly (and Ghidra's analyzers by themselves aren't that accurate).

Note that my x86 analyzer has mostly been tested on aln, a Linux program built in 1994. So far I've focused the bulk of my efforts on MIPS as part of my decompilation project, therefore it's possible that X86CodeRelocationSynthesizer.java may be missing some instruction operand mask patterns. If a x86 instruction with a correct reference couldn't be processed it will probably be because of this, but that isn't the case here.

True positives: The indirect JMP at 00401030 and all of the addresses in the jump table starting at 004010a4 are correctly identified as needing a relocation!

The upside with this approach is that if the contents of the Ghidra database are accurate, it just works. While it takes time to refine it into an acceptable shape, once it's there I've managed to delink hundreds of kilobytes at once without any hacks, kludges or manual annotations.

Comment on intra-function unlinking: In my opinion, the relative jumps at 00401008 and 00401027 should not need relocations since they are relative jumps landing within the same function. These labels within a function for branch targets will not move relative to the function start or each other, so it is weird to me to treat them like they are transient. However, it is not wrong to treat them as relocatable as long as the operand width is considered when generating the relocations. Also, without reassembly, if the symbol is defined at an address too far away to fit in the operand, there will be problems. Of course, relative jumps between functions still need to be relocated, which I didn't find any problems with. I don't think you need to change any behavior here, I just wanted to comment on it because it stood out to me.

While my analyzers do generate relocations for relative jumps, they will be trimmed from the exported object files if both the source and the destination addresses are part of the same address range (this will make more sense in the next point)... and there aren't any shenanigans going on.

Insanity-inducing shenanigans One of the really nasty edge case of the MIPS architecture requires bending these rules. On that platform, assemblers can vacuum up the target instruction of a branch instruction into its delay slot (if it contains a NOP) and move the branch target one instruction forward. This optimization reduces the instruction stream count by one. However, if the original branch target was the high part of a `R_MIPS_HI16`/`R_MIPS_LO16` relocation pair, this effectively _duplicates_ a `HI16` instruction. The `LO16` instruction now has two `HI16` parents and all hell breaks loose because no object file format can represent that. The only way to fix this is to undo the optimization in order to land back on normalized `HI16`/`LO16` relocation pairs, by shifting the branch target back one instruction and ignoring the instruction inside the branch delay slot. I'm doing this by adjusting the addend of the relative branch's relocation if my analyzer detects this situation. That altered relocation is then flagged as mandatory to ensure it appears inside the object file, regardless of the usual relative relocation culling mechanism. Of all the architectures I could've picked to try and delink code from, of course it _had_ to be goddamn MIPS...

More granular exports: In some use cases you may not want the entire binary to be exported into a single object file. Exporting individual functions or data elements can be useful in eliminating dead code while relinking, which there will be a lot of if you only want a small portion of the program. Also, I have some rough compiler-specific heuristics for identifying original object file boundaries, which can help identify which functions came from the same source file. So it would be nice to export based on the boundaries I have identified.

You can export a subset of the program by making a selection inside of the listing view and checking the "Selection Only" checkbox in the export dialog. The selection doesn't have to be contiguous, you can pick and choose whatever you want to export. The only real restriction is that you shouldn't cut across a variable or a function for obvious reasons. You can also restrict the relocation synthesizer analysis with a selection, just make sure you don't try and export bits that weren't analyzed or whose analysis is out of date.

This is why relative relocations will always be generated inside the synthesized relocation table during analysis. It's only during the exportation of an object file that we can determine whether or not to trim them, based on the contiguous address ranges of the selection.

More insanity-inducing shenanigans Note that memory blocks shouldn't cut across anything either. That's not a problem usually, _unless_ the linker was black-out drunk when it created the artifact. Tracking down memory corruption due to a truncated variable is **not** fun and I had to do it twice so far. I should really make the exporter throw an error in that case...

A note about switch statement jump and value tables: I consider these tables to belong to the function in which the switch statement that refers to them is located. If you allow unlinking individual functions or groups of functions into individual object files, then these tables should appear in the same object file as the function. As opposed to treating them like their own standalone data elements. I say this because both Ghidra and IDA do not consider these tables in the function's range. So if you were to just use the functions range to determine which bytes get put in the object file you would miss the tables.

My delinker tooling is completely unopinionated as far as structure goes, not including these bytes in the exportation will create external references just like any other variable. Ultimately, it's up to the user to know what they are exporting. Leveraging program trees to split up the program into semantically useful pieces can help here.

Sorry for writing so much, unlinking is just really fun. I'd love to hear your thoughts on everything, there aren't many people to talk with about this kind of thing.

Don't worry about it, I have the same problem on that topic. Feel free to send me an email, GitHub issues aren't necessarily the best medium for long, in-depth technical discussions.

widberg commented 4 months ago

My analyzers expect that instructions have correct references and absolute addresses within data are typed as pointers. That design choice is deliberate because trying to guess whether an integer is an address or not is borderline impossible in the general case, therefore they don't try to solve that problem. The downside with this approach is that inaccuracies inside the Ghidra database must be fixed one way or another in order for my delinking tooling to work properly (and Ghidra's analyzers by themselves aren't that accurate).

I take IDA's analysis being "perfect" for granted until I use Ghidra for something and it starts missing stuff. I wouldn't expect you to try and solve this on your end. Ghidra just needs to improve some. Altough it is unfortunate since, tracking these down and fixing them manually would take considerable time in a binary the size of mine. Maybe I can find a way to automate this kind of fix and save myself some time.

While my analyzers do generate relocations for relative jumps, they will be trimmed from the exported object files if both the source and the destination addresses are part of the same address range (this will make more sense in the next point)... and there aren't any shenanigans going on.

That is great, sounds like a good solution to me. Those MIPS shenanigans are terrifying though.

You can export a subset of the program by making a selection inside of the listing view and checking the "Selection Only" checkbox in the export dialog. The selection doesn't have to be contiguous, you can pick and choose whatever you want to export.

I did not know that! My last 2 comments in the original post are unnecessary then. Being unopinionated is great, I can use my own analysis to inform which sections to unlink without imposing that structure elsewhere.

This was an informative discussion. I'll close this issue now since it seems that there is nothing that needs to be fixed in this project, but I might re-open it at some point if I run into anything else. While I agree that email is better for technical discussions than GitHub issues, I want more information about unlinking to be indexable by search engines so I'd prefer to keep the discussions out in public. Thats also why I've been continuing to say unlinker, since this project doesn't come up when searching for it. I've probably said unlinker enough at this point and should switch to delinker which is just a better word. Anyway, thanks for the chat! (and the project too, of course)

widberg commented 4 months ago

Hello again. After using the workaround from the Ghidra issue, I gave the delinker another shot and exported a COFF object file (main_obj.zip). One problem I've noticed is that since Ghidra names the switch data something like switchD_00401030::switchdataD_004010a4, when I export the relocation is for the symbol _switchD_00401030__switchdataD_004010d8 but the actual data is given the symbol _switchdataD_004010a4. The same thing happens with the cases switchD_00401030::caseD_5 becomes _switchD_00401030__caseD_5 in the relocation table and _caseD_5 as the code label. So the linker will not be able to correctly corollate the two. Is there a difference in how you get the names for the relocation and the data that is causing this? You had mentioned demangling support, which would be nice but unnecessary for this to work, the names just need to be the same.

boricj commented 4 months ago

I'll look into this when the sun shines in my timezone, but I can give some general comments right away.

In my extension, currently I'd say the relocation analyzers are in good shape, the relocation table model/implementation is meh and the object file exporters are in rough shape. This directly correlates with the number of times each component got refactored. The symbol handling within the exporters is probably the weakest part overall and it hasn't changed much since the days of ghidra-unlinker-scripts. That's just to give a sense of the areas and relative quantities of tech debt laying around here.

The COFF exporter started out as a copy of the ELF exporter that's been hacked down into outputting something that the MSVC toolchain will ingest. The ELF exporter appears to generate a symbol table with different symbol names than the COFF exporter, which suggests bits and pieces may have been lost along the way to COFF.

I'm also using the ELF exporter almost exclusively on my end and merely did some smoke testing with the COFF exporter while I was merging it, so it's nowhere near as battle-tested on my end as the ELF exporter. One thing you can try out in the meantime is exporting to ELF and use the MinGW toolchain which can ingest that object file format, which is what I did the last time I yeeted delinked Linux code onto Windows back when I had no COFF exporter.

widberg commented 4 months ago

It looks like there is a mix of Symbol.getName() and Symbol.getName(true) calls throughout the code. Switching everything to the latter fixed my issue.

widberg commented 4 months ago

Some other stuff I've noticed running the delinker on the full binary:

Ghidra adds hundreds of XRefs to addresses resembling bit masks. In my case 0x007fffff, in every case, it wouldn't make sense for the instruction operand to be an address, like ANDs and XORs. One more thing to be aware of since relocating these would cause issues. Ghidra problem, not much you can do. IDA usually does the right thing in cases like this. It's easy enough to delete the references to addresses that I know would never be accessed.

The delinker doesn't seem to know what to do with Ghidra's label+offset style references. An example is this instruction comparing a single character in a string CMP byte ptr [s_l_009ac588+18],0x4c. Ghidra puts the label at the beginning of the string and represents the character's address as that label plus the number of bytes after the label until the character. It does log 008cc817> No relocation emitted for instruction with interesting primary reference. messages for these sometimes which is nice. Other times it looks like it just uses the symbol for the start of the string for the relocation but still creates a symbol at the offset. I thought I saw some SymbolWithOffset stuff in the extension code, is that supposed to handle stuff like this? There is one of these at 0x00401021 in Test2.exe. Test2.zip

Even after using the workaround and the instructions having the right references I still get 004393a4> No relocation emitted for instruction with interesting primary reference. messages for several instructions which I think should be simple. For example, the instruction MOV EAX,dword ptr [ECX + ECX*0x1 + DAT_00a05858] where the DAT_00a05858 references 0x00a05858 and the portion of the operand making up the address is just the absolute address value 0x00a05858, which is all correct. Does Ghidra have an odd way of representing operands of this form which makes it non-trivial to know which bytes need to be relocated? There is one of these at 0x00401048 in Test2.exe.

All in all, it looks like there are only a few hundred places that need attention which isn't bad for a 7MB binary.

boricj commented 4 months ago

The delinker doesn't seem to know what to do with Ghidra's label+offset style references. An example is this instruction comparing a single character in a string CMP byte ptr [s_l_009ac588+18],0x4c. Ghidra puts the label at the beginning of the string and represents the character's address as that label plus the number of bytes after the label until the character. It does log 008cc817> No relocation emitted for instruction with interesting primary reference. messages for these sometimes which is nice. Other times it looks like it just uses the symbol for the start of the string for the relocation but still creates a symbol at the offset. I thought I saw some SymbolWithOffset stuff in the extension code, is that supposed to handle stuff like this? There is one of these at 0x00401021 in Test2.exe. Test2.zip

I'm not seeing that one on my end. Here's the synthesized relocation table I get on Test2.exe for FUN_00401000, just after running the auto analysis without the x86 Constant Reference Analyzer:

Memory block Location Type Symbol name Addend
.text 00401006 RelativePC LAB_0040100b -1
.text 00401010 Absolute PTR_printf_0040209c 0
.text 00401016 Absolute DAT_00403018 0
.text 0040101b Absolute DAT_004020f4 0
.text 00401024 Absolute s_lo,_World!_0040301b 0
.text 0040102a Absolute DAT_004020f8 0
.text 00401039 Absolute PTR_atoi_004020a4 0
.text 00401051 Absolute DAT_004020fc 0

There is a relocation spot at 00401024 and the reference's destination is 0040301b, without an offset. I've uploaded the results of the auto analysis I get as a Ghidra zip file here: Test2.gzf.zip.

Here's the synthesized relocation table I get with the x86 Constant Reference Analyzer:

Memory block Location Type Symbol name Addend
.text 00401006 RelativePC LAB_0040100b -1
.text 00401010 Absolute PTR_printf_0040209c 0
.text 00401016 Absolute s_Hello,_World!_00403018 0
.text 0040101b Absolute DAT_004020f4 0
.text 00401024 Absolute s_Hello,_World!_00403018 3
.text 0040102a Absolute DAT_004020f8 0
.text 00401039 Absolute PTR_atoi_004020a4 0
.text 00401051 Absolute DAT_004020fc 0

In both cases, the exported ELF object file appears to be self-consistent, although the first case does have a wonky but valid representation of the string due to imperfect analysis. References with offsets should work without issues as far as I remember, but I haven't touched SymbolWithOffset.get() in a very long time.

I'm using Ghidra 11.1.1 and ghidra-delinker-extension v0.4.0.

Even after using the workaround and the instructions having the right references I still get 004393a4> No relocation emitted for instruction with interesting primary reference. messages for several instructions which I think should be simple. For example, the instruction MOV EAX,dword ptr [ECX + ECX*0x1 + DAT_00a05858] where the DAT_00a05858 references 0x00a05858 and the portion of the operand making up the address is just the absolute address value 0x00a05858, which is all correct. Does Ghidra have an odd way of representing operands of this form which makes it non-trivial to know which bytes need to be relocated? There is one of these at 0x00401048 in Test2.exe.

That one is because of an operand mask case that isn't handled inside X86CodeRelocationSynthesizer.java. The mask itself can be seen by right-clicking the instruction and selecting "Instruction Info".

Ghidra doesn't seem to provide a way to access the mask for just the 32-bit displacements within complex x86 addressing modes (but the value itself is exposed as part of the operand object array), which I need in order to identify the relocation spot. I've hand-rolled extra masks to account for the MOD R/M and SIB bytes that can appear inside the operand masks, but I probably didn't encounter that variant back when I delinked the aln executable.

Adding the extra mask for this within X86CodeRelocationSynthesizer.java should be straightforward, although long-term this area of the analyzers could use a refactoring, the way the masks are handled in the code is a bit of a hack. Also, I'll have to add a test case to cover all the various x86 addressing modes exhaustively, instead of just fixing these issues as they come.

widberg commented 4 months ago

I noticed in your Ghidra project the instruction is displayed as MOVSX EAX,byte ptr [s_lo,_World!_0040301b] whereas in mine it is MOVSX EAX,byte ptr [s_lo,_World!_00403018+3]. When exporting as COFF from your project it works as expected. I'm using the same Ghidra and extension versions as you, did you do something to fix the reference in yours? Here is my Ghidra project and the COFF object file produced, Test2_artifacts.zip. The notable reference is at 0x00000421.

boricj commented 4 months ago

I think I've tracked it down to the COFF exporter.

The exporters work on program bytes, so they need to unapply relocations to get relocatable section bytes for the object files.

When the COFF exporter calls relocationTable.getOriginalBytes() right here, it's passing false to the encodeAddend parameter: the first step of masking off the bits targeted by a relocation is done (leaving zeroes), but the second step of patching in the addend is skipped. This effectively corrupts any relocations with non-zero addends.

Passing true to encodeAddend at that location fixes that problem. That parameter is there for ELF, to handle either REL or RELA relocation tables. From what I've read, COFF seems to only support one relocation table format, with addends encoded in.

I really need to beef up the testing suite all around. I don't think there's a test that contains a non-zero addend in there.

widberg commented 4 months ago

Thank you for being so attentive with these issues. I'll try to put together a small test program that demonstrates the issue where it is unable to emit a relocation at all, as opposed to the corrupted one above, for the symbol+offset case, since that seems to be the last problem surfaced by running the extension against my whole target binary.

widberg commented 4 months ago

I can't reproduce the problem in a smaller example for the life of me. So forgive me for the massive Ghidra file too big for GitHub (https://mega.nz/file/MiFWla5L#ARJ-lrsBDxdP_iajbU7GV7BFGMS53jZQT_AbeVl7H70). This issue manifests in several places, but I have selected one function with a few occurrences, FUN_004d8090. These lines show up in the log when running the analyzer.

004d8096> No relocation emitted for instruction with interesting primary reference.
004d80ab> No relocation emitted for instruction with interesting primary reference.
004d80b9> No relocation emitted for instruction with interesting primary reference.
004d80c7> No relocation emitted for instruction with interesting primary reference.

I tried debugging it and found that referenceManager.getReferencesFrom(fromAddress) returns 2 references for the instructions at these addresses and SymbolWithOffset.get(program, reference) returns null for both. I am sorry I can't be more helpful.

boricj commented 4 months ago

That one is because there are no symbols defined at address 0x009ac588. I'm not sure why Ghidra didn't define a symbol at the beginning of this string, but defining one fixes these errors.

The relocation analyzers normalize references to the beginning of what Ghidra calls a CodeUnit. The string from 0x009ac588 to 0x009ac59b is one such CodeUnit and SymbolWithOffset is there to represent references to locations within a CodeUnit. Without a symbol defined at the beginning of the CodeUnit, there is no symbol name to give to SymbolWithOffset and the analyzer bails out.

I'll need to improve the error messaging, one shouldn't have to debug the extension to get a clear reason why the analyzers failed to synthesize a relocation.

widberg commented 4 months ago

Ah that makes sense, thank you. Speaking of error messages, the log gets truncated to ~80 lines even in the application.log file. Is there a way to see the full log? I'm not super familiar with debugging Ghidra extensions.

boricj commented 4 months ago

Speaking of error messages, the log gets truncated to ~80 lines even in the application.log file. Is there a way to see the full log? I'm not super familiar with debugging Ghidra extensions.

Weird, Ghidra normally truncates after 500 lines and leaves a distinctive message in the log if truncation has happened. If that message isn't present, then the analyzer's log hasn't been truncated.

I'm not aware of a setting to change that number, but one way to override it would be to set a breakpoint at the start of the method RelocationTableSynthesizerAnalyzer.added(), run the Relocation table analyzer, then increase the maxSize variable inside the MessageLog instance through the debugger (works on my side using VS Code). Another way would be to modify that constant and recompile Ghidra.

If the number of lines in the log is suspiciously short, then there might be pointers that aren't typed as such inside the data sections, references that are missing inside instructions or functions that aren't disassembled ; things that would make the analyzer outright miss relocation spots to analyze. One way to test for these would be to export the executable as one big object file and relink it at a different base address, so that absolute references to the original address space would trap inside the new address space.

widberg commented 3 months ago

I think I've sorted out my log file confusion, thank you for the help. I'm down to the last few failed relocation messages in the massive binary.

008b26a5> No relocation emitted for instruction with interesting primary reference.
008b2839> No relocation emitted for instruction with interesting primary reference.
008bed15> No relocation emitted for instruction with interesting primary reference.
008beea9> No relocation emitted for instruction with interesting primary reference.

In all of these places, the compiler did a little trick with the switch statement jump table indirect jumps to save subtracting the index register at runtime by subtracting from the disp32 at compile time. Ghidra still figures out what's happening and puts the correct references in, but the delinker has trouble since there is no symbol at the address in the disp32. These only occur in statically linked library functions I don't intend to delink so it's no trouble for me to just ignore them, but it was interesting enough I figured I'd mention it here.

boricj commented 3 months ago

In all of these places, the compiler did a little trick with the switch statement jump table indirect jumps to save subtracting the index register at runtime by subtracting from the disp32 at compile time.

I didn't anticipate this particular scenario while designing the relocation synthesizers, that ties with the symbol handling issues I've mentioned previously. That area is overdue for a proper redesign. Also, I expect that unit tests for these kinds of low-level edge-cases will be written in assembly. All we care about is a Ghidra database exhibiting the issue, so we don't actually have to bother with trying to write a valid program in a compiled language.

As a side note, I've just started a new consulting gig I'll be focusing on in the short-term. I'm still available to help figure out issues, but I don't expect to work on this extension until I'm on vacation, beginning on August 5th.

As a side-side note, I've enabled discussions in this repository and started a topic about extant delinking tooling out there (#7), in case you've missed it.

widberg commented 3 months ago

Congratulations on the consulting gig! I certainly don't expect your time, but I do appreciate it. I've gotten to know my way around the code enough to add hacky patches to keep me going, so I'm not blocked on anything. Thank you for putting together that list of delinkers, I'll make sure to comment any new ones I find.

boricj commented 3 months ago

I've started hot-fixing the various issues you've mentioned (it's all on the master branch) and the COFF addend encoding happens to be trickier than expected.

On x86, direct calls and jumps are relative to the next instruction. This means that the addend inside the relocation table must be non-zero (typically -4 for a standard CALL instruction). On ELF, that addend is directly encoded as 0xFFFFFFFC in the section bytes and everything's fine. On COFF , it's encoded as 0x00000000. As far as I can tell, the size of the relocation patch is subtracted from the addend before applying it.

On top of that, the encodeAddend set to false bug happens to compensate this bug because the target is cleared out without patching in the addend, leaving the target with 0x00000000. So setting encodeAddend to true breaks direct calls and jumps to external symbols because of that pesky -4 implicit offset.

Therefore, object file formats will not necessarily agree on the contents of unrelocated section bytes and I did not anticipate that quirk when designing my extension... Right now I've kludged it by piping a flag from the COFF object file exporter all the way to RelocationRelativePC to tell it to adjust the target value just before unapplying it, but I'll have to properly refactor that part at some point.

widberg commented 3 months ago

Thank you for the fixes, they have improved things greatly. My relinked game makes it most of the way through the loading process before getting caught in an infinite loop on one of the threads. I haven't quite tracked down the cause, but I do know where it is getting stuck. While taking a break from solving that issue, I searched for other problems and found two.

When exporting the same binary into multiple object files, i.e. split code and data, and trying to relink them I run into a bunch of unresolved external symbol errors since it looks like most data is given the storage class IMAGE_SYM_CLASS_STATIC which has local scope and is not visible to the object file with the code. Using the storage class IMAGE_SYM_CLASS_EXTERNAL for data seems to fix this.

For some reason when relinking, the jump at 0x008e8e40 in Massive2.gzf (I changed some stuff since the last upload) turns into E9 00000000 and does not appear in the COFF file's relocation table. This is odd since it does look like it is in the relocation table in Ghidra. I suspect something wrong on my end with how I'm relinking, but I figured I should mention it here in case you found something I'm missing in that area.

I have been using this delink_exec.py script to invoke the delinker with my address ranges.

widberg commented 3 months ago

Quick update. Renaming duplicate symbols in the Ghidra database seems to have fixed the infinite loop. I think several of the switch case labels had the same names as other labels in the function which caused the looping. The two problems I described above are still happening though.

boricj commented 3 months ago

I've overhauled the symbol handling in the exporters. I've done it to get more control on the names of the symbols included in the object file (there's a symbol name preference option now to make exporters prioritize secondary names if applicable), but that might indirectly also help highlight issues within a Ghidra database on your end.

As a side-effect due to the new implementation, a human-readable error will now be thrown on export if two primary symbols with identical names within the same section are included in the selection. The synthesized relocation table works with primary symbol names and Ghidra doesn't enforce their uniqueness. That should prevent infinite loops due to thunks sharing the same name as the real functions and probably other similar issues that were silently ignored before.

Duplicates are still tolerated for external symbols as I do not expect issues in that case. Right now, duplicates across sections won't trigger that error either (it's per-section currently), but that's less likely to occur.

It's a big enough step that I'm planning on tagging a new release at the end of the week. I'll let you test and report issues (if any) with this new implementation in the meantime.

widberg commented 3 months ago

I've spent some time playing with the latest version and it works great! The only hack I have left in my local copy is adding the Ghidra symbol names as debug symbols to make it easier to correlate stuff between the relink and the original since things move around quite a bit with the amount I have to cut out. I am still seeing that weird zeroed out address operand after relinking that I mentioned in this comment. Also, there is the occasional graphical glitch with some nonsense quads showing up after playing the game for a while. These two problems are probably due to errors in my Ghidra database not problems with the extension. Since all the problems I was having with the extension are fixed I'll close this issue. If I figure out what's going wrong with the last two things I'll let you know. Thank you for all the work you have put into this for me! I agree that a release is appropriate with all the improvements made.

widberg commented 3 months ago

I think I tracked down the problems in my database/export technique. Splitting contiguous ranges in the export selection into separate object files instead of one big one solved the zeroed-out operand problem somehow, I don't really understand what the problem was but it is solved now. Finding all the instructions in the export selection that were not in a function and adding them to functions seems to have fixed the graphical glitches although it is hard to tell since these glitches were not very reproducible. Since the extension only considers instructions in functions when computing relocations, any undefined functions or labeled code other than a function was not being delinked. Changing the base address probably should have caused this code to crash when executing but I was getting unlucky and a DLL loaded in the original executable address space and calling into whatever addresses it was didn't crash, it just resulted in some weird visuals. Funny enough I had already defined all the undefined functions and added the standalone code to functions previously but some other processing I had done undid some of that work without my noticing.

boricj commented 3 months ago

I am still seeing that weird zeroed out address operand after relinking that I mentioned in this comment.

Splitting contiguous ranges in the export selection into separate object files instead of one big one solved the zeroed-out operand problem somehow, I don't really understand what the problem was but it is solved now.

I wonder if it has something to do with the fact that IMAGE_SECTION_HEADER.NumberOfRelocations is a 16-bit value, maybe the COFF exporter doesn't properly handle having more than 65535 relocations inside a section. Splitting the exportation across multiple smaller COFF object files could incidentally work around that potential issue.

Finding all the instructions in the export selection that were not in a function and adding them to functions seems to have fixed the graphical glitches although it is hard to tell since these glitches were not very reproducible. Since the extension only considers instructions in functions when computing relocations, any undefined functions or labeled code other than a function was not being delinked.

I'll probably have to revisit the assumption that all instructions are inside functions at some point. It's a reasonable one so far for compiled code written in C or a similar high-level language (although subroutines and naked instructions are a thing in Ghidra), but not for assembly code in general.

Changing the base address probably should have caused this code to crash when executing but I was getting unlucky and a DLL loaded in the original executable address space and calling into whatever addresses it was didn't crash, it just resulted in some weird visuals.

Ooh nasty, I didn't expect that possibility. Blacklisting the address ranges from the original address space one way or another would prevent that from happening. That'll be useful to know when I mess with a platform with a working MMU again.

boricj commented 3 months ago

I am still seeing that weird zeroed out address operand after relinking that I mentioned in this comment.

Splitting contiguous ranges in the export selection into separate object files instead of one big one solved the zeroed-out operand problem somehow, I don't really understand what the problem was but it is solved now.

I wonder if it has something to do with the fact that IMAGE_SECTION_HEADER.NumberOfRelocations is a 16-bit value, maybe the COFF exporter doesn't properly handle having more than 65535 relocations inside a section. Splitting the exportation across multiple smaller COFF object files could incidentally work around that potential issue.

I've figured it out. The extended relocation count located within the first relocation entry of the relocation table didn't account for itself, resulting in an off-by-one error. I've fixed it in boricj/ghidra-delinker-extension@f47c58c4f62b3718ad9e0c4886aba450662f793e. Hopefully that didn't break anything else, the COFF exporter could use some more tests on the wonkiest parts of that object file format...

widberg commented 3 months ago

That fixed it! At this point, I'm not aware of any issues whatsoever in the relinked executable which is impressive given the size and complexity of it. Thank you again for all the help. You have a really cool project here and I hope more people find it.