Open RedDragonWebDesign opened 4 years ago
Apparently I missed #304 when it was submitted, apologies for that.
It should be possible to improve the decompilation using the features added in 9.1 for decompiling syscalls. The script `ResolveX86orX64LinuxSyscallsScript.java` and the system call exercise in `${GHIDRA_HOME}/docs/GhidraClass/Advanced/improvingDisassemblyAndDecompilation.pdf` are a rough outline of what you need to do, but there are a few extra steps. Basically, you'll have to add/change some definitions that we've already added/changed for x64 and gcc.
1. In `${GHIDRA_HOME}/Ghidra/Processors/x86/data/languages/ia.sinc`, you will have to change the pcode implementation of `INT imm8` from `{ tmp:1 = imm8; intloc:$(SIZE) = swi(tmp); call [intloc]; }` to `{ tmp:1 = imm8; swi(tmp); }`. In 9.1.2, this is on line 3363 of `ia.sinc`. You will have to restart Ghidra for this change to take effect. The old implementation was written before the syscall-handling stuff. The idea was to model this interrupt as getting a function pointer from "somewhere" and then calling the corresponding function. With the new stuff, the way to implement the pcode is to use the `swi` userop (basically, a user-defined pcode instruction with a name but no semantics), which will then be overridden with a `CALLOTHER_OVERRIDE_CALL` reference. It's important for the `swi` instruction to not have an output if you're going to override it.
2. [...] `x86-16.cspec`, analogous to the `syscall` calling convention defined in `x86-64-gcc.cspec`.
3. [...] `INT 21` instruction. Using the map from numbers to functions, you will then apply the correct `CALLOTHER_OVERRIDE_CALL` reference at each instance.

That's quite a bit. I would recommend getting an x64 Linux libc shared object, running `ResolveX86orX64LinuxSyscallsScript.java` on it, and observing it as it runs. It does basically everything you need to do, once you've supplied the correct definitions.
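To make the "map from numbers to functions" idea concrete, here is a minimal sketch in plain Python. It does not use the Ghidra API: it just scans raw 16-bit x86 bytes for `MOV AH, imm8` (`B4 ib`) immediately followed by `INT 21h` (`CD 21`) and looks up the AH value. The table `DOS_FUNCTIONS`, its labels, and `find_int21_calls` are hypothetical names chosen for illustration, and a byte scan like this can produce false positives that a real script working on the disassembly would avoid.

```python
# Hypothetical sketch, not Ghidra API code: a stand-in for the step that
# finds each INT 21h and resolves its function number (held in AH).
DOS_FUNCTIONS = {
    0x09: "print_string",
    0x30: "get_dos_version",
    0x3D: "open_file",
    0x3E: "close_file",
    0x3F: "read_file",
    0x40: "write_file",
    0x4C: "terminate",
}


def find_int21_calls(code: bytes):
    """Return (offset of the INT instruction, AH value, label) per match."""
    hits = []
    for i in range(len(code) - 3):
        # B4 ib = MOV AH, imm8 ; CD 21 = INT 21h
        if code[i] == 0xB4 and code[i + 2] == 0xCD and code[i + 3] == 0x21:
            ah = code[i + 1]
            label = DOS_FUNCTIONS.get(ah, "unknown_%02Xh" % ah)
            hits.append((i + 2, ah, label))
    return hits


# MOV AH,30h / INT 21h / MOV AH,4Ch / INT 21h
sample = bytes([0xB4, 0x30, 0xCD, 0x21, 0xB4, 0x4C, 0xCD, 0x21])
for offset, ah, label in find_int21_calls(sample):
    print("INT 21h at +%d: AH=%02Xh -> %s" % (offset, ah, label))
```

In a real script, each hit would be the place to attach the override reference pointing at the function that models that DOS call.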
As for returning a struct split among multiple registers, you should be able to do this similar to the way the `ldiv_t` structure is returned in the "Multiple Storage Locations" exercise in `improvingDisassemblyAndDecompilation.pdf`. You'll have to enable custom storage for the function(s). Note that the decompiler might gag a little on code accessing fields of such a structure.
You may want to take a look at #1543, as that's more or less what I started to build.
Thanks for the detailed comments. I'm not advanced enough to write Ghidra code to improve the decompilation of these DOS-era interrupts myself.
But if somebody else wrote the engine, I'd be happy to write the data. That is, I could help translate the list of interrupts from one of these websites into whatever data file or object we need for our code.
Looks like it's fairly complicated. Even a simple interrupt like INT 21h, AH=30h, getDosVersion() has the following complex features:
We can ignore a lot of that complexity and, in the beginning, focus on providing the most common use cases. Then we can let open source/crowdsourcing kick in and people can come along and improve it, as needed.
Here's a rough draft of a schema we could use for our data file/data object.
Maybe @agatti would be interested in expanding his engine to accept a data file/object, and we could add interrupts to the data file over time.
Anyway, just throwing some ideas out there.
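As an illustration of what one entry in such a data file could look like, here is a small Python sketch. Every field name (`interrupt`, `selector`, `inputs`, `outputs`) and the helper `to_json` are hypothetical, and the entry only lists the AL/AH outputs of INT 21h, AH=30h; it is deliberately incomplete rather than a full description of the call.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class InterruptEntry:
    """One hypothetical data-file record describing an interrupt function."""
    interrupt: int    # interrupt number, e.g. 0x21
    selector: dict    # registers that pick the function, e.g. {"AH": 0x30}
    name: str         # label to show in the decompiler
    summary: str      # short human-readable description
    inputs: dict = field(default_factory=dict)   # register -> meaning
    outputs: dict = field(default_factory=dict)  # register -> meaning


get_dos_version = InterruptEntry(
    interrupt=0x21,
    selector={"AH": 0x30},
    name="getDosVersion",
    summary="Get DOS version",
    outputs={"AL": "major version", "AH": "minor version"},
)


def to_json(entries):
    """Serialize entries to JSON so they can live in a plain data file."""
    return json.dumps([asdict(e) for e in entries], indent=2)


print(to_json([get_dos_version]))
```

Keeping the records as plain data means new interrupts can be contributed without touching the engine code.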
Having Ghidra comment the interrupts based on a list is also an option. This is what IDA used to do. This could be a good band-aid solution if the above is too time-consuming to implement right away.
This would provide search functionality, so you could search for things like "open file", "write file", and other areas of interest.
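A rough sketch of the comment-from-a-list idea, in plain Python rather than Ghidra scripting. The table `INTERRUPT_COMMENTS` and the helpers `comment_for` and `search` are hypothetical; a real version would load the list from a data file and write the text as comments in the listing.

```python
# Hypothetical lookup table keyed by (interrupt number, AH value).
INTERRUPT_COMMENTS = {
    (0x10, 0x00): "BIOS video: set video mode",
    (0x21, 0x09): "DOS: print string",
    (0x21, 0x30): "DOS: get DOS version",
    (0x21, 0x3D): "DOS: open file",
    (0x21, 0x3F): "DOS: read file",
    (0x21, 0x40): "DOS: write file",
    (0x21, 0x4C): "DOS: terminate program",
}


def comment_for(interrupt, ah):
    """Return the comment text for one interrupt call site."""
    fallback = "INT %02Xh AH=%02Xh (not in list)" % (interrupt, ah)
    return INTERRUPT_COMMENTS.get((interrupt, ah), fallback)


def search(keyword):
    """Return the (interrupt, AH) keys whose comment mentions the keyword."""
    needle = keyword.lower()
    return sorted(
        key for key, text in INTERRUPT_COMMENTS.items()
        if needle in text.lower()
    )


print(comment_for(0x21, 0x30))
print(search("open file"))
```

Because the comments are ordinary text, searching the listing for "open file" or "write file" would lead straight to the matching call sites.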
I can take a look at this in the coming days and extend my old PR with the extra steps mentioned by @ghidracadabra - hopefully I can get something working relatively soon.
Hi. I hope you'll find this script useful: https://gist.github.com/samunders-core/7ba8ea277da974d23f8a5f1cc4734ae2
I couldn't find any progress on this matter.
So I started my own version based on ResolveX86orX64LinuxSyscallsScript.java: https://github.com/Gravelbones/GhidraDosToolbox.
1. in `{$GHIDRA_HOME}/Ghidra/Processors/x86/data/languages/ia.sinc`, you will have to change the pcode implementation of `INT imm8` from `{ tmp:1 = imm8; intloc:$(SIZE) = swi(tmp); call [intloc]; } ` to `{ tmp:1 = imm8; swi(tmp); }`
Is there a reason this change isn't upstream? It's also not clear why `CALLOTHER_OVERRIDE_CALL` can't apply to `int imm8`. I would have assumed it replaces the existing instruction's pcode, but instead it crashes the decompiler.
We're working on replacing the scripts with a syscall analyzer that will work on many architectures, and this change (or possibly a slightly different change) will be included in that. There's some related discussion in #3936.
`CALLOTHER_OVERRIDE_CALL` doesn't change the pcode for the entire instruction, it just changes the opcode of the first `CALLOTHER` instruction to that of a `CALL`. It's generally better to not have any arguments passed to a `CALLOTHER` op you're planning to override in this way (hence the possibility of a change to my previous suggestion).
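A toy model of that behavior may help. The sketch below is plain Python with made-up dictionaries standing in for pcode ops; it is not how Ghidra represents pcode, and `apply_callother_override` is a hypothetical helper. It only shows the described effect: the first `CALLOTHER` becomes a `CALL` and everything else in the instruction is left as it was.

```python
def apply_callother_override(ops):
    """Toy model: flip the opcode of the first CALLOTHER to CALL only."""
    result = []
    overridden = False
    for op in ops:
        if not overridden and op["opcode"] == "CALLOTHER":
            op = dict(op, opcode="CALL")  # copy, so the input list is untouched
            overridden = True
        result.append(op)
    return result


# Old INT imm8 pcode: the swi userop has an output, then an indirect call.
old_int = [
    {"opcode": "COPY", "output": "tmp", "inputs": ["imm8"]},
    {"opcode": "CALLOTHER", "output": "intloc", "inputs": ["swi", "tmp"]},
    {"opcode": "CALLIND", "output": None, "inputs": ["intloc"]},
]

# Suggested INT imm8 pcode: the swi userop has no output and nothing follows.
new_int = [
    {"opcode": "COPY", "output": "tmp", "inputs": ["imm8"]},
    {"opcode": "CALLOTHER", "output": None, "inputs": ["swi", "tmp"]},
]

for name, ops in (("old", old_int), ("new", new_int)):
    print(name, [op["opcode"] for op in apply_callother_override(ops)])
```

In this model the old sequence still carries an output on the overridden op and a trailing indirect call, while the suggested sequence reduces to a single call.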
@ghidracadabra I tried the current `ResolveX86orX64LinuxSyscallsScript.java` script on a 32-bit Linux ELF executable. If the script is supposed to show labels in the disassembly, it found nothing.
There were two syscalls invoked with `int 0x80` (`sys_write` and, after that, `sys_read`). Out of the box, IDA labels the write but not the read, while Ghidra doesn't detect anything. The program simply outputs hello world, reads input, and exits.
That script was intended as an example for users to customize to their specific use cases, although I grant you that based on the name it really seems like it should just work in your case.
The 32-bit Linux syscalls resolved in that script are not called via an `int 0x80` but instead are called indirectly through the GS register (see the comments in the script for details). It should be possible to modify that script to work for `int 0x80` syscalls, but you'd also have to modify the definition of the `int` instruction in `ia.sinc`.
We're working on an extension of the syscalls stuff - basically replacing the script with an analyzer that will work for different processors/environments (there's some discussion in https://github.com/NationalSecurityAgency/ghidra/issues/3936) and making all of the necessary changes to the various language modules.
Note that the upstream text files for the interrupt list are provided at https://www.cs.cmu.edu/~ralf/files.html.
Describe the bug
In my DOS MZ executable from 1989 (attached), interrupt assembly such as...
Will decompile to code similar to the below.
As you can see, it loses important information, such as `AH=0x30`. That is basically a function number (that tells the interrupt what to do). Some interrupts also use additional registers to pass additional data to the interrupt.

I think all interrupts are affected. Below are the interrupts present in the attached executable. There's a variety, including BIOS and DOS interrupts.
Here are some resources for seeing what these interrupts do.
To Reproduce
Steps to reproduce the behavior:
[...] `swi()` style code it decompiles to
Expected behavior
It would be better if this decompiled to code that doesn't lose data. For example, something like...
Or even better, to human readable code.
The returns can be complicated. The above API call has the following returns, so I guess we'd need multiple variables? Or a struct of some kind?
I'm not sure exactly. Just throwing out ideas. Seeing as interrupts are basically defined functions, having them labeled well can give a lot of insight into what the program is doing, and can be a good starting point for reverse engineering.
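To illustrate the "struct of some kind" idea, here is a minimal Python sketch that decodes just the version part of the INT 21h, AH=30h result, where AL holds the major version and AH the minor version. `DosVersion` and `decode_dos_version` are hypothetical names, and the other return registers are left out.

```python
from typing import NamedTuple


class DosVersion(NamedTuple):
    """Hypothetical struct for the AL/AH part of the AH=30h result."""
    major: int
    minor: int


def decode_dos_version(ax: int) -> DosVersion:
    """Split AX into AL (low byte, major) and AH (high byte, minor)."""
    return DosVersion(major=ax & 0xFF, minor=(ax >> 8) & 0xFF)


version = decode_dos_version(0x1E03)  # AL=03h, AH=1Eh
print("DOS %d.%02d" % (version.major, version.minor))
```

A decompiled call could then read as one named function returning one structured value instead of several unrelated register variables.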
Screenshots
Attachments gold.zip
Environment (please complete the following information):
Additional context
related #304
If fixing this requires writing a bunch of definitions, I'd be willing to do that. I'd just need one or two examples.