NationalSecurityAgency / ghidra

Ghidra is a software reverse engineering (SRE) framework
https://www.nsa.gov/ghidra
Apache License 2.0

Better decompilation of DOS era interrupts #2266

Open RedDragonWebDesign opened 4 years ago

RedDragonWebDesign commented 4 years ago

Describe the bug

In my DOS MZ executable from 1989 (attached), interrupt assembly such as...

MOV AH,0x30
INT 0x21

will decompile to code similar to the following.

pcVar8 = (code *)swi(0x21);
bVar11 = (*pcVar8)();

As you can see, it loses important information, such as AH=0x30. That is basically a function number (that tells the interrupt what to do). Some interrupts also use additional registers to pass additional data to the interrupt.

I think all interrupts are affected. Below are the interrupts present in the attached executable. There's a variety, including BIOS and DOS interrupts.

image

Here are some resources for seeing what these interrupts do.

Expected behavior

It would be better if this decompiled to code that doesn't lose data. For example, something like...

pcVar8 = interrupt_0x21_0x30([args for that specific interrupt]);
// or
pcVar8 = dos_api_call(0x21, 0x30, [args]);

Or even better, to human readable code.

dosVersion = get_dos_version();
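To make the naming idea concrete, here is a minimal sketch of a lookup from (interrupt number, AH) to a readable name. The table name `DOS_API_NAMES`, the helper `describe_interrupt`, and the function names themselves are hypothetical; only the function numbers come from the public DOS interrupt list. It is plain Python and does not use any Ghidra API.

```python
# Hypothetical table: (interrupt number, AH value) -> readable name.
# The names are illustrative; the AH function numbers are the usual INT 21h ones.
DOS_API_NAMES = {
    (0x21, 0x09): "print_string",
    (0x21, 0x25): "set_interrupt_vector",
    (0x21, 0x30): "get_dos_version",
    (0x21, 0x35): "get_interrupt_vector",
    (0x21, 0x3D): "open_file",
    (0x21, 0x3E): "close_file",
    (0x21, 0x3F): "read_file",
    (0x21, 0x40): "write_file",
    (0x21, 0x4C): "terminate_with_return_code",
}


def describe_interrupt(int_no, ah=None):
    """Return a readable name, or a generic name that still keeps the numbers."""
    name = DOS_API_NAMES.get((int_no, ah))
    if name is not None:
        return name
    if ah is None:
        return f"interrupt_0x{int_no:02x}"
    return f"interrupt_0x{int_no:02x}_0x{ah:02x}"
```

With this table, `describe_interrupt(0x21, 0x30)` yields `get_dos_version`, and an unknown pair such as `(0x10, 0x00)` falls back to `interrupt_0x10_0x00`, so no information is lost even when the table is incomplete.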

The returns can be complicated. The above API call has the following returns, so I guess we'd need multiple variables? Or a struct of some kind?

Return:
AL = major version number (00h if DOS 1.x)
AH = minor version number
BL:CX = 24-bit user serial number (most versions do not use this)
---if DOS <5 or AL=00h---
BH = MS-DOS OEM number (see #01394)
---if DOS 5+ and AL=01h---
BH = version flag

I'm not sure exactly. Just throwing out ideas. Seeing as interrupts are basically defined functions, having them labeled well can give a lot of insight into what the program is doing, and can be a good starting point for reverse engineering.
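One way to picture the "struct of some kind" option is a small record built from the raw register values listed above. This is only a sketch: `DosVersion`, `decode_dos_version`, and the `requested_al` parameter are hypothetical names, and it assumes that the "AL=00h / AL=01h" conditions in the listing refer to the AL value passed in when the interrupt was invoked.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DosVersion:
    major: int                    # AL on return (0 means DOS 1.x)
    minor: int                    # AH on return
    serial: int                   # 24-bit serial number from BL:CX
    oem_number: Optional[int]     # BH, when it holds the OEM number
    version_flag: Optional[int]   # BH, when it holds the version flag


def decode_dos_version(al, ah, bh, bl, cx, requested_al=0x00):
    """Combine the return registers of INT 21h, AH=30h into one record.

    Assumption: requested_al is the AL value supplied on entry.
    """
    serial = ((bl & 0xFF) << 16) | (cx & 0xFFFF)
    oem_number = None
    version_flag = None
    if al >= 5 and requested_al == 0x01:
        version_flag = bh
    elif al < 5 or requested_al == 0x00:
        oem_number = bh
    # Any other combination is not covered by the listing, so both stay None.
    return DosVersion(al, ah, serial, oem_number, version_flag)
```

The point of the sketch is that BH changes meaning depending on the version and the request, which is why a single scalar return type is not enough for this call.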

Screenshots

image

image

Attachments gold.zip

Additional context

related #304

If fixing this requires writing a bunch of definitions, I'd be willing to do that. I'd just need one or two examples.

ghidracadabra commented 4 years ago

Apparently I missed #304 when it was submitted, apologies for that.

It should be possible to improve the decompilation using the features added in 9.1 for decompiling syscalls. The script ResolveX86orX64LinuxSyscallsScript.java and the system call exercise in ${GHIDRA_HOME}/docs/GhidraClass/Advanced/improvingDisassemblyAndDecompilation.pdf are a rough outline of what you need to do, but there are a few extra steps. Basically, you'll have to add/change some definitions that we've already added/changed for x64 and gcc.

  1. In `${GHIDRA_HOME}/Ghidra/Processors/x86/data/languages/ia.sinc`, you will have to change the pcode implementation of `INT imm8` from `{ tmp:1 = imm8; intloc:$(SIZE) = swi(tmp); call [intloc]; }` to `{ tmp:1 = imm8; swi(tmp); }`. In 9.1.2, this is on line 3363 of ia.sinc. You will have to restart Ghidra for this change to take effect. The old implementation was written before the syscall-handling features existed. The idea was to model the interrupt as getting a function pointer from "somewhere" and then calling the corresponding function. With the new features, the way to implement the pcode is to use the swi userop (basically, a user-defined pcode instruction with a name but no semantics), which will then be overridden with a CALLOTHER_OVERRIDE_CALL reference. It's important for the swi instruction to not have an output if you're going to override it.
  2. You might need to define a new calling convention for the interrupts in x86-16.cspec, analogous to the syscall calling convention defined in x86-64-gcc.cspec
  3. You will need to define an artificial overlay address space where you want the called functions to live. You don't have to write any code for them, but you need some kind of address as a place to store function signatures and to get cross references.
  4. You will have to determine the names and signatures of all of the functions that might be called and define each function somewhere in the overlay you've created. Using a function's number as the offset in the overlay space is a reasonable thing to do.
  5. You will have to determine the map from function numbers to functions. Then, using the symbolic propagator, you will have to figure out the value in AH at each instance of the INT 21 instruction. Using the map from numbers to functions, you will then apply the correct CALLOTHER_OVERRIDE_CALL reference at each instance.
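Steps 4 and 5 can be sketched with a toy model. The code below is not a Ghidra script and does not use the symbolic propagator; it only tracks constant loads into AH over a straight-line instruction list to show the shape of the mapping. `PROGRAM`, `INT21_FUNCTIONS`, and `resolve_int21_calls` are hypothetical names, and the function number doubles as the offset in an imagined overlay space, as suggested in step 4.

```python
# Toy instruction stream: (address, mnemonic, operands). Not Ghidra's data model.
PROGRAM = [
    (0x1000, "MOV", ("AH", 0x30)),
    (0x1002, "INT", (0x21,)),
    (0x1004, "MOV", ("AH", 0x4C)),
    (0x1006, "INT", (0x21,)),
    (0x1008, "MOV", ("AH", "BL")),  # non-constant source: AH becomes unknown
    (0x100A, "INT", (0x21,)),
]

# Hypothetical map from function number to function name (steps 4 and 5).
INT21_FUNCTIONS = {
    0x30: "get_dos_version",
    0x4C: "terminate_with_return_code",
}


def resolve_int21_calls(program):
    """Return {call-site address: (overlay offset, name)} for resolvable sites."""
    ah = None  # last known constant in AH, or None if unknown
    resolved = {}
    for address, mnemonic, operands in program:
        if mnemonic == "MOV" and operands[0] == "AH":
            source = operands[1]
            ah = source if isinstance(source, int) else None
        elif mnemonic == "INT" and operands[0] == 0x21:
            if ah in INT21_FUNCTIONS:
                resolved[address] = (ah, INT21_FUNCTIONS[ah])
            ah = None  # the interrupt may change AH, so forget it
    return resolved
```

In a real script, each entry in the result would become a CALLOTHER_OVERRIDE_CALL reference from the call site to the function defined at that offset in the overlay. The third `INT` is left unresolved because AH is not a known constant there, which is the case the symbolic propagator exists to handle properly.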

That's quite a bit. I would recommend getting an x64 Linux libc shared object, running `ResolveX86orX64LinuxSyscallsScript.java` on it, and observing it as it runs. It does basically everything you need to do, once you've supplied the correct definitions.

As for returning a struct split among multiple registers, you should be able to do this similar to the way the ldiv_t structure is returned in the "Multiple Storage Locations" exercise in improvingDisassemblyAndDecompilation.pdf. You'll have to enable custom storage for the function(s). Note that the decompiler might gag a little on code accessing fields of such a structure.

agatti commented 4 years ago

You may want to take a look at #1543, as that's more or less what I started to build.

RedDragonWebDesign commented 4 years ago

Thanks for the detailed comments. I'm not advanced enough to write Ghidra code to improve the decompilation of these DOS era interrupts myself.

But if somebody else wrote the engine, I'd be happy to write the data. That is, I could help translate the list of interrupts from one of these websites, into whatever data file or object we need for our code.

Looks like it's fairly complicated. Even a simple interrupt like INT 21h, AH=30h, getDosVersion() has the following complex features:

image

We can ignore a lot of that complexity and, in the beginning, focus on providing the most common use cases. Then we can let open source/crowdsourcing kick in and people can come along and improve it, as needed.

Here's a rough draft of a schema we could use for our data file/data object.

image

Maybe @agatti would be interested in expanding his engine to accept a data file/object, and we could add interrupts to the data file over time.

Anyway, just throwing some ideas out there.

RedDragonWebDesign commented 4 years ago

Having Ghidra comment the interrupts based on a list is also an option. This is what IDA used to do. This could be a good band-aid solution if the above is too time consuming to implement right away.

This would provide search functionality, so you could search for things like "open file", "write file", and other areas of interest.

image
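As a rough illustration of the comment-based approach, the sketch below attaches a description to each interrupt call site and then searches those descriptions. The comment strings, `INTERRUPT_COMMENTS`, `annotate`, and `search` are all hypothetical, and the call sites are assumed to have been collected already as (address, interrupt, AH) tuples.

```python
# Hypothetical descriptions keyed by (interrupt number, AH value).
INTERRUPT_COMMENTS = {
    (0x21, 0x30): "DOS: get DOS version",
    (0x21, 0x3D): "DOS: open file",
    (0x21, 0x3F): "DOS: read file",
    (0x21, 0x40): "DOS: write file",
}


def annotate(call_sites):
    """Map each call-site address to a comment; unknown pairs keep their numbers."""
    comments = {}
    for address, int_no, ah in call_sites:
        text = INTERRUPT_COMMENTS.get((int_no, ah))
        if text is None:
            text = f"INT {int_no:02X}h, AH={ah:02X}h (no description available)"
        comments[address] = text
    return comments


def search(comments, needle):
    """Return the sorted addresses whose comment contains needle (case-insensitive)."""
    needle = needle.lower()
    return sorted(addr for addr, text in comments.items() if needle in text.lower())
```

For example, after annotating a few call sites, `search(comments, "open file")` returns only the addresses of the AH=3Dh calls, which is the kind of lookup described above.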

agatti commented 4 years ago

I can take a look at this in the coming days and extend my old PR with the extra steps mentioned by @ghidracadabra - hopefully I can get something working relatively soon.

samunders-core commented 4 years ago

Hi. I hope you'll find this script useful: https://gist.github.com/samunders-core/7ba8ea277da974d23f8a5f1cc4734ae2

Gravelbones commented 2 years ago

I couldn't find any progress on this matter.

So I started my own version based on ResolveX86orX64LinuxSyscallsScript.java: https://github.com/Gravelbones/GhidraDosToolbox.

ariscop commented 2 years ago

> 1. in `${GHIDRA_HOME}/Ghidra/Processors/x86/data/languages/ia.sinc`, you will have to change the pcode implementation of `INT imm8` from
>    `{ tmp:1 = imm8; intloc:$(SIZE) = swi(tmp); call [intloc]; }`
>    to
>    `{ tmp:1 = imm8; swi(tmp); }`

Is there a reason this change isn't upstream? It's also not clear why CALLOTHER_OVERRIDE_CALL can't apply to the existing int imm8 definition. I would have assumed it replaces the existing instruction's pcode, but instead it crashes the decompiler.

ghidracadabra commented 2 years ago

We're working on replacing the scripts with a syscall analyzer that will work on many architectures, and this change (or possibly a slightly different change) will be included in that. There's some related discussion in #3936.

CALLOTHER_OVERRIDE_CALL doesn't change the pcode for the entire instruction, it just changes the opcode of the first CALLOTHER instruction to that of a CALL. It's generally better to not have any arguments passed to a CALLOTHER op you're planning to override in this way (hence the possibility of a change to my previous suggestion).
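The behavior described here can be pictured with a toy model. The dictionaries below are a made-up stand-in for pcode ops, not Ghidra's representation, and `apply_callother_override` is a hypothetical helper. It only shows that the override touches the first CALLOTHER and leaves every other op in place.

```python
# Toy pcode for the old INT imm8 definition: swi has an output, then an indirect call.
OLD_INT = [
    {"opcode": "COPY", "output": "tmp", "args": ["imm8"]},
    {"opcode": "CALLOTHER", "userop": "swi", "output": "intloc", "args": ["tmp"]},
    {"opcode": "CALLIND", "output": None, "args": ["intloc"]},
]

# Toy pcode for the suggested definition: swi has no output and nothing follows it.
NEW_INT = [
    {"opcode": "COPY", "output": "tmp", "args": ["imm8"]},
    {"opcode": "CALLOTHER", "userop": "swi", "output": None, "args": ["tmp"]},
]


def apply_callother_override(pcode_ops, target):
    """Change only the first CALLOTHER into a CALL to target; keep the rest as is."""
    result = [dict(op) for op in pcode_ops]  # copy so the input list is not modified
    for op in result:
        if op["opcode"] == "CALLOTHER":
            op["opcode"] = "CALL"
            op["target"] = target
            break
    return result
```

In this model, overriding `OLD_INT` leaves a CALL that still has an output, followed by the original CALLIND, while overriding `NEW_INT` leaves a single CALL. That is a plausible picture of why the old definition does not combine well with the override, though the actual cause of the decompiler crash is not stated in this thread.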

peltax commented 2 years ago

@ghidracadabra I tried the current ResolveX86orX64LinuxSyscallsScript.java script on a 32-bit Linux ELF executable. If the script is supposed to add labels in the disassembly, it found nothing.

There were two syscalls invoked with int 0x80 (sys_write followed by sys_read). IDA labels the write but not the read out of the box, whereas Ghidra doesn't detect either. The program simply prints hello world, reads input, and exits.

ghidracadabra commented 2 years ago

That script was intended as an example for users to customize to their specific use cases, although I grant you that based on the name it really seems like it should just work in your case.

The 32-bit Linux syscalls resolved in that script are not called via an int 0x80 but instead are called indirectly through the GS register (see the comments in the script for details). It should be possible to modify that script to work for int 0x80 syscalls, but you'd also have to modify the definition of the int instruction in ia.sinc.

We're working on an extension of the syscalls stuff - basically replacing the script with an analyzer that will work for different processors/environments (there's some discussion in https://github.com/NationalSecurityAgency/ghidra/issues/3936) and making all of the necessary changes to the various language modules.

SamB commented 1 year ago

Note that the upstream text files for the interrupt list are provided at https://www.cs.cmu.edu/~ralf/files.html.