fonic / wcdatool

Watcom Disassembly Tool (wcdatool) - Tool to aid disassembling DOS applications created with the Watcom toolchain
69 stars 7 forks source link

MK1.EXE missing/incorrect function and/or branch labels #15

Open palmerj opened 2 weeks ago

palmerj commented 2 weeks ago

I’ve noticed that several function or branch names are missing, with the previous symbol name being propagated downwards. For example, in the decompilation of MK1.EXE, within MAIN.ASM, the DO_A11_BACKGROUND symbol seems to extend beyond its intended scope. Specifically, DO_A11_BACKGROUND should represent more than one branch, but it continues to be used down to DO_A11_BACKGROUND_branch_41, causing the symbols for branches below it to be lost.

Could this be an issue with the decompilation tool, or is it possible that the compiler optimised these symbols out, preventing them from being included? I did a quick search for strings in the executable and suspect it might be the latter, but I’d appreciate your insights.

Thanks in advance for your help!

fonic commented 2 weeks ago

Hi there,

you kind of guessed right. It's not so much that names are missing, it's more that they aren't there to begin with.

Lets stick to MK1.EXE as an example: In MK1.EXE_wdump_output_plain.txt, starting with Global Info (section 0), you can see what the MK1.EXE executable, more specifically the debug info it contains, provides. That's what wcdatool uses as a basis. It then processes the code from offset 0 to the end and labels branches based on the last known named label from the debug info. That's how labels like DO_A11_BACKGROUND_branch_1 are generated.

Also, named labels (i.e. those from debug info) only come with a start offset, but not with an end offset (or a size indicator). Thus, there is no reliable information available regarding how long function DO_A11_BACKGROUND actually is. When you say it "extends beyond its scope", I guess you are referring to offset 0xcfbf after DO_A11_BACKGROUND_branch_3 which seems like a natural end. That could well be, but then there is no label in the debug info for the offset that follows, 0xcfc0, thus I'd assume that this is actually not the end of that function. But only the original sources could tell the real story - I would love to have access to those to have a better understanding and a base for comparison, but sadly they haven't surfaced yet.

To cut a long story short: Sadly, debug info only gets you so far, and in the end lots of heuristics and guesses are required when decompiling/disassembling. But I'd say what wcdatool outputs is pretty good (as in browseable/readable) compared to what other tools that do not know how to interpret that old debug info spit out.

fonic commented 2 weeks ago

Could this be an issue with the decompilation tool, or is it possible that the compiler optimised these symbols out, preventing them from being included? I did a quick search for strings in the executable and suspect it might be the latter, but I’d appreciate your insights.

To quickly answer that: 1) It's not a failure of the tool per se, it just uses what's available and 'smart-guesses' what's missing. 2) Debug info is usually configurable, so the Watcom compiler might provide some switches to specify what ends up in the debug info and what is omitted (I never investigated that). In general, compilers themselves don't care for labels at all, this is purely for the benefit of humans. Meaning that if it were up to the compiler, no debug info would be generated at all as it is not needed for software execution later on.

palmerj commented 2 weeks ago

Thank you for your detailed reply—much appreciated. So it seems the issue might be related to the compile flags they used. It’s a shame about MK1, as the full arcade source has leaked for that version yet. On the other hand, MK2 doesn’t appear to suffer from the same reduction in symbol information. When I look at mains.mk2.asm, it seems to retain many more function names, possibly all of them.

fonic commented 2 weeks ago

Thank you for your detailed reply—much appreciated.

You're welcome. It's always an interesting topic to talk about.

So it seems the issue might be related to the compile flags they used.

Well, maybe. Hard to say. The debug info of MK1.EXE contains named labels for all existing functions - but only for actual functions (as in beginnings of structured code blocks). I'd assume the labels within functions were either omitted due to a configuration setting -or- only labels with a certain 'decorator' (what we would call it nowadays) got picked up and transferred to debug info in the first place. That's my best guess.

With MK1, there is also one major difference: it was originally written in Assembler, not C. It was a direct port of the arcade version, which can be deducted from object 2, which contains variables that emulate the arcade machine's register set (offset 0x24b40) via RAM. Due to that fact, the compiler knew substantially less about the original code's structure and context, as Assembler is more of an afterthought for C compilers like Watcom.

It’s a shame about MK1, as the full arcade source has leaked for that version yet. On the other hand, MK2 doesn’t appear to suffer from the same reduction in symbol information. When I look at mains.mk2.asm, it seems to retain many more function names, possibly all of them.

MK2, iirc, was written in C, thus the compiler had much more context information to work with and thus to put into debug info. That's why MK1 and MK2 are fundamentally different in terms of disassembling.

palmerj commented 2 weeks ago

I've taken a closer look at the decompilation output, and it’s clear that many functions are not correctly named and they are they're not just labels for branches. However, these misnamed functions seem to be internal to the compiled module. It appears that only the function symbols with linkages to other compiled modules retain their names.

For instance, here are the function symbols in MAIN.ASM that are correctly named:

P1_START_BUTTON P2_START_BUTTON POST_CHOP_ENTRY WHO_IS_NEXT DO_A11_BACKGROUND INIT_PLAYER_VARIABLES FLASH_PMSG OFFSET_TO_OCHAR AMODE_DEMO_GAME REPELL

All of these functions are either called or jumped to from other modules.

MK2, iirc, was written in C, thus the compiler had much more context information to work with and thus to put into debug info. That's why MK1 and MK2 are fundamentally different in terms of disassembling.

Are you sure that core of it is written in C? When I look at the decomp of mains.mk2.asm it very much seems like MK1 style port from TMS34010 to x86 ASM. I would have thought that the variable names and ASM structure would different if it was compiled C

fonic commented 2 weeks ago

I've taken a closer look at the decompilation output, and it’s clear that many functions are not correctly named and they are they're not just labels for branches. However, these misnamed functions seem to be internal to the compiled module. It appears that only the function symbols with linkages to other compiled modules retain their names.

Which functions do you consider to be named incorrectly and what makes you come to that conclusion?

MK2, iirc, was written in C, thus the compiler had much more context information to work with and thus to put into debug info. That's why MK1 and MK2 are fundamentally different in terms of disassembling.

Are you sure that core of it is written in C? When I look at the decomp of mains.mk2.asm it very much seems like MK1 style port from TMS34010 to x86 ASM. I would have thought that the variable names and ASM structure would different if it was compiled C

I might have been thinking about a different project. It's been several years since had a good look at MK disassembly (started this project over 10 years ago). Looking at it again now, MK2 also was Assembler (e.g. same A2-A14 + B0-B14 register emulation in RAM), albeit with a very different structure than MK1 (e.g. code in data object).

palmerj commented 2 weeks ago

Which functions do you consider to be named incorrectly and what makes you come to that conclusion?

There are too many to list, but one example is WHO_IS_ALONE. This function is actually very simple and short:

WHO_IS_ALONE:
    c9de:   f7 05 c8 4b 02 00 ff ff ff ff   test   DWORD PTR ds:@obj2:P1_STATE,0xffffffff              ; aliases: P1_STATE, SWSTST; fixup: num: 2266, src obj: 1, src ofs: 0xc9e0, dst obj: 2, dst ofs: 0x24bc8
    c9e8:   75 0e                   jne    WHO_IS_ALONE_branch_1
    c9ea:   f7 05 2c 4c 02 00 ff ff ff ff   test   DWORD PTR ds:@obj2:P2_STATE,0xffffffff              ; fixup: num: 2267, src obj: 1, src ofs: 0xc9ec, dst obj: 2, dst ofs: 0x24c2c
    c9f4:   75 02                   jne    WHO_IS_ALONE_branch_1
    c9f6:   f9                      stc    
    c9f7:   c3                      ret    
WHO_IS_ALONE_branch_1:
    c9f8:   f7 05 c8 4b 02 00 ff ff ff ff   test   DWORD PTR ds:@obj2:P1_STATE,0xffffffff              ; aliases: P1_STATE, SWSTST; fixup: num: 2268, src obj: 1, src ofs: 0xc9fa, dst obj: 2, dst ofs: 0x24bc8
    ca02:   75 13                   jne    WHO_IS_ALONE_branch_2
    ca04:   f7 05 2c 4c 02 00 ff ff ff ff   test   DWORD PTR ds:@obj2:P2_STATE,0xffffffff              ; fixup: num: 2149, src obj: 1, src ofs: 0xca06, dst obj: 2, dst ofs: 0x24c2c
    ca0e:   74 1a                   je     WHO_IS_ALONE_branch_3
    ca10:   b8 02 00 00 00          mov    eax,0x2
    ca15:   eb 15                   jmp    WHO_IS_ALONE_branch_4
WHO_IS_ALONE_branch_2:
    ca17:   f7 05 2c 4c 02 00 ff ff ff ff   test   DWORD PTR ds:@obj2:P2_STATE,0xffffffff              ; fixup: num: 2150, src obj: 1, src ofs: 0xca19, dst obj: 2, dst ofs: 0x24c2c
    ca21:   75 07                   jne    WHO_IS_ALONE_branch_3
    ca23:   b8 01 00 00 00          mov    eax,0x1
    ca28:   eb 02                   jmp    WHO_IS_ALONE_branch_4
WHO_IS_ALONE_branch_3:
    ca2a:   33 c0                   xor    eax,eax
WHO_IS_ALONE_branch_4:
    ca2c:   f8                      clc    
    ca2d:   c3                      ret   

However, the WHO_IS_ALONE label continues incorrectly until WHO_IS_ALONE_branch_23. For example, WHO_IS_ALONE_branch_5 should actually be labelled GAME_FINISHED, and further down, WHO_IS_ALONE_branch_8 should be GAME_OVER. I know this because MK2 has the function name and was built on the MK1 source, and the instructions and calls align with these functions.

fonic commented 2 weeks ago

I see what you mean. Interesting, nice find.

I never did do a comprehensive deep-dive into MK's disassembly, as my attention at some point shifted towards developing my then MK-only disassembly tool further into a multi-purpose one (which is what wcdatool is now). I planned on resuming code analysis later on, but then a couple of years later actual MK sources surfaced (PlayStation port, iirc), thus further probing the disassembly became a lot less appealing.

From a quick look at my notes from back then, I can tell that at some point I hypothesized that MK1's debug info might either be corrupted, truncated/incomplete or limited to exported symbols only (which is essentially the same as your "only the function symbols with linkages to other compiled modules" assumption).

So yes, those labels can be considered incorrect, but there is not a lot that could be done to improve the tool's handling of those (other than resorting to labeling them as unkown_branch_<index> or similar).


Be aware though that wcdatool was designed with manual intervention/correction in mind: Simply edit MK1's hints file Hints/MK1.EXE.txt and add a Global Info section containing the missing symbols, e.g.

                           Global Info (section 1)
==============================================================================

  Name:  GAME_FINISHED
    address      = 0001:0000ca2e
    module index = 12
    kind:          (code)
  Name:  GAME_OVER
    address      = 0001:0000cb9e
    module index = 12
    kind:          (code)

(see output file MK1.EXE_wdump_output_plain.txt for correct syntax). Make sure to increment the section index, otherwise the original section gets overridden.

palmerj commented 2 weeks ago

Do you have an email that I can contact you?

fonic commented 2 weeks ago

Further discussion moved to email.