Feature Request: Add support for unified syntax (UAL) in v4t (and possibly v5te)

WhenGryphonsFly commented 4 months ago

I am currently involved in decompiling Pokémon Pinball: Ruby and Sapphire for the Game Boy Advance. Despite using the v4t instruction set, we use unified syntax for disassembled functions. After a cursory look at other pret projects, the use of unified syntax is common among the GBA projects but not the DS projects.

As a second point of reference, I checked the GBADev Discord server (where I mostly lurk rather than contribute). The general sentiment appears to be that unified syntax is preferred to divided syntax, on the grounds that (a) it is the same between Arm and Thumb, (b) ARM explicitly allows it for pre-v6t2 processors, and (c) it fixes an issue with the condition in some mnemonics. (For example, the Load Register Halfword opcode is LDRH<cond> in unified syntax, but is LDR<cond>H in divided syntax.)
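To illustrate point (c), here is a minimal Rust sketch of how the condition code moves relative to the size suffix between the two syntaxes. The `Syntax` enum and `load_store_mnemonic` function are hypothetical names made up for this example; they are not part of unarm's API.

```rust
/// Which assembly syntax to print mnemonics in (hypothetical type for illustration).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Syntax {
    /// Unified syntax (UAL): the size suffix comes before the condition.
    Unified,
    /// Divided (pre-UAL) syntax: the condition comes before the size suffix.
    Divided,
}

/// Builds a load/store mnemonic from its parts.
///
/// `base` is the root mnemonic (e.g. "ldr"), `cond` is the condition code
/// (e.g. "eq", or "" for always), and `size` is the size suffix (e.g. "h").
pub fn load_store_mnemonic(base: &str, cond: &str, size: &str, syntax: Syntax) -> String {
    match syntax {
        Syntax::Unified => format!("{}{}{}", base, size, cond),
        Syntax::Divided => format!("{}{}{}", base, cond, size),
    }
}
```

With these assumptions, `load_store_mnemonic("ldr", "eq", "h", Syntax::Unified)` gives `ldrheq`, while `Syntax::Divided` gives `ldreqh`.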

Given that it is somewhat common - or at the very least not unheard of - to use unified syntax for GBA projects, I would appreciate the ability to disassemble v4t code using unified syntax. Whether this is accomplished by adding a flag, adding a v4t-ual specification, or some other mechanism is not of much concern to me. (Ideally it would be supported on decomp.me via objdiff, but that would likely need to be a separate discussion.)

AetiasHax commented 4 months ago

Just thinking out loud here, but I think it's a good idea to add support for passing flags to the instruction parser in general, whether the flag is "use UAL" or something else, like storing IT blocks for Thumb-2 support.

On the other hand, I think the reason DS decompilations use divided syntax is due to using the old MW toolchain. The MW compiler is required of course, but I don't see any reason not to use a modern assembler and linker. It might even be beneficial to edit the assembler/linker source code. So from that perspective, maybe the correct move is to use UAL syntax in all specifications?

WhenGryphonsFly commented 4 months ago

> Just thinking out loud here, but I think it's a good idea to add support for passing flags to the instruction parser in general, whether the flag is "use UAL" or something else, like storing IT blocks for Thumb-2 support.

Upon thinking about it more, flags are probably the way to go - maybe a configuration file instead, but probably not a separate specification. Two other configuration options I can think of are the register naming convention (e.g., r9 vs v6 vs tr vs sb) and endianness (I used the same byte order as you when I made my YAML spec, but the GBA uses little-endian instructions). I see register names are defined elsewhere, but endianness is baked into the spec right now and I wouldn't want to be the one maintaining four different YAML files, one for each syntax/endianness combination.
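As a rough sketch of why endianness can be a runtime option rather than something baked into a spec file: the same four bytes can be turned into an instruction word either way before decoding. The `Endian` enum and `read_arm_word` function below are hypothetical and only illustrate the idea; they do not describe how unarm or its YAML specs actually handle byte order.

```rust
/// Byte order of the instruction stream (hypothetical type for illustration).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Endian {
    Little,
    Big,
}

/// Assembles a 32-bit Arm instruction word from four raw bytes.
pub fn read_arm_word(bytes: [u8; 4], endian: Endian) -> u32 {
    match endian {
        Endian::Little => u32::from_le_bytes(bytes),
        Endian::Big => u32::from_be_bytes(bytes),
    }
}
```

For example, the Arm encoding of `bx lr` is `0xe12fff1e`, which a little-endian target such as the GBA stores as the bytes `1e ff 2f e1`.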

> On the other hand, I think the reason DS decompilations use divided syntax is due to using the old MW toolchain.

That's my understanding as well, and that's also probably why GBA decompilations use unified syntax: the GCC toolchain emits unified by default. (Although GCC inline asm still uses divided by default, presumably for backwards compatibility.)

> So from that perspective, maybe the correct move is to use UAL syntax in all specifications?

I mean, you already (accidentally?) do this for v6k, which predates v6t2 and thus UAL. I don't see a problem with UAL-by-default and a flag to switch to divided, but I'm the one asking for the change so I'm not exactly unbiased.

AetiasHax commented 4 months ago

> Two other configuration options I can think of are the register naming convention (e.g., r9 vs v6 vs tr vs sb) and endianness.

R9 naming should be simple as that is outside the instruction parser, though the Rust enum variant would still be called Register::R9 if that's fine. Endianness is already supported via the Parser struct, which objdiff uses.

> That's my understanding as well, and that's also probably why GBA decompilations use unified syntax: the GCC toolchain emits unified by default. (Although GCC inline asm still uses divided by default, presumably for backwards compatibility.)

Do you know if GBA decomps enable UAL in inline asm? If not, would it be bothersome if unarm only supports UAL? That would be my preferred course of action, just to keep the implementation simple.

WhenGryphonsFly commented 4 months ago

> ... though the Rust enum variant would still be called Register::R9 if that's fine.

Perfectly fine by me. As I alluded to, my use case would be decomp.me via objdiff, so it doesn't matter to me in the slightest what the enum calls it.

> Endianness is already supported via the Parser struct, which objdiff uses.

Ah, didn't see it, sorry. Rust syntax only started making sense to me a couple of days ago, so I've done more black-box testing than closely reading the code. (Although why I didn't think to type "endian" into the search bar is beyond me.)

> Do you know if GBA decomps enable UAL in inline asm?

Yes, it only defaults to divided syntax. You can use the directive .syntax unified at the top of any asm block (and indeed, pret GBA decomps define a macro asm_unified(x) that amounts to placing .syntax unified at the beginning of the block).

AetiasHax commented 4 months ago

Alright, so then my plan is to convert all specs to UAL syntax, then provide flags for naming R9. Thanks for your input!

WhenGryphonsFly commented 4 months ago

> then provide flags for naming R9

To be clear, r9 is merely the worst case, having four distinct names. Each register has at least two names, so it may be wise to have flags refer to different naming conventions rather than different names. I am aware of at least three different naming conventions: one which uses numbers for everything except pc (r0-r14, pc), the one used by early Thumb manuals (r0-r12, sp, lr, pc), and the one used by GCC (r0-r9, sl, fp, ip, sp, lr, pc). I'm not actually aware of any convention that uses the a/v register naming scheme, but the names are considered valid for assembly.

These two webpages combined list every name I am aware of:

AetiasHax commented 4 months ago

I did some digging, and it seems that the register naming depends on the calling standard and platform. Here's a summary:

r9 = sb when using position-independent data (PID)
r9 = tr when using thread-local storage
r10 = sl when using explicit stack limits
r11 = fp when using frame pointers
r12 = ip when using interworking or long branches

To make it as customizable as possible, I'll add flags for these five register names and one for a/v. As for SP and LR, I couldn't find a source which calls them r13 and r14, so they should be called sp and lr by default in my opinion. But I can add a flag to change the name if it's needed.

Edit: r13 and r14 are allowed, but what I meant was that SP and LR have always served the same purpose as far as I know.
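The flag scheme described above could look roughly like the following Rust sketch. All names here (`R9Use`, `RegisterNames`, `register_name`) are hypothetical and chosen for illustration; this is not the option set that unarm or objdiff ended up shipping. The sketch also assumes that a role-specific name (sb, tr, sl, fp, ip) takes priority over the a/v names when both are enabled, which the thread does not specify.

```rust
/// How r9 is used on the target platform (hypothetical type for illustration).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum R9Use {
    /// No special role: printed as r9 (or v6 with a/v names).
    General,
    /// Static base for position-independent data: printed as sb.
    StaticBase,
    /// Thread register for thread-local storage: printed as tr.
    ThreadRegister,
}

/// Register naming flags, one per special-purpose name plus one for a/v.
#[derive(Clone, Copy, Debug)]
pub struct RegisterNames {
    pub r9: R9Use,
    /// Print r10 as sl (stack limit).
    pub sl: bool,
    /// Print r11 as fp (frame pointer).
    pub fp: bool,
    /// Print r12 as ip (intra-procedure-call scratch).
    pub ip: bool,
    /// Print r0-r3 as a1-a4 and r4-r11 as v1-v8.
    pub av: bool,
}

/// Returns the display name of register `reg`, expected to be in 0..=15.
pub fn register_name(reg: u8, opts: RegisterNames) -> String {
    match reg {
        // Role-specific names are checked first (assumed priority).
        9 if opts.r9 == R9Use::StaticBase => "sb".to_string(),
        9 if opts.r9 == R9Use::ThreadRegister => "tr".to_string(),
        10 if opts.sl => "sl".to_string(),
        11 if opts.fp => "fp".to_string(),
        12 if opts.ip => "ip".to_string(),
        // sp, lr and pc are always named, as proposed in the comment above.
        13 => "sp".to_string(),
        14 => "lr".to_string(),
        15 => "pc".to_string(),
        // Argument/variable names, if enabled.
        0..=3 if opts.av => format!("a{}", reg + 1),
        4..=11 if opts.av => format!("v{}", reg - 3),
        // Everything else falls back to the plain numbered name.
        _ => format!("r{}", reg),
    }
}
```

Under these assumptions, the GCC-style convention mentioned earlier (r0-r9, sl, fp, ip, sp, lr, pc) corresponds to `r9: R9Use::General` with `sl`, `fp`, and `ip` set and `av` cleared.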

WhenGryphonsFly commented 4 months ago

> I couldn't find a source which calls them r13 and r14

That "convention" came from capstone, which has an option to use all numbers (except for pc). I only used it because by default capstone used names for r9 onwards, and I decided I'd rather see r13 and r14 instead of seeing names that aren't relevant on the GBA[^1]. However, if you use flags for r9-r12, there probably isn't a need for this convention; capstone, in their infinite wisdom, only has[^2] two conventions: "all names" and "all numbers".

[^1]: AFAICT the only name that might be used on the GBA is fp for r11 (other than sp and lr, of course). Even then I can't confirm that because it appears that if you don't use r11 as fp, GCC just decides to never use r11.

[^2]: At least, at the time it was "all names" and "all numbers". They were between versions when I last used it, and looking at the source code it seems that they've changed "all names" to "some names". I'm not reinstalling it to confirm that, though, and it seems like there are still only two options.

AetiasHax commented 4 months ago

Classic capstone moment. Well in that case, I'll be adding flags for r9-12 plus a/v register names. UAL implementation is done, but I'll lump in the flags in the same release, then close this issue once we have a way to set these flags in objdiff.

AetiasHax commented 3 months ago

Unified syntax support and register name options are now added in objdiff v2.0.0-beta1! You can configure it in the Arch Settings menu.

WhenGryphonsFly commented 3 months ago

Great! I look forward to using it soon! (Hopefully using it soon; distro hopping has not been going smoothly at all, as may be evident from the late reply.)

AetiasHax / unarm, issue #3