UsernameFodder / pmdsky-debug

Debug info for reverse engineering PMD: Explorers of Sky
GNU General Public License v3.0
38 stars 20 forks source link

Add instruction immediates to header files #258

Closed Frostbyte0x70 closed 3 months ago

Frostbyte0x70 commented 3 months ago

literals.yml lists values that are embedded into instructions as literals. However, those values are not documented anywhere in the header files. I think it could be helpful to have those explicitly declared with their own type in the headers. I named the type sh_imm, short for "shifted immediate", which I thought was too long. Suggestions for alternatives are welcome though.

I also added the DEBUG_TEXT_SPEED literal and renamed TEXT_SPEED to REGULAR_TEXT_SPEED. Sorry, I saw that only one of the two was documented and couldn't resist.

This PR is created on top of #257, so it can't be merged yet. I'll update it once that one is merged.

TODO: Sort the newly added symbols in the header files. Calculating their exact position is annoying since I have to compare the definitions in literals.yml with the ones in .yml, so I'm letting the automated check do it for me.

Frostbyte0x70 commented 3 months ago

Alright, sync-check doesn't like that symbols in the .h file are not defined in the corresponding .yml file. I guess we'll have to figure out how to deal with that. Maybe just get rid of literals.yml entirely and include these values as regular symbols? The new type can be used to know they are immediate values, so placing them on a separate file might not be necessary anymore.

Frostbyte0x70 commented 3 months ago

I'm not a fan of this change. The reason I cordoned all the immediate values to their own separately symbol table file is so that they wouldn't appear in the generated symbol tables, since they aren't "real"/valid symbols that any code analysis tool (Ghidra, No$ debugger) knows how to deal with properly.

Semantically, it doesn't make sense to label these as a "C type" at all. I don't think any compiler would see these declarations and generate valid code if you tried to reference the actual data they link to. So that would have negative implications for c-of-time (and any other project that relies on the C headers being valid), for example.

Why do you consider the symbols to be invalid? They are embedded into an instruction, but they are still valid 2-byte data (or, well, 1.5-byte data, I guess). Does anything really break just for defining these?

I can't think of a way to achieve this kind of labeling that isn't fragile. The representation of any given immediate value is instruction-dependent, isn't it? An immediate for a mov instruction might be encoded in a different way than that for an add or a ldr instruction. Does your code just assume they're all movs?

I don't think so. All instructions that have an immediate as their value use the last 3 nibbles to encode it. That includes mov, add, sub, tst, etc. Instructions that use immediates for addressing (such as ldr rX, [rY, offset]) don't use that format, but the immediate there does not carry any meaningful data.

For your changes, is it not possible to just assume every symbol defined in literals.yml is a shifted immediate type, and use that to implicitly interpret those addresses as the sh_imm type without needing to bake that into the C headers?

That's the other possibility, yes. I thought about adding the info to the C headers in case it was helpful, but if it's going to cause more harm than good, I can just implement this at the pmdsky-debug-py level. I wouldn't mind.

UsernameFodder commented 3 months ago

I can't think of a way to achieve this kind of labeling that isn't fragile. The representation of any given immediate value is instruction-dependent, isn't it? An immediate for a mov instruction might be encoded in a different way than that for an add or a ldr instruction. Does your code just assume they're all movs?

I don't think so. All instructions that have an immediate as their value use the last 3 nibbles to encode it. That includes mov, add, sub, tst, etc. Instructions that use immediates for addressing (such as ldr rX, [rY, offset]) don't use that format, but the immediate there does not carry any meaningful data.

Hmm, I guess this is more uniform than I thought. The manual classifies these all as "data processing instructions" that use the same encoding scheme for all fields, not just the immediate value.

Why do you consider the symbols to be invalid? They are embedded into an instruction, but they are still valid 2-byte data (or, well, 1.5-byte data, I guess). Does anything really break just for defining these?

Well, 1.5 bytes is very different from 2 bytes. The latter can be mapped to a C short, whereas the former needs to be encoded more specially.

I think I would be okay with this if we defined the type to more precisely map to the actual machine code bytes. And we should declare all instructions as const. Something like this:

// A raw ARMv5 data-processing instruction, such as MOV, ADD, AND, CMP, etc.
// See the ARMv5 Architecture Reference Manual, Section A3.4.1
// https://developer.arm.com/documentation/ddi0100/latest/
struct data_processing_instruction {
    // second source operand, either a shifted immediate value or a register, see Section A5.1
    uint32_t shifter_operand : 12;
    uint32_t rd : 4; // destination register
    uint32_t rn : 4; // first source operand register
    uint32_t s : 1; // status flag, set if the instruction updates the status registers
    uint32_t opcode : 4; // see Section A3.4, Table A3-2
    uint32_t i : 1; // immediate flag, set if shifter_operand represents an immediate
    uint32_t _zero : 2; // always 0
    uint32_t cond : 4; // condition code, see Section A3.2
};
ASSERT_SIZE(struct data_processing_instruction, 4);

extern const struct data_processing_instruction JUICE_BAR_NECTAR_IQ_GAIN;
...

We should also go over all the addresses defined in literals.yml and make sure they correspond to the offset of the start of the instruction, and not one of the bytes halfway through. Though, given that all the addresses are 4-byte aligned, I suspect this is already the case? Probably because shifter_operand happens to come first. We should also make sure the lengths are specified as 4 bytes, since we'll be typing them with a 4-byte struct. I see that JUICE_BAR_NECTAR_IQ_GAIN has a length of 1, so we should update that.

Frostbyte0x70 commented 3 months ago

I think I would be okay with this if we defined the type to more precisely map to the actual machine code bytes. That could work. Although the symbol wouldn't be 100% accurate anymore. For example, JUICE_BAR_NECTAR_IQ_GAIN no longer refers to the immediate value on that address, but rather to the whole instruction. Does it make sense to document that as a symbol?

And we should declare all instructions as const. Well, technically, they can be modified while in RAM. No one is stopping you from doing that.

We should also go over all the addresses defined in literals.yml and make sure they correspond to the offset of the start of the instruction, and not one of the bytes halfway through. Though, given that all the addresses are 4-byte aligned, I suspect this is already the case? Probably because shifter_operand happens to come first. Yeah, the operand comes first, so I don't think any changes are needed here.

I see that JUICE_BAR_NECTAR_IQ_GAIN has a length of 1, so we should update that. Huh, you're right. It's the only literal that has a lenght field. That should be changed.

UsernameFodder commented 3 months ago

Although the symbol wouldn't be 100% accurate anymore. For example, JUICE_BAR_NECTAR_IQ_GAIN no longer refers to the immediate value on that address, but rather to the whole instruction. Does it make sense to document that as a symbol?

I think this is fine. It's already conveyed that these are literals anyway, not normal symbols. And since the value can involve a fractional number of bytes, trying to convey the exact location of the immediate value isn't possible in all cases anyway. The type will be documented in the C headers.

Well, technically, they can be modified while in RAM. No one is stopping you from doing that.

Well, by this logic, C shouldn't have a const keyword. Anything can be modified in RAM, but that doesn't mean they are, and doesn't mean a hacker should modify them in most cases. Labeling something as const doesn't actually do anything at the machine code level, it's purely a construct for the compiler to enforce restrictions on what C code is allowable. In these C headers, I think it's mainly useful as documentation. These instruction offsets aren't "variables" in the usual sense.

Frostbyte0x70 commented 3 months ago

Alright, I'm okay with this implementation then. I'll change it when I have some time.

Frostbyte0x70 commented 3 months ago

Alright, the new implementation is ready. The synchronization script needs to be updated to account for the fact that symbols might be defined in literals.yml reather than the yml file corresponding to the current binary. The sort order check might need to be changed too.

Frostbyte0x70 commented 3 months ago

It seems like sync-check follows strictly alphabetical order rather than numeric order for overlays. That should probably be changed.

UsernameFodder commented 3 months ago

It seems like sync-check follows strictly alphabetical order rather than numeric order for overlays. That should probably be changed.

It's mainly because the blocks are named, e.g., overlay0 rather than overlay00. I've been worried about changing it and possibly breaking things that rely on the old names, so I haven't gotten around to it. I'm not sure it's worth changing it just for the sake of the literals table anyway.

Frostbyte0x70 commented 3 months ago

Isn't it possible to just change the sorting algorithm without renaming them?

UsernameFodder commented 3 months ago

Isn't it possible to just change the sorting algorithm without renaming them?

No. The sort order comes from the YAML, and the YAML sorting needs to work for arbitrary strings.

Frostbyte0x70 commented 3 months ago

Isn't it possible to just change the sorting algorithm without renaming them?

No. The sort order comes from the YAML, and the YAML sorting needs to work for arbitrary strings.

I see. Oh well, I'll just reorder it then. Not like reordering them again would be an issue should you ever change the naming convention.