NationalSecurityAgency / ghidra

Ghidra is a software reverse engineering (SRE) framework
https://www.nsa.gov/ghidra
Apache License 2.0
51.25k stars 5.84k forks source link

6502 disassembler inconsistent with LDA addr,X #201

Closed tautology0 closed 5 years ago

tautology0 commented 5 years ago

Describe the bug 6502 has an instruction: LDA addr,X and LDA addr,Y which loads the assembler with the contents of address (addr + X) or (addr + Y). When this is disassembled sometimes the results has a label, sometimes it doesn't.

An example is shown below, from the same binary:

            078f bd 00 0c        LDA        $0xc00,X
            0792 c9 20           CMP        #0x20
            0794 d0 f8           BNE        LAB_078e
            0796 86 02           STX        $PORTC
                             LAB_0798                                        XREF[1]:     077e(j)  
            0798 a2 00           LDX        #0x0
                             LAB_079a                                        XREF[1]:     07a3(j)  
            079a bd 00 0c        LDA        $DAT_0c00,X
            079d 20 ee ff        JSR        OSWRCH                                           void OSWRCH(byte character)

As can be seen, the entry at 0x78f uses 0xc00 whereas the entry at 0x79a uses the label $DAT_0c00. Looking at the Instruction Info dialogue, it appears that on the one with a label it decides that 0xc00 is an address, but doesn't on the one without a label.

In all cases with 6502, 0xbd means that the passed 16-bit parameter is an address.

To Reproduce Disassemble a binary with multiple LDA addr,X in it. Look for inconsistencies.

Expected behavior In all cases the mnemonic should have the parameter being an address and use the label.

Environment (please complete the following information):

mumbel commented 5 years ago

I have a theory that it might rely on knowing X. for example you have LDX 0x0 ; LDA $0x0c00, X so it can translate that to 0x0C00 + 0 -> 0x0C00, but it has a ref there so its DAT_0c00 if you had LDX 0x1 ; LDA $0x0c04, X it would disassemble into "LDA $0xc04,X=>DAT_0c05"

Can't tell by your example, but does your program conflict with this idea?

tautology0 commented 5 years ago

That would make sense, but there's also a third state, where is shows an offset:

            07a6 a0 00           LDY        #0x0
            07a8 e4 03           CPX        $PORTD
            07aa f0 02           BEQ        LAB_07ae
            07ac b0 0e           BCS        LAB_07bc
            07ae bd 00 0c        LDA        $0xc00,X=>DAT_0c02
            07b1 99 00 0c        STA        $0xc00,Y=>DAT_0c02

I can't quite work out where it's getting the values of 0x02 for X and Y, especially as the nearest LDY # is for LDY #0 (there are INXs and INYs below).

TBH I would rather it references the label for the address passed in the mnemonic for all cases. Normally an indexed LDX/LDY is used to move memory around (like above) or to set memory to a specific value. Trying to guess the memory location is just going to be confusing and wrong most of the time.

mumbel commented 5 years ago

There is a bug in 6502.slaspec looks like at least to answer the question of 3rd state

Looks like OP1 is broken for imm16,Y

OP1: $imm16,Y           is bbb=6 & Y; imm16             { tmp:2 = imm16 + zext(X); export *:1 tmp; }
OP1: $imm16,Y           is bbb=6 & Y; imm16             { tmp:2 = imm16 + zext(Y); export *:1 tmp; }
tautology0 commented 5 years ago

That would explain why this case that I found later made no sense to me:

            681b a0 42           LDY        #0x42
                             LAB_681d                                        XREF[1]:     6829(j)  
            681d b9 c3 65        LDA        $0x65c3,Y=>DAT_65c5
            6820 99 ff 03        STA        $0x3ff,Y=>DAT_0401
            6823 a9 00           LDA        #0x0
            6825 99 af 04        STA        $0x4af,Y=>BYTE_04b1
            6828 88              DEY

I still think that adding what Ghidra think is the value of X or Y shouldn't be show in the mnemonic. 6502 isn't like ARM, where the offset is an integer and is usually static, because 6502 is normally hand coded, the index instructions are often used as index so it would be much more useful to see the label each time (as can be seen by the DEY [DEcrease Y] instruction at the bottom there).

If Ghidra thinks it knows the value of X or Y, then that should be a comment, not part of the mnemonic. So above I'd expect to see:

            681b a0 42           LDY        #0x42
                             LAB_681d                                        XREF[1]:     6829(j)  
            681d b9 c3 65        LDA        $DAT_65c3,Y                      Initial Y = DAT_6605
            6820 99 ff 03        STA        $DAT_3ff,Y                       Initial Y = DAT_441        
            6825 99 af 04        STA        $DAT_4af,Y                       Initial Y = DAT_4f1
            6828 88              DEY

Also, in normal 6502 an address isn't preceded by a $ - it's just raw, e.g.:

            LDA 0x1234,Y

Often 6502 listings will use a $ (or an &) to signify hex, but this was simple because that's how the assembler required them.

tom-seddon commented 5 years ago

I've been trying to use Ghidra for 6502, and came to the same conclusion. This behaviour makes sense for modern CPUs, but not for the 6502.

I've cobbled together a new analysis plugin and 6502 description to at least fix the obvious stuff: https://github.com/tom-seddon/Ghidra6502 - so far, it's got an alternative 6502 description (including a variant for working with 65C02 code), and a new analyzer that hopefully fixes the indexed addressing modes.

I have a disassembly project I plan on chipping away at using this, so unless Ghidra turns out to end up wholly unsuitable for 6502 (for some reason I haven't discovered yet) I'll hopefully be fixing any serious issues with it over time.

--Tom

tom-seddon commented 5 years ago

From my perspective, the main problem here doesn't appear to be fixed, as a symbol is still created for the wrong address when using indexed addressing.

Repro steps for my test case can be found here: https://reverseengineering.stackexchange.com/questions/20810/can-i-stop-ghidra-from-creating-extended-references

Here's the result in Ghidra 9.0.1 - same as Ghidra 9.0.0, pretty much: (though the lack of spurious $s is not going unnoticed)

                             //
                             // RAM 
                             // fileOffset=0, length=11
                             // RAM: 0400-040a
                             //
            0400 a2 a3           LDX        #0xa3
            0402 a9 00           LDA        #0x0
                             LAB_0404                                        XREF[1]:     0408(j)  
            0404 9d 40 00        STA        0x40,X=>DAT_00e3                                 = ??
            0407 ca              DEX
            0408 10 fa           BPL        LAB_0404
            040a 60              RTS

The label DAT_00e3 is bogus. There should be a label created for address 0x40 - DAT_0040, I suppose? - meaning STA 0x40,X should get disassembled as something more like STA DAT_0040,X.

Thanks,

--Tom

tautology0 commented 5 years ago

I agree 100% - having the label at the value in the mnemonic not only matches real code better, but is actually easier to code.

At the very least, can we switch the behaviour through a setting (with the default being the value in the mnemonic)?

tautology0 commented 5 years ago

This is still buggy - can we have the bug reopened? Every indexed and indirect case is being processed wrongly. Here's some test code showing zero page indexed, indexed and indirect X address:

            0800 a2 f0           LDX        #0xf0
            0802 b5 80           LDA        0x80,X=>DAT_0170
            0804 bd 00 80        LDA        0x8000,X=>DAT_80f0
            0807 a1 80           LDA        (0x80,X)=>DAT_0170

The zero page LDA at 0802 results in a DAT_0170, which is wrong as the 0xb5 opcode wraps around if it goes out side zero page. The 0xA1 opcode uses (the contents of 0x80 + X); (the contents of 0x80 +X + 1); not the value of 0x80.

All this would make more sense if the was just set to the that pointed to by the mnemonic, in the above case a label at 0x80 and 0x8000.

emteere commented 5 years ago

Sounds more like the pcode is incorrect in the above cases. If possible, would like to keep the constant computation correct. Does the Decompiler show something reasonable?

What you show above is definitely incorrect, but should work if the pcode is correct. There were some changes that should make this better in 9.0.2, but may not be exactly what you want.

I have seen code that sets a value in X, and then indexes off the base. If X is not a constant, then you will get the base offset. Unless the reference is too small. You can change what is considered too small in the constant analyzer.

One possibility is to force an address instead of a constant to be exported from a subconstructor. Will play with that when checking the pcode.

A small amount of tuning could be done in a 6502 targeted analyzer.

emteere commented 5 years ago

The following should fix the Zero Page Indexed Addressing. This won't change the references, but should address the location correctly in the listing and decompiler.

################################################################
REL: reloc      is rel  [ reloc = inst_next + rel; ] { export *:2 reloc; } 

# Immediate
OP1: "#"imm8    is bbb=2; imm8          { tmp:1 = imm8; export tmp; }
# Zero Page
OP1: imm8       is bbb=1; imm8          { export *:1 imm8; }
# Zero Page Indexed X
OP1: imm8,X     is bbb=5 & X; imm8      { tmp:2 = zext(imm8 + X); export *:1 tmp; }
# Absolute
OP1: imm16      is bbb=3; imm16         { export *:1 imm16; }
# Absolute Indexed X
OP1: imm16,X    is bbb=7 & X; imm16     { tmp:2 = imm16 + zext(X); export *:1 tmp; }
# Absolute Indexed Y
OP1: imm16,Y    is bbb=6 & Y; imm16     { tmp:2 = imm16 + zext(Y); export *:1 tmp; }
# Indirect X
OP1: (imm8,X)   is bbb=0 & X; imm8      { addr:2 = zext(imm8 + X); tmp:2 = *:2 addr; export *:1 tmp; }
# Indirect Y
OP1: (imm8),Y   is bbb=4 & Y; imm8      { addr:2 = imm8; tmp:2 = *:2 addr; tmp = tmp + zext(Y); export *:1 tmp; }

# Immediate
OP2: "#"imm8    is bbb=0; imm8          { tmp:1 = imm8; export tmp; }
# Zero Page
OP2: imm8       is bbb=1; imm8          { export *:1 imm8; }
OP2: A          is bbb=2 & A            { export A; }
# Absolute
OP2: imm16      is bbb=3; imm16         { export *:1 imm16; }
# Zero Page Indexed X
OP2: imm8,X     is bbb=5 & X; imm8      { tmp:2 = zext(imm8 + X); export *:1 tmp; }
# Absolute Indexed X
OP2: imm16,X    is bbb=7 & X; imm16     { tmp:2 = imm16 + zext(X); export *:1 tmp; }

OP2ST: OP2      is OP2                  { export OP2; }
OP2ST: imm8,Y   is bbb=5 & Y; imm8      { tmp:2 = zext(imm8 + Y); export *:1 tmp; }

OP2LD: OP2      is OP2                  { export OP2; }
OP2LD: imm8,Y   is bbb=5 & Y; imm8      { tmp:2 = zext(imm8 + Y); export *:1 tmp; }
OP2LD: imm16,Y  is bbb=7 & Y; imm16     { tmp:2 = imm16 + zext(Y); export *:1 tmp; }

ADDR8:  imm8    is imm8     { export *:1 imm8; }
ADDR16: imm16   is imm16    { export *:1 imm16; }
ADDRI:  imm16   is imm16    { tmp:2 = imm16; export *:2 tmp; }
tautology0 commented 5 years ago

Can we stop closing this? The only solutions provided so far have not fixed the root issue: that is Ghidra tries to be too clever and results in a wrong disassembly in 99% of cases. This is making Ghidra unusable for 6502.

In the case of indexed and indirect addressing we don't want the value of X or Y added to the base, all we want is for Ghidra to mimic the behaviour of other disassembler since the 1980s and including IDA: to show the base as a label and that's it.

E.g. we don't want:

 6820 99 ff 03        STA        0x3ff,Y=>DAT_0401

We want:

 6820 99 ff 03        STA        DAT_3ff,Y   ; =>DAT_0401

Having the label in the right place is what makes it possible to work effectively.

As an example of how wrong it can be, a common pattern is (using a decrement to save an instruction):

         LDX #0x50
loop: LDA source,X
         STA dest,X
         DEX
         BNE loop

In Ghidra this would end up something like:

         LDX #0x50
loop: LDA 0xc00,X=>DAT_C50
         STA 0x900,X =>DAT_950
         DEX
         BNE loop

With the only label being created at the end of the data block instead of the start of the data block, which is where you would logically require a label.

I don't know the language used to define the mnemonics, but to me it should read:

# Absolute Indexed X
OP1: imm16,X    is bbb=7 & X; imm16     { tmp:2 = imm16; export *:1 tmp; }
# Absolute Indexed Y
OP1: imm16,Y    is bbb=6 & Y; imm16     { tmp:2 = imm16; export *:1 tmp; }
ryanmkurtz commented 5 years ago

A fix was put in for the upcoming Ghidra 9.0.3.

tautology0 commented 5 years ago

Ah, okay, I couldn't see that in the source on github.

ryanmkurtz commented 5 years ago

Sorry for the confusion. It has been pushed to our internal repo so it will be in our next release, but the sync to GitHub hasn't happened yet. We aim to synchronization daily, but are still working out the kinks. You will see it soon.

tautology0 commented 5 years ago

Still getting bad disassemblies, this is an example from something I'm working on:

            16cb a2 08           LDX        #0x8
                             copyLoop                                        XREF[1]:     16d4(j)  
            16cd bd 37 0b        LDA        0xb37,X=>DefaultObjects[7]                       = null
            16d0 9d 2f 0b        STA        0xb2f,X=>DAT_0b37
            16d3 ca              DEX
            16d4 d0 f7           BNE        copyLoop

Because Ghidra is trying to "guess" the value of X it ends up at the top of the range (0b37) instead of the bottom of the range where it would be useful (0b2f). In this case I would expect the label for the STA to be placed at DAT_0b2f instead of DAT_0b37.

Similarly for the LDA, I'd expect the label to be placed at 0xb37 (i.e. the start of the array) not at the end of it.

It does get it right on the decompiled view though:

        do {
          (&CurrentRoom)[(ushort)bVar1] = (&DAT_0b37)[(ushort)bVar1];
          bVar1 = bVar1 - 1;
        } while (bVar1 != 0);

(Where CurrentRoom = 0xb2f)

petebeck1 commented 5 months ago

This behaviour is still there in the latest public release 11.0.3.

           343f a2 0c           LDX        #0xc
            3441 a0 1b           LDY        #0x1b
                             initNameLookup                                  XREF[1]:     3453(j)  
            3443 8a              TXA
            3444 99 ee 06        STA        0x6ee,Y=>DAT_0709
            3447 18              CLC
            3448 69 14           ADC        #0x14
            344a aa              TAX
            344b a9 07           LDA        #0x7
            344d 99 ef 06        STA        0x6ef,Y=>DAT_070a

Is it a regression, or is this expected?

emteere commented 3 months ago

This is currently expected behavior, but not the behavior you might expect. The constant analysis is a semi-symbolic flow analysis which follows all paths. Because the path has a known initial value of 0x1b for Y, it produces a reference to the constant computed location. Were Y to be unknown, it would produce a reference to the base value 0x6ef.

The decompiler is doing a full function data flow analysis and can see what values are variable within a looping block. The constant analysis is more like emulation that can hold some values as not fully constant if they are unknown.

I agree that it might be better in some cases to have the reference created to the top of the loop. There are some benefits as you get a reference to the bottom of the array. In the example above yours, the two arrays seem to be overlapped. The location 0xb37 is never read because X is never 0 even though it is seen as the start of the array because of the LDA. The location 0xb37 is actually written by the STA because the first write occurs there.

There would need to be a more full function data flow analysis done to know that a register is variable within an area of code that is executed more than once, such as a loop. This could be done as first step to help control the laying down of references.

tautology0 commented 1 month ago

It may be expected behaviour, but it's totally undesirable and produces a next-to-useless disassembly. Can we at least have a (default enabled) option to tell Ghidra to not try and be clever for indexed mnemonics?

I pretty much always want the label of LDA 0xc00,X to be set to 0xc00, no matter what X is.