Closed tautology0 closed 5 years ago
I have a theory that it might rely on knowing X.
for example you have
LDX 0x0 ; LDA $0x0c00, X
so it can translate that to 0x0C00 + 0 -> 0x0C00, but it has a ref there so its DAT_0c00
if you had
LDX 0x1 ; LDA $0x0c04, X
it would disassemble into "LDA $0xc04,X=>DAT_0c05"
Can't tell by your example, but does your program conflict with this idea?
That would make sense, but there's also a third state, where is shows an offset:
07a6 a0 00 LDY #0x0
07a8 e4 03 CPX $PORTD
07aa f0 02 BEQ LAB_07ae
07ac b0 0e BCS LAB_07bc
07ae bd 00 0c LDA $0xc00,X=>DAT_0c02
07b1 99 00 0c STA $0xc00,Y=>DAT_0c02
I can't quite work out where it's getting the values of 0x02 for X and Y, especially as the nearest LDY # is for LDY #0 (there are INXs and INYs below).
TBH I would rather it references the label for the address passed in the mnemonic for all cases. Normally an indexed LDX/LDY is used to move memory around (like above) or to set memory to a specific value. Trying to guess the memory location is just going to be confusing and wrong most of the time.
There is a bug in 6502.slaspec looks like at least to answer the question of 3rd state
Looks like OP1 is broken for imm16,Y
OP1: $imm16,Y is bbb=6 & Y; imm16 { tmp:2 = imm16 + zext(X); export *:1 tmp; }
OP1: $imm16,Y is bbb=6 & Y; imm16 { tmp:2 = imm16 + zext(Y); export *:1 tmp; }
That would explain why this case that I found later made no sense to me:
681b a0 42 LDY #0x42
LAB_681d XREF[1]: 6829(j)
681d b9 c3 65 LDA $0x65c3,Y=>DAT_65c5
6820 99 ff 03 STA $0x3ff,Y=>DAT_0401
6823 a9 00 LDA #0x0
6825 99 af 04 STA $0x4af,Y=>BYTE_04b1
6828 88 DEY
I still think that adding what Ghidra think is the value of X or Y shouldn't be show in the mnemonic. 6502 isn't like ARM, where the offset is an integer and is usually static, because 6502 is normally hand coded, the index instructions are often used as index so it would be much more useful to see the label each time (as can be seen by the DEY [DEcrease Y] instruction at the bottom there).
If Ghidra thinks it knows the value of X or Y, then that should be a comment, not part of the mnemonic. So above I'd expect to see:
681b a0 42 LDY #0x42
LAB_681d XREF[1]: 6829(j)
681d b9 c3 65 LDA $DAT_65c3,Y Initial Y = DAT_6605
6820 99 ff 03 STA $DAT_3ff,Y Initial Y = DAT_441
6825 99 af 04 STA $DAT_4af,Y Initial Y = DAT_4f1
6828 88 DEY
Also, in normal 6502 an address isn't preceded by a $ - it's just raw, e.g.:
LDA 0x1234,Y
Often 6502 listings will use a $ (or an &) to signify hex, but this was simple because that's how the assembler required them.
I've been trying to use Ghidra for 6502, and came to the same conclusion. This behaviour makes sense for modern CPUs, but not for the 6502.
I've cobbled together a new analysis plugin and 6502 description to at least fix the obvious stuff: https://github.com/tom-seddon/Ghidra6502 - so far, it's got an alternative 6502 description (including a variant for working with 65C02 code), and a new analyzer that hopefully fixes the indexed addressing modes.
I have a disassembly project I plan on chipping away at using this, so unless Ghidra turns out to end up wholly unsuitable for 6502 (for some reason I haven't discovered yet) I'll hopefully be fixing any serious issues with it over time.
--Tom
From my perspective, the main problem here doesn't appear to be fixed, as a symbol is still created for the wrong address when using indexed addressing.
Repro steps for my test case can be found here: https://reverseengineering.stackexchange.com/questions/20810/can-i-stop-ghidra-from-creating-extended-references
Here's the result in Ghidra 9.0.1 - same as Ghidra 9.0.0, pretty much: (though the lack of spurious $
s is not going unnoticed)
//
// RAM
// fileOffset=0, length=11
// RAM: 0400-040a
//
0400 a2 a3 LDX #0xa3
0402 a9 00 LDA #0x0
LAB_0404 XREF[1]: 0408(j)
0404 9d 40 00 STA 0x40,X=>DAT_00e3 = ??
0407 ca DEX
0408 10 fa BPL LAB_0404
040a 60 RTS
The label DAT_00e3
is bogus. There should be a label created for address 0x40
- DAT_0040
, I suppose? - meaning STA 0x40,X
should get disassembled as something more like STA DAT_0040,X
.
Thanks,
--Tom
I agree 100% - having the label at the value in the mnemonic not only matches real code better, but is actually easier to code.
At the very least, can we switch the behaviour through a setting (with the default being the value in the mnemonic)?
This is still buggy - can we have the bug reopened? Every indexed and indirect case is being processed wrongly. Here's some test code showing zero page indexed, indexed and indirect X address:
0800 a2 f0 LDX #0xf0
0802 b5 80 LDA 0x80,X=>DAT_0170
0804 bd 00 80 LDA 0x8000,X=>DAT_80f0
0807 a1 80 LDA (0x80,X)=>DAT_0170
The zero page LDA at 0802 results in a DAT_0170, which is wrong as the 0xb5 opcode wraps around if it goes out side zero page. The 0xA1 opcode uses (the contents of 0x80 + X); (the contents of 0x80 +X + 1); not the value of 0x80.
All this would make more sense if the was just set to the that pointed to by the mnemonic, in the above case a label at 0x80 and 0x8000.
Sounds more like the pcode is incorrect in the above cases. If possible, would like to keep the constant computation correct. Does the Decompiler show something reasonable?
What you show above is definitely incorrect, but should work if the pcode is correct. There were some changes that should make this better in 9.0.2, but may not be exactly what you want.
I have seen code that sets a value in X, and then indexes off the base. If X is not a constant, then you will get the base offset. Unless the reference is too small. You can change what is considered too small in the constant analyzer.
One possibility is to force an address instead of a constant to be exported from a subconstructor. Will play with that when checking the pcode.
A small amount of tuning could be done in a 6502 targeted analyzer.
The following should fix the Zero Page Indexed Addressing. This won't change the references, but should address the location correctly in the listing and decompiler.
################################################################
REL: reloc is rel [ reloc = inst_next + rel; ] { export *:2 reloc; }
# Immediate
OP1: "#"imm8 is bbb=2; imm8 { tmp:1 = imm8; export tmp; }
# Zero Page
OP1: imm8 is bbb=1; imm8 { export *:1 imm8; }
# Zero Page Indexed X
OP1: imm8,X is bbb=5 & X; imm8 { tmp:2 = zext(imm8 + X); export *:1 tmp; }
# Absolute
OP1: imm16 is bbb=3; imm16 { export *:1 imm16; }
# Absolute Indexed X
OP1: imm16,X is bbb=7 & X; imm16 { tmp:2 = imm16 + zext(X); export *:1 tmp; }
# Absolute Indexed Y
OP1: imm16,Y is bbb=6 & Y; imm16 { tmp:2 = imm16 + zext(Y); export *:1 tmp; }
# Indirect X
OP1: (imm8,X) is bbb=0 & X; imm8 { addr:2 = zext(imm8 + X); tmp:2 = *:2 addr; export *:1 tmp; }
# Indirect Y
OP1: (imm8),Y is bbb=4 & Y; imm8 { addr:2 = imm8; tmp:2 = *:2 addr; tmp = tmp + zext(Y); export *:1 tmp; }
# Immediate
OP2: "#"imm8 is bbb=0; imm8 { tmp:1 = imm8; export tmp; }
# Zero Page
OP2: imm8 is bbb=1; imm8 { export *:1 imm8; }
OP2: A is bbb=2 & A { export A; }
# Absolute
OP2: imm16 is bbb=3; imm16 { export *:1 imm16; }
# Zero Page Indexed X
OP2: imm8,X is bbb=5 & X; imm8 { tmp:2 = zext(imm8 + X); export *:1 tmp; }
# Absolute Indexed X
OP2: imm16,X is bbb=7 & X; imm16 { tmp:2 = imm16 + zext(X); export *:1 tmp; }
OP2ST: OP2 is OP2 { export OP2; }
OP2ST: imm8,Y is bbb=5 & Y; imm8 { tmp:2 = zext(imm8 + Y); export *:1 tmp; }
OP2LD: OP2 is OP2 { export OP2; }
OP2LD: imm8,Y is bbb=5 & Y; imm8 { tmp:2 = zext(imm8 + Y); export *:1 tmp; }
OP2LD: imm16,Y is bbb=7 & Y; imm16 { tmp:2 = imm16 + zext(Y); export *:1 tmp; }
ADDR8: imm8 is imm8 { export *:1 imm8; }
ADDR16: imm16 is imm16 { export *:1 imm16; }
ADDRI: imm16 is imm16 { tmp:2 = imm16; export *:2 tmp; }
Can we stop closing this? The only solutions provided so far have not fixed the root issue: that is Ghidra tries to be too clever and results in a wrong disassembly in 99% of cases. This is making Ghidra unusable for 6502.
In the case of indexed and indirect addressing we don't want the value of X or Y added to the base, all we want is for Ghidra to mimic the behaviour of other disassembler since the 1980s and including IDA: to show the base as a label and that's it.
E.g. we don't want:
6820 99 ff 03 STA 0x3ff,Y=>DAT_0401
We want:
6820 99 ff 03 STA DAT_3ff,Y ; =>DAT_0401
Having the label in the right place is what makes it possible to work effectively.
As an example of how wrong it can be, a common pattern is (using a decrement to save an instruction):
LDX #0x50
loop: LDA source,X
STA dest,X
DEX
BNE loop
In Ghidra this would end up something like:
LDX #0x50
loop: LDA 0xc00,X=>DAT_C50
STA 0x900,X =>DAT_950
DEX
BNE loop
With the only label being created at the end of the data block instead of the start of the data block, which is where you would logically require a label.
I don't know the language used to define the mnemonics, but to me it should read:
# Absolute Indexed X
OP1: imm16,X is bbb=7 & X; imm16 { tmp:2 = imm16; export *:1 tmp; }
# Absolute Indexed Y
OP1: imm16,Y is bbb=6 & Y; imm16 { tmp:2 = imm16; export *:1 tmp; }
A fix was put in for the upcoming Ghidra 9.0.3.
Ah, okay, I couldn't see that in the source on github.
Sorry for the confusion. It has been pushed to our internal repo so it will be in our next release, but the sync to GitHub hasn't happened yet. We aim to synchronization daily, but are still working out the kinks. You will see it soon.
Still getting bad disassemblies, this is an example from something I'm working on:
16cb a2 08 LDX #0x8
copyLoop XREF[1]: 16d4(j)
16cd bd 37 0b LDA 0xb37,X=>DefaultObjects[7] = null
16d0 9d 2f 0b STA 0xb2f,X=>DAT_0b37
16d3 ca DEX
16d4 d0 f7 BNE copyLoop
Because Ghidra is trying to "guess" the value of X it ends up at the top of the range (0b37) instead of the bottom of the range where it would be useful (0b2f). In this case I would expect the label for the STA to be placed at DAT_0b2f instead of DAT_0b37.
Similarly for the LDA, I'd expect the label to be placed at 0xb37 (i.e. the start of the array) not at the end of it.
It does get it right on the decompiled view though:
do {
(&CurrentRoom)[(ushort)bVar1] = (&DAT_0b37)[(ushort)bVar1];
bVar1 = bVar1 - 1;
} while (bVar1 != 0);
(Where CurrentRoom = 0xb2f)
This behaviour is still there in the latest public release 11.0.3.
343f a2 0c LDX #0xc
3441 a0 1b LDY #0x1b
initNameLookup XREF[1]: 3453(j)
3443 8a TXA
3444 99 ee 06 STA 0x6ee,Y=>DAT_0709
3447 18 CLC
3448 69 14 ADC #0x14
344a aa TAX
344b a9 07 LDA #0x7
344d 99 ef 06 STA 0x6ef,Y=>DAT_070a
Is it a regression, or is this expected?
This is currently expected behavior, but not the behavior you might expect. The constant analysis is a semi-symbolic flow analysis which follows all paths. Because the path has a known initial value of 0x1b for Y, it produces a reference to the constant computed location. Were Y to be unknown, it would produce a reference to the base value 0x6ef.
The decompiler is doing a full function data flow analysis and can see what values are variable within a looping block. The constant analysis is more like emulation that can hold some values as not fully constant if they are unknown.
I agree that it might be better in some cases to have the reference created to the top of the loop. There are some benefits as you get a reference to the bottom of the array. In the example above yours, the two arrays seem to be overlapped. The location 0xb37 is never read because X is never 0 even though it is seen as the start of the array because of the LDA. The location 0xb37 is actually written by the STA because the first write occurs there.
There would need to be a more full function data flow analysis done to know that a register is variable within an area of code that is executed more than once, such as a loop. This could be done as first step to help control the laying down of references.
It may be expected behaviour, but it's totally undesirable and produces a next-to-useless disassembly. Can we at least have a (default enabled) option to tell Ghidra to not try and be clever for indexed mnemonics?
I pretty much always want the label of LDA 0xc00,X to be set to 0xc00, no matter what X is.
Describe the bug 6502 has an instruction: LDA addr,X and LDA addr,Y which loads the assembler with the contents of address (addr + X) or (addr + Y). When this is disassembled sometimes the results has a label, sometimes it doesn't.
An example is shown below, from the same binary:
As can be seen, the entry at 0x78f uses 0xc00 whereas the entry at 0x79a uses the label $DAT_0c00. Looking at the Instruction Info dialogue, it appears that on the one with a label it decides that 0xc00 is an address, but doesn't on the one without a label.
In all cases with 6502, 0xbd means that the passed 16-bit parameter is an address.
To Reproduce Disassemble a binary with multiple LDA addr,X in it. Look for inconsistencies.
Expected behavior In all cases the mnemonic should have the parameter being an address and use the label.
Environment (please complete the following information):