Arakula / dasmfw

The DisASseMbler FrameWork
GNU General Public License v2.0

Absolute / zeropage addressing not working as expected #8

Open phillipeaton opened 2 years ago

phillipeaton commented 2 years ago

Following on from the forceaddr off issue, I had a problem with as65 not assembling and that's why I was attempting workarounds with forceaddr:

image

A bit of research shows that the 6809-style >z009d isn't used for 6502 assembly. Here are some assembler manual links that describe how they deal with absolute and zeropage addressing.

This is the best description: ACME Assembler

This is also good: KickAssembler

And this one: DASM (PDF page 61/62, manual page 52/53)

And one more that shows it in action: image

Generally, it seems there are two ways of managing absolute/zero page addressing:

  1. The old way where the assembler decides by the number of characters in the number, $nn = zero page, $nnnn = absolute.
  2. The modern way where you append a .z or .a to the lda.
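A minimal Python sketch of the two conventions listed above, as a disassembler might emit them. The function names are hypothetical and are not part of dasmfw or any assembler; whether a given assembler honours the digit count or the suffix has to be checked in its manual.

```python
def operand_by_width(address, absolute):
    """Convention 1: the number of hex digits selects the mode.

    Two digits ($nn) ask for zero page, four digits ($nnnn) for absolute.
    """
    return "$%04x" % address if absolute else "$%02x" % address


def mnemonic_by_suffix(mnemonic, absolute):
    """Convention 2: a .a or .z suffix on the mnemonic selects the mode."""
    return mnemonic + (".a" if absolute else ".z")


# The same zero-page address, written so that absolute addressing is kept.
print("lda", operand_by_width(0x9D, absolute=True))
print(mnemonic_by_suffix("lda", absolute=True), "$9d")
```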

It seems to me that all the assemblers I looked at recognize 1., and newer assemblers recognize both 1. and 2. Some assemblers also recognize other forcing parameters, but I'm not sure they're really necessary; Kick Assembler has specifically deprecated them all, apart from .z/.a.

The as65 assembler specifically complains about the jsr >z009d, probably because jsr always uses a 16-bit address. I get four jsr errors in my listing, but there are many other instructions with this address mode, e.g. asl >m0000, that do not throw an error. It would appear '>' is valid for addresses but, from what I can tell, it's for manipulating the address data at assemble time, not for selecting the address mode. I still have a lot of non-matching code between the original binary and the disassembled/reassembled binary, so I will look into this a bit more and report back.

phillipeaton commented 2 years ago

I still have a lot of non-matching code between the original binary and the disassembled/reassembled binary, so I will look into this a bit more and report back.

I looked through about 25% of my code and there appear to be three kinds of occurrence that create differences between the original binary and the disassembled/reassembled binary (with as65):

  1. Missing bytes due to absolute addressing reassembled with zero page addressing.
  2. Jumps with targets at different locations due to the missing bytes caused by 1.
  3. Strings with differences because I'm converting the whole disassembled source code to lower case to make the register names lower case.

I suspect that the missing bytes due to 1. are likely to be data items anyway, so I will add data statements to those and see how I get on.
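To illustrate point 1, here is a small Python sketch of why a reassembled binary can come out one byte shorter. It uses the standard 6502 opcodes for LDA ($A5 zero page, $AD absolute); the `encode_lda` helper itself is hypothetical and only mimics an assembler that picks zero page whenever the address fits.

```python
LDA_ZEROPAGE = 0xA5  # LDA zp  : opcode + 1 address byte
LDA_ABSOLUTE = 0xAD  # LDA abs : opcode + 2 address bytes (little-endian)


def encode_lda(address, force_absolute=False):
    """Encode LDA the way an optimizing assembler would, unless forced."""
    if address <= 0xFF and not force_absolute:
        return bytes([LDA_ZEROPAGE, address])
    return bytes([LDA_ABSOLUTE, address & 0xFF, address >> 8])


original = encode_lda(0x009D, force_absolute=True)  # what the binary contains
reassembled = encode_lda(0x009D)                    # what the assembler emits

print(original.hex(), len(original))
print(reassembled.hex(), len(reassembled))
```

Every byte after such an instruction shifts down by one, which is what moves the jump targets described in point 2.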

I spent a couple of hours trying out a handful of different assemblers to assemble the dasmfw output, and so far none of them appear to be as compatible as as65, perhaps because it's quite old. Generally they use .word, .const and similar to handle FCB, EQUs etc.

None of the assemblers are consistent with each other, so your idea of tailoring output to specific assemblers is likely to be necessary.

Now I will push forward with as65 and attempt to get my dasmfw output source code to assemble, and then look at other functionality. I suspect I may then get other problems with incompatibilities, but let's see.

I came across another disassembler/assembler pair, namely BeebAsm/BeebDis (Pages 1 & 8) (although they are written by different people). BeebDis seems to have a similar philosophy to dasmfw....

BeebDis takes the philosophy that disassembly is a partly interactive process where, and that you will 
need to run it several times against a piece of code modifying the parameters each time as you discover 
the various areas of the code you are processing.

As such, BeebDis relies on the creation of a disassembly control file, which configures how disassembly
proceeds. This control file will contain directives that define the various areas of the disassembly
process.

...and includes some options for formatting output for an arbitrary assembler...

DefineByte <string>
How to define byte storage in the output code, defaults to ‘equb’ for 6502 and ‘fcb’ for 6809.

DefineWord <string>
How to define word storage (16 bits) in the output code, defaults to ‘equw’ for 6502 and ‘fdb’ for 6809.
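As a rough Python illustration of what such configurable storage directives buy you, the sketch below emits the same data with different directive names. The helper names and the exact output layout are assumptions made for this example and are not taken from BeebDis or dasmfw; only the default directive names come from the text quoted above.

```python
def define_bytes(data, directive="equb"):
    """Format 8-bit values using a configurable byte directive."""
    return "%s %s" % (directive, ",".join("$%02x" % b for b in data))


def define_words(words, directive="equw"):
    """Format 16-bit values using a configurable word directive."""
    return "%s %s" % (directive, ",".join("$%04x" % w for w in words))


print(define_bytes(b"\x01\xff"))          # 6502 default quoted above
print(define_bytes(b"\x01\xff", "fcb"))   # 6809 default quoted above
print(define_words([0x1234], ".word"))    # directive used by other assemblers
```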

I'm not sure how mature/compatible this BeebDis/BeebAsm combo is, but it's probably worth a look. It didn't seem to have the same depth of functionality as dasmfw, but I only had a quick look.

I'll let you know how I get on with progress with as65.

phillipeaton commented 2 years ago
  1. Missing bytes due to absolute addressing reassembled with zero page addressing.
  2. Jumps with targets at different locations due to the missing bytes caused by 1.
  3. Strings with differences because I'm converting the whole disassembled source code to lower case to make the register names lower case.

I suspect that the missing bytes due to 1. are likely to be data items anyway, so I will add data statements to those and see how I get on.

I suspected right. 26 data lines later and I have 8k disassembled/reassembled identically. All I need to do is search/replace A, X and Y registers to lower case before assembling; I need an awk script :-)
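A minimal sketch of that search/replace, written in Python rather than awk. It assumes a source layout of `[label] mnemonic operand ; comment`, and it deliberately leaves any line containing a string or character literal untouched, so such lines would still need checking by hand. The function name is hypothetical.

```python
import re

# Index register after a comma, e.g. "LDA $10,X" or "LDA ($10),Y".
_INDEX = re.compile(r"(,\s*)([XY])\b")
# Accumulator as the only operand, e.g. "ASL A".
_ACCU = re.compile(r"(\s)A(\s*)$")


def lowercase_registers(line):
    """Lower-case A, X and Y register operands on one source line."""
    code, sep, comment = line.partition(";")
    if '"' in code or "'" in code:
        return line  # do not risk changing string or character literals
    code = _INDEX.sub(lambda m: m.group(1) + m.group(2).lower(), code)
    code = _ACCU.sub(lambda m: m.group(1) + "a" + m.group(2), code)
    return code + sep + comment


print(lowercase_registers("L1      LDA ($10),Y ; use Y"))
print(lowercase_registers("        ASL A"))
```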

Next step is to try some more dasmfw formatting commands.

Arakula commented 2 years ago

What would be better in your opinion:

Convenience vs. flexibility vs. necessity ... if 100% of all use cases require the same setting for both, a separate option would be counterproductive ...

phillipeaton commented 2 years ago

My gut feeling says one option only. Given the register name is included in the mnemonic name e.g. 3rd character of lda, referring to other registers using a different case on the same line would seem inconsistent.

I don't think I've seen that inconsistency in any of the docs for assemblers I've looked at recently, except in the DASM manual, though I don't believe it's a requirement to enter code that way.

The manual states case-insensitive for several features, although not regarding register names specifically:

image

Arakula commented 2 years ago

I've uploaded a bigger update plus executable in https://github.com/Arakula/dasmfw/releases/tag/v0.26 that ...

phillipeaton commented 2 years ago

Thanks for the code update! I can confirm that my raw dasmfw v0.26 disassembly now reassembles successfully with as65, without any manual changes to the disassembled source code. I'm using 26 dasmfw data instructions to avoid any zero-page issues and option upmnemo off.

I reviewed your "TODO" header text. From what I have ascertained recently, "<" and ">" are often used to assemble the top or bottom byte of an assembler variable, for an 8-bit CPU, I guess that could be quite useful.
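For reference, this is the arithmetic those operators usually stand for: in many 6502 assemblers `<expr` yields the low byte and `>expr` the high byte of a 16-bit value, although the convention should be checked for each assembler. The Python function names below are hypothetical.

```python
def low_byte(value):
    """What `<value` commonly evaluates to: bits 0-7."""
    return value & 0xFF


def high_byte(value):
    """What `>value` commonly evaluates to: bits 8-15."""
    return (value >> 8) & 0xFF


address = 0x1234
print("lda #$%02x" % low_byte(address))
print("ldx #$%02x" % high_byte(address))
```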

Regarding using $00 or $0000 to decide on absolute or zeropage, I don't think that's a bad solution to the problem if that's all that the assembler will accommodate, even if it's just a default method (though I appreciate it might be difficult to implement). However, I would question how robust a disassembly solution can ever be for absolute or zeropage addressing. I don't know why the programmer would want to force an absolute address, maybe for a special cycle-timing need, but otherwise I'm thinking you'd only ever do it sparingly. Unfortunately that will fool the disassembler often and you can never do anything about having the issue in data blocks. All you can really do for certain is manually force the address mode from the info file, like I'm doing (although it could be more elegant than the data instruction I'm using).

As I attempt to disassemble the jetpac binary, we'll see how usable dasmfw is overall, as there may be many formatting issues that aren't going to work and this will probably be the same for all assemblers, not just as65. I guess it'll be a question of identifying the disassembly elements that need a specific format and then working out the best way of specifying them. Then, ideally, you'd be able to set a group of formats to define a specific assembler, but that's an ideal scenario, you may have other priorities!

Anyway, I'll push forward with the detail of my disassembly and report back progress.

BTW, I also found yet another way of denoting absolute or zeropage:

image

Direct page, data bank, program bank indexed and long addressing modes of instructions are intelligently
chosen based on the instruction type, the address ranges set up by [.dpage](http://tass64.sourceforge.net/#d_dpage),
[.databank](http://tass64.sourceforge.net/#d_databank) and the current program counter address. Therefore 
the ,d, ,b and ,k indexing is only used in very special cases.

The immediate direct page indexed #0,d addressing mode is usable for direct page access. The 8 bit 
constant is a direct offset from the start of actual direct page. Alternatively it may be written as 0,d.
<< lots more descriptions>>
Arakula commented 2 years ago

Then, ideally, you'd be able to set a group of formats to define a specific assembler, but that's an ideal scenario, you may have other priorities!

That is, in the long run, precisely what I plan to do. Solve the problem once and for all. But that will take time and careful planning, as there are so many options for the simplest things, even for the few disassemblers I have already implemented.

Arakula commented 2 years ago

I don't know why the programmer would want to force an absolute address, maybe for a special cycle-timing need, but otherwise I'm thinking you'd only ever do it sparingly.

One scenario comes to my mind: a one-pass assembler with some code on the zero page referencing data that comes a bit later. A two-pass assembler might flag this as a phase error.

Arakula commented 2 years ago

I've invested a silent hour into writing up some basics. Might as well share them with you, maybe you have some inputs ...

The basic idea is to provide a class that formats any output according to
the capabilities of a specific assembler.

The disassemblers would then format a line's contents as an array of items
and pass that to the Assembler class to format the output into lines matching
the selected assembler's methods.

Possible Items:
===============

text {cchar}
        Text covering the rest of the line.
        cchar would be a boolean that defines whether a leading comment
          character is to be printed.
        This item, if there, has to be the last in the array.

label {ldchar}
        label for the current instruction.
        ldchar would be a boolean that can be used to force output of the label
          delimiter character. This can be overridden if a hypothetical
          assembler always requires or doesn't support a label delimiter
          character.

instruction
        Assembler instruction (mnemonic or pseudo-op) to use.
        I'm not sure yet how this could be realized in a way that's useful, but
        does not overcomplicate everything. Would it be better to just pass the
        ID of a specific instruction and let the Assembler class generate the
        matching instruction, or should the mnemonic text be passed, and the
        output formatter only decides on upper- and lowercase?
        Presumably the first is better, but configuring that might become a
        nightmare.
        Possible solution: each disassembler for a specific processor gets a
        companion class that subclasses Assembler with a defined set of IDs and
        a default set of mnemonics which could be overridden in a configuration
        file if needed. Doesn't look too bad.

parameter
        One of the parameters used by the instruction.
        This is even trickier than mnemonic above. Not yet sure how to capture
        all the possible ways such a parameter can be passed. Also, what
        exactly is a parameter? Looking at the simple 6809 instruction
          LDA Base+1
        ... is that one parameter, or two with a given concatenation character,
        or is that a set of 3 parameters, the middle one defining an addition?
        Or, if "Base" is a known 16-bit word ... what is this then? A parameter
        plus an offset, or a reference to the low byte of the parameter? Some
        assemblers would be able to handle that, whereas others would require
        the "+1" semantic.
        Also, the addressing mode would have to be passed; this, however, can
        define how to output one parameter or a complete set of parameters -
        but not necessarily all of them.
        Another uncomfortable thing: forced addressing. This can, depending on
        the processor and the assembler, take some quite "interesting" forms,
        where either the mnemonic or the parameter is decorated in some way,
        or even both (like "an add instruction taking an 8- and a 16-bit
        parameter storing the result in a 32-bit register").
        Hmmm. Not easy. Obviously, some kind of hierarchy is needed.

That's it for today. Comments, precisions, etc. are very welcome.
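A rough sketch of the item-array idea, in Python for brevity rather than in dasmfw's own implementation language. Every class, field and attribute name here is hypothetical, and parameters are reduced to plain strings, which sidesteps the open questions about parameter structure and forced addressing raised above.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Label:
    name: str
    force_delimiter: bool = False  # the "ldchar" flag


@dataclass
class Instruction:
    mnemonic: str


@dataclass
class Parameter:
    text: str


@dataclass
class Text:
    text: str
    comment_char: bool = True  # the "cchar" flag


Item = Union[Label, Instruction, Parameter, Text]


class Assembler:
    """Default output style; subclasses describe a specific assembler."""
    comment_char = ";"
    label_delimiter = ":"
    always_delimit_labels = False
    uppercase_mnemonics = True

    def format_line(self, items: List[Item]) -> str:
        label = mnemonic = comment = ""
        params = []
        for item in items:
            if isinstance(item, Label):
                delimit = item.force_delimiter or self.always_delimit_labels
                label = item.name + (self.label_delimiter if delimit else "")
            elif isinstance(item, Instruction):
                name = item.mnemonic
                mnemonic = name.upper() if self.uppercase_mnemonics else name.lower()
            elif isinstance(item, Parameter):
                params.append(item.text)
            elif isinstance(item, Text):
                prefix = self.comment_char + " " if item.comment_char else ""
                comment = prefix + item.text
        line = " ".join([label.ljust(8), mnemonic.ljust(5), ",".join(params)]).rstrip()
        if comment:
            line = line + " " + comment if line else comment
        return line


class LowercaseAssembler(Assembler):
    """A hypothetical assembler wanting lower case and delimited labels."""
    always_delimit_labels = True
    uppercase_mnemonics = False


items = [Label("Start"), Instruction("lda"), Parameter("Base+1"), Text("load")]
print(Assembler().format_line(items))
print(LowercaseAssembler().format_line(items))
```

The disassembler only builds the item list; everything assembler-specific lives in class attributes, which is also where a configuration file could override the defaults.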

phillipeaton commented 2 years ago

Perhaps the best/only way to really get a good specification up front of how the classes would work is to start with an in-depth review of a number of assemblers and make a big table of how each aspect is handled. I'm thinking the core set of aspects is probably not that big.

The agile approach would be to get one assembler working, then make it work for two different assemblers, and make up the specification as you go along.

The alternative approach would be to extend your own assembler.

Sorry I can't be of more help...but hopefully my feedback as I'm using dasmfw with as65 will be useful. Potentially I'll move to one of the other assemblers that can output a symbol file to MAME or VICE, but until then, I can probably fabricate something using awk.

phillipeaton commented 2 years ago

It would appear that the SB Assembler, which covers many CPUs and has a long history, does use > and < for forced absolute and zero page addressing on 6502. https://www.sbprojects.net/sbasm/6502.php

Arakula commented 2 years ago

I've added a crude method to deal with this now. Crude, as it isn't nearly as generic as I'd like it to be, but it should cover most of the possible ways to specify forced zero-page / absolute addressing. For assemblers that support ".a" and ".z" appended to the mnemonic, you'd need to set the new options

option forcezpgaddr m+.z
option forceabsaddr m+.a

(see syntax for that weird string pattern in dasmfw.htm). I hope that is good enough ...

... although I fear it isn't. At least for the 68HC11, I've come across an assembler that requires a * as parameter prefix if an address is to be forced to direct page addressing, as it would use extended addressing otherwise. That's a behavior that dasmfw doesn't really deal with at the moment (it assumes things work the other way - a decent assembler should use direct page addressing when possible and only use extended addressing when forced to).

phillipeaton commented 2 years ago

I notice this issue is still open.

My originally stated problem appears resolved now that dasmfw can force a particular addressing, either zero-page or absolute (in 6502 terms). However, it can only work successfully if the original assembler was forced to use absolute addressing. I'm presuming a forced use of zero-page is unlikely, as it will not work if the jump target address is too far away.

Nonetheless, as you suggested, my problem with incorrect addressing is probably due to it appearing in what will likely be data areas, so I'm manually working around the issue without using forced addressing.

If you're using this issue as a placeholder for future work, fine by me to leave it open, but otherwise feel free to close the issue.

Arakula commented 2 years ago

You may be, but I'm not fully done with this issue yet. You see, while there's presumably no 6502 assembler that doesn't automatically use ZP addressing when in doubt, I recently came across some old source code for a Motorola 68HC11E1 that obviously defaults to extended addressing, at least for data items defined with an RMB (or .ds in that one's syntax) instruction, and requires prefixing the parameter with a '*' if direct addressing is wanted.

While stupid (leads to an insane amount of *s in the code and a waste of space and CPU cycles because it's so easy to forget the *, in which case the long form is generated), dasmfw currently can't easily reproduce that, except by putting an equally insane amount of forceaddr lines into the info file. I think I'll add a general "default to extended" option to dasmfw to make things easier; until that's done, I might as well leave this issue open.

phillipeaton commented 2 years ago

Possible incorrect disassembly of JMP (addr).

Working JETPAC binary in MAME debugger shows:

image

When disassembled with dasmfw (latest), source code shows:

image

as65 reassembled binary does not match original binary, an extra 00 is inserted:

image

Workaround is to use:

image

That gives:

image

Arakula commented 2 years ago

The extra inserted 00 is the "brk" that dasmfw invented (so the second 00 should be red). I'll look into it.

phillipeaton commented 2 years ago

Yes, I noticed that. BTW, my nfo file is here: https://github.com/phillipeaton/JETPAC_VIC-20_disassembly/blob/main/nfo_jetpac.nfo

You may recall I had many lines of data statements to stop zp/absolute address problems. I've taken another approach now...I've told dasmfw that everything is data and now I'm adding code statements. Now I only have three workarounds to make as described above. I was able to do this by playing the game with MAME Debugger and using the trackpc option, which highlights all of the code executed in a disassembly. I then added that to the nfo file and all of the zp/absolute issues went away (apart from the three mentioned above).
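The "everything is data, then mark the executed code" approach could be automated along these lines. This Python sketch assumes a list of executed instruction start addresses has already been extracted from the debugger by some means (the extraction step is not shown and depends on the tool), and it assumes info-file lines of the form `code START-END`; check the dasmfw documentation for the exact syntax. Both function names are hypothetical.

```python
def merge_into_ranges(addresses, max_step=3):
    """Merge instruction start addresses into contiguous runs.

    6502 instructions are 1 to 3 bytes long, so two executed instructions
    at most max_step bytes apart are treated as part of the same run.
    Each range ends at the start address of its last instruction.
    """
    ranges = []
    for addr in sorted(set(addresses)):
        if ranges and addr - ranges[-1][1] <= max_step:
            ranges[-1][1] = addr
        else:
            ranges.append([addr, addr])
    return [(start, end) for start, end in ranges]


def to_info_lines(ranges):
    """Render ranges as assumed info-file 'code' statements."""
    return ["code %04x-%04x" % (start, end) for start, end in ranges]


executed = [0x2000, 0x2002, 0x2005, 0x2100, 0x2101]  # made-up sample input
for line in to_info_lines(merge_into_ranges(executed)):
    print(line)
```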

Arakula commented 2 years ago

OK, should be fixed in https://github.com/Arakula/dasmfw/releases/tag/v0.30

phillipeaton commented 2 years ago

OK, should be fixed in https://github.com/Arakula/dasmfw/releases/tag/v0.30

Tonight's test shows that it appears to be fixed, many thanks! 😀