Open phillipeaton opened 2 years ago
I still have a lot of non-matching code between the original binary and the disassembled/reassembled binary, so I will look into this a bit more and report back.
I looked through about 25% of my code and there appear to be three kinds of occurrence that create differences between the original binary and the disassembled/reassembled binary (with as65):
I suspect that the missing bytes due to 1. are likely to be data items anyway, so I will add data statements to those and see how I get on.
I spent a couple of hours trying a handful of different assemblers on the dasmfw output, and so far none of them appears to be as compatible as as65, perhaps because it's quite old; generally they use .word, .const and similar to handle FCB, EQUs etc.
None of the assemblers are consistent with each other, so your idea of tailoring output to specific assemblers is likely to be necessary.
Now I will push forward with as65 and attempt to get my dasmfw output source code to assemble, and then look at other functionality. I suspect I may then get other problems with incompatibilities, but let's see.
I came across another disassembler/assembler pair, namely BeebAsm/BeebDis (Pages 1 & 8) (although they are written by different people). BeebDis seems to have a similar philosophy to dasmfw....
BeebDis takes the philosophy that disassembly is a partly interactive process, and that you will
need to run it several times against a piece of code modifying the parameters each time as you discover
the various areas of the code you are processing.
As such, BeebDis relies on the creation of a disassembly control file, which configures how disassembly
proceeds. This control file will contain directives that define the various areas of the disassembly
process.
...and includes some options for formatting output for an arbitrary assembler...
DefineByte <string>
How to define byte storage in the output code, defaults to ‘equb’ for 6502 and ‘fcb’ for 6809.
DefineWord <string>
How to define word storage (16 bits) in the output code, defaults to ‘equw’ for 6502 and ‘fdb’ for 6809.
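To illustrate what configurable storage directives of this kind make possible, here is a minimal Python sketch that renders the same values for different assemblers. The function name `format_data`, its parameters and the `$` hex prefix are assumptions made for this example; it is not BeebDis or dasmfw code.

```python
def format_data(values, directive="equb", width=2, per_line=8):
    """Render values as assembler data lines using the given directive.

    directive: the pseudo-op to emit (e.g. 'equb', 'fcb', '.byte').
    width:     hex digits per value (2 for bytes, 4 for 16-bit words).
    """
    lines = []
    for start in range(0, len(values), per_line):
        chunk = values[start:start + per_line]
        operands = ",".join(f"${v:0{width}X}" for v in chunk)
        lines.append(f"        {directive} {operands}")
    return lines


if __name__ == "__main__":
    data = [0x4C, 0x00, 0x20]
    print("\n".join(format_data(data, "equb")))            # 6502 default
    print("\n".join(format_data(data, "fcb")))             # 6809 default
    print("\n".join(format_data([0x2000], "equw", width=4)))
```

Only the directive string changes between targets, which is presumably why a single string option is enough for this part of the output.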
I'm not sure how mature/compatible this BeebDis/BeebAsm combo is, but it's probably worth a look. It didn't seem to have the same depth of functionality as dasmfw, but I only had a quick look.
I'll let you know how I get on with progress with as65.
1. Missing bytes due to absolute addressing reassembled with zero page addressing.
2. Jumps with targets at different locations due to the missing bytes caused by 1.
3. Strings with differences because I'm converting the whole disassembled source code to lower case to make the register names lower case.
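The first two items follow directly from instruction sizes. A minimal Python sketch, using the standard 6502 opcodes for `LDA` (`$A5` zero page, `$AD` absolute), shows where the byte goes missing. The helper `encode_lda` and the sample address `$009D` are illustrative only.

```python
LDA_ZP = 0xA5   # LDA zero page: opcode + 1 address byte
LDA_ABS = 0xAD  # LDA absolute:  opcode + 2 address bytes (little-endian)


def encode_lda(addr, force_absolute=False):
    """Encode LDA the way a size-optimising assembler would."""
    if addr <= 0xFF and not force_absolute:
        return bytes([LDA_ZP, addr])
    return bytes([LDA_ABS, addr & 0xFF, addr >> 8])


if __name__ == "__main__":
    original = encode_lda(0x009D, force_absolute=True)  # 3-byte form in the binary
    reassembled = encode_lda(0x009D)                    # assembler picks zero page
    print(original.hex(" "), "->", reassembled.hex(" "))
    print("bytes lost:", len(original) - len(reassembled))
```

Every byte lost this way shifts all the code that follows it, which is consistent with the jump targets moving.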
I suspect that the missing bytes due to 1. are likely to be data items anyway, so I will add data statements to those and see how I get on.
I suspected right. 26 `data` lines later and I have 8k disassembled/reassembled identically. All I need to do is search/replace A, X and Y registers to lower case before assembling; I need an awk script :-)
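The lower-casing step can be done without touching string contents. The following is a rough Python sketch standing in for the awk script; it assumes strings are delimited by `"` or `'` and that comments start with `;`, which may not match every assembler's syntax. The function name `lower_outside_strings` is made up for this example.

```python
def lower_outside_strings(line, quotes="\"'", comment=";"):
    """Lower-case a source line but leave quoted strings and comments alone."""
    out = []
    in_quote = None  # the quote character we are currently inside, if any
    for i, ch in enumerate(line):
        if in_quote:
            out.append(ch)
            if ch == in_quote:
                in_quote = None
        elif ch in quotes:
            in_quote = ch
            out.append(ch)
        elif ch == comment:
            out.append(line[i:])  # keep the rest of the line untouched
            break
        else:
            out.append(ch.lower())
    return "".join(out)


if __name__ == "__main__":
    for src in ['        LDA #$10,X', 'MSG     FCB "HELLO",0 ; Keep THIS']:
        print(lower_outside_strings(src))
```

One limitation: an unterminated character literal (a single quote with no closing quote) leaves the rest of that line unchanged, so such lines would need separate handling.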
Next step is to try some more dasmfw formatting commands.
What would be better in your opinion:
Convenience vs. flexibility vs. necessity ... if 100% of all use cases require the same setting for both, a separate option would be counterproductive ...
My gut feeling says one option only. Given the register name is included in the mnemonic name, e.g. the 3rd character of `lda`, referring to other registers using a different case on the same line would seem inconsistent.
I don't think I've seen that inconsistency in any of the docs for assemblers I've looked at recently, except in the DASM manual, though I don't believe it's a requirement to enter code that way.
The manual states case-insensitive for several features, although not regarding register names specifically:
I've uploaded a bigger update plus executable in https://github.com/Arakula/dasmfw/releases/tag/v0.26 that ...
Thanks for the code update! I can confirm that my raw dasmfw v0.26 disassembly now reassembles successfully with as65, without any manual changes to the disassembled source code. I'm using 26 dasmfw `data` instructions to avoid any zero-page issues and `option upmnemo off`.
I reviewed your "TODO" header text. From what I have ascertained recently, "<" and ">" are often used to assemble the top or bottom byte of an assembler variable, for an 8-bit CPU, I guess that could be quite useful.
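As a small illustration of the byte-selection use of these operators: in many 6502 assemblers `<value` yields the low byte and `>value` the high byte of a 16-bit value, though which symbol means which should be checked in the manual of the assembler in question. A Python equivalent, with hypothetical helper names:

```python
def low_byte(value):
    """What '<value' commonly evaluates to: bits 0-7."""
    return value & 0xFF


def high_byte(value):
    """What '>value' commonly evaluates to: bits 8-15."""
    return (value >> 8) & 0xFF


if __name__ == "__main__":
    addr = 0x1E9D  # arbitrary sample address
    print(f"low=${low_byte(addr):02X} high=${high_byte(addr):02X}")
```

This is typically used to load a 16-bit address into two 8-bit registers or zero-page locations, one half at a time.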
Regarding using $00 or $0000 to decide on absolute or zeropage, I don't think that's a bad solution to the problem if that's all that the assembler will accommodate, even if it's just a default method (though I appreciate it might be difficult to implement). However, I would question how robust a disassembly solution can ever be for absolute or zeropage addressing. I don't know why the programmer would want to force an absolute address, maybe for a special cycle-timing need, but otherwise I'm thinking you'd only ever do it sparingly. Unfortunately that will fool the disassembler often and you can never do anything about having the issue in data blocks. All you can really do for certain is manually force the address mode from the info file, like I'm doing (although it could be more elegant than the `data` instruction I'm using).
As I attempt to disassemble the jetpac binary, we'll see how usable dasmfw is overall, as there may be many formatting issues that aren't going to work and this will probably be the same for all assemblers, not just as65. I guess it'll be a question of identifying the disassembly elements that need a specific format and then working out the best way of specifying them. Then, ideally, you'd be able to set a group of formats to define a specific assembler, but that's an ideal scenario, you may have other priorities!
Anyway, I'll push forward with the detail of my disassembly and report back progress.
BTW, I also found yet another way of denoting absolute or zeropage:
Direct page, data bank, program bank indexed and long addressing modes of instructions are intelligently
chosen based on the instruction type, the address ranges set up by [.dpage](http://tass64.sourceforge.net/#d_dpage),
[.databank](http://tass64.sourceforge.net/#d_databank) and the current program counter address. Therefore
the ,d, ,b and ,k indexing is only used in very special cases.
The immediate direct page indexed #0,d addressing mode is usable for direct page access. The 8 bit
constant is a direct offset from the start of actual direct page. Alternatively it may be written as 0,d.
<< lots more descriptions>>
Then, ideally, you'd be able to set a group of formats to define a specific assembler, but that's an ideal scenario, you may have other priorities!
That is, in the long run, precisely what I plan to do. Solve the problem once and for all. But that will take time and careful planning, as there are so many options for the simplest things, even for the few disassemblers I have already implemented.
I don't know why the programmer would want to force an absolute address, maybe for a special cycle-timing need, but otherwise I'm thinking you'd only ever do it sparingly.
One scenario comes to my mind: a one-pass assembler with some code on the zero page referencing data that comes a bit later. A two-pass assembler might flag this as a phase error.
I've invested a silent hour into writing up some basics. Might as well share them with you, maybe you have some inputs ...
The basic idea is to provide a class that formats any output according to
the capabilities of a specific assembler.
The disassemblers would then format a line's contents as an array of items
and pass that to the Assembler class to format the output into lines matching
the selected assembler's methods.
Possible Items:
===============
text {cchar}
Text covering the rest of the line.
cchar would be a boolean that defines whether a leading comment
character is to be printed.
This item, if there, has to be the last in the array.
label {ldchar}
label for the current instruction.
ldchar would be a boolean that can be used to force output of the label
delimiter character. This can be overridden if a hypothetical
assembler always requires or doesn't support a label delimiter
character.
instruction
Assembler instruction (mnemonic or pseudo-op) to use.
I'm not sure yet how this could be realized in a way that's useful, but
does not overcomplicate everything. Would it be better to just pass the
ID of a specific instruction and let the Assembler class generate the
matching instruction, or should the mnemonic text be passed, and the
output formatter only decides on upper- and lowercase?
Presumably the first is better, but configuring that might become a
nightmare.
Possible solution: each disassembler for a specific processor gets a
companion class that subclasses Assembler with a defined set of IDs and
a default set of mnemonics which could be overridden in a configuration
file if needed. Doesn't look too bad.
parameter
One of the parameters used by the instruction.
This is even trickier than mnemonic above. Not yet sure how to capture
all the possible ways such a parameter can be passed. Also, what
exactly is a parameter? Looking at the simple 6809 instruction
LDA Base+1
... is that one parameter, or two with a given concatenation character,
or is that a set of 3 parameters, the middle one defining an addition?
Or, if "Base" is a known 16-bit word ... what is this then? A parameter
plus an offset, or a reference to the low byte of the parameter? Some
assemblers would be able to handle that, whereas others would require
the "+1" semantic.
Also, the addressing mode would have to be passed; this, however, can
define how to output one parameter or a complete set of parameters -
but not necessarily all of them.
Another uncomfortable thing: forced addressing. This can, depending on
the processor and the assembler, take some quite "interesting" forms,
where either the mnemonic or the parameter is decorated in some way,
or even both (like "an add instruction taking an 8- and a 16-bit
parameter storing the result in a 32-bit register").
Hmmm. Not easy. Obviously, some kind of hierarchy is needed.
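To make the item-array idea easier to discuss, here is a minimal Python sketch of one possible shape. dasmfw itself is not written in Python, and every class and attribute name below is hypothetical. The sketch treats a parameter as opaque text, so it deliberately sidesteps the open questions about `Base+1` and forced addressing.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Label:
    name: str
    force_delimiter: bool = False   # the "ldchar" flag from the notes


@dataclass
class Instruction:
    mnemonic: str


@dataclass
class Parameter:
    text: str


@dataclass
class Text:
    text: str
    comment_char: bool = True       # the "cchar" flag from the notes


Item = Union[Label, Instruction, Parameter, Text]


class Assembler:
    """Base formatter; subclasses override the per-assembler details."""
    label_delimiter = ":"
    comment_prefix = ";"
    upper_mnemonics = True

    def format_line(self, items: List[Item]) -> str:
        label, mnemonic, params, tail = "", "", [], ""
        for item in items:
            if isinstance(item, Label):
                label = item.name + (self.label_delimiter if item.force_delimiter else "")
            elif isinstance(item, Instruction):
                m = item.mnemonic
                mnemonic = m.upper() if self.upper_mnemonics else m.lower()
            elif isinstance(item, Parameter):
                params.append(item.text)
            elif isinstance(item, Text):
                tail = (self.comment_prefix + " " if item.comment_char else "") + item.text
        code = f"{label:<8} {mnemonic:<5} {','.join(params)}".rstrip()
        return f"{code:<32}{tail}".rstrip()


class LowerCaseAssembler(Assembler):
    """Hypothetical assembler flavour that wants lower-case mnemonics."""
    upper_mnemonics = False


if __name__ == "__main__":
    items = [Label("start"), Instruction("lda"), Parameter("Base+1"), Text("load")]
    print(Assembler().format_line(items))
    print(LowerCaseAssembler().format_line(items))
```

The disassembler side only builds the item list; case, delimiters and comment characters live in the formatter subclass, which is roughly the split the notes describe.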
That's it for today. Comments, precisions, etc. are very welcome.
Perhaps the best/only way to really get a good specification up front of how the classes would work is to start with an in-depth review of a number of assemblers and make a big table with how each aspect is handled; I'm thinking the core set of aspects is probably not that big.
The agile approach would be to get one assembler working and then make it work for two different assemblers and make up the specification as you're going along.
The alternative approach would be to extend your own assembler.
Sorry I can't be of more help...but hopefully my feedback as I'm using dasmfw with as65 will be useful. Potentially I'll move to one of the other assemblers that can output a symbol file to MAME or VICE, but until then, I can probably fabricate something using awk.
It would appear that the SB Assembler, which covers many CPUs and has a long history, does use > and < for forced absolute and zero page addressing on 6502. https://www.sbprojects.net/sbasm/6502.php
I've added a crude method to deal with this now. Crude, as it isn't nearly as generic as I'd like it to be, but it should cover most of the possible ways to specify forced zero-page / absolute addressing. For assemblers that support ".a" and ".z" appended to the mnemonic, you'd need to set the new options
option forcezpgaddr m+.z
option forceabsaddr m+.a
(see syntax for that weird string pattern in dasmfw.htm). I hope that is good enough ...
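The thread mentions two families of decoration: a suffix on the mnemonic (`.z`/`.a`) and a prefix on the operand (`<`/`>`, as reported for SB Assembler). A hedged Python sketch of that choice follows. The `style` names and the function `decorate` are invented for this example and do not reflect how dasmfw parses the `m+.z` pattern.

```python
def decorate(mnemonic, operand, force, style):
    """Return (mnemonic, operand) with a forced-addressing marker applied.

    force: 'zp', 'abs' or None (no forcing).
    style: 'mnemonic-suffix' (lda.z / lda.a) or
           'operand-prefix'  (<addr / >addr).
    """
    if force is None:
        return mnemonic, operand
    if style == "mnemonic-suffix":
        return mnemonic + {"zp": ".z", "abs": ".a"}[force], operand
    if style == "operand-prefix":
        return mnemonic, {"zp": "<", "abs": ">"}[force] + operand
    raise ValueError(f"unknown style: {style}")


if __name__ == "__main__":
    print(decorate("lda", "$009d", "abs", "mnemonic-suffix"))
    print(decorate("lda", "$009d", "abs", "operand-prefix"))
```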
... although I fear it isn't. At least for the 68HC11, I've come across an assembler that requires a * as parameter prefix if an address is to be forced to direct page addressing, as it would use extended addressing otherwise. That's a behavior that dasmfw doesn't really deal with at the moment (it assumes things work the other way - a decent assembler should use direct page addressing when possible and only use extended addressing when forced to).
I notice this issue is still open.
My originally stated problem appears resolved now that dasmfw can force a particular addressing mode, either zero-page or absolute (in 6502 terms). However, it can only work successfully if the original assembler was forced to use absolute addressing. I'm presuming a forced use of zero-page is unlikely, as it will not work if the jump target address is too far away.
Nonetheless, as you suggested, my problem with incorrect addressing is probably due to it appearing in what will likely be data areas, so I'm manually working around the issue without using forced addressing.
If you're using this issue as a placeholder for future work, fine by me to leave it open, but otherwise feel free to close the issue.
You may be, but I'm not fully done with this issue yet. You see, while there's presumably no 6502 assembler that doesn't automatically use ZP addressing in doubt, I recently came across some old source code for a Motorola 68HC11E1 that obviously defaults to extended addressing, at least for data items defined with a RMB (or .ds in that one's syntax) instruction, and requires prefixing the parameter with a '*' if direct addressing is wanted.
While stupid (leads to an insane amount of `*`s in the code and a waste of space and CPU cycles because it's so easy to forget the `*`, in which case the long form is generated), dasmfw currently can't easily reproduce that, except by putting an equally insane amount of `forceaddr` lines into the info file. I think I'll add a general "default to extended" option to dasmfw to make things easier; until that's done, I might as well leave this issue open.
Possible incorrect disassembly of `JMP (addr)`.
Working JETPAC binary in MAME debugger shows:
When disassembled with dasmfw (latest), source code shows:
The as65 reassembled binary does not match the original binary; an extra `00` is inserted:
Workaround is to use:
That gives:
The extra inserted 00 is the "brk" that dasmfw invented (so the second 00 should be red). I'll look into it.
Yes, I noticed that. BTW, my nfo file is here: https://github.com/phillipeaton/JETPAC_VIC-20_disassembly/blob/main/nfo_jetpac.nfo
You may recall I had many lines of `data` statements to stop zp/absolute address problems. I've taken another approach now... I've told dasmfw that everything is `data` and now I'm adding `code` statements. Now I only have three workarounds to make as described above. I was able to do this by playing the game with MAME Debugger and using the `trackpc` option, which highlights all of the code executed in a disassembly. I then added that to the nfo file and all of the zp/absolute issues went away (apart from the three mentioned above).
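The "everything is data, then mark what was executed as code" approach can be partly automated. The sketch below assumes you already have a list of executed instruction addresses (how they are extracted from MAME is not covered here) and that the info file accepts `code START-END` lines with hexadecimal addresses; both the function names and that line format are assumptions to check against dasmfw.htm.

```python
def code_ranges(executed, max_len=3):
    """Merge executed instruction addresses into (start, end) ranges.

    Two addresses join the same range when they are at most `max_len`
    bytes apart (3 is the longest 6502 instruction). `end` is the
    address of the last executed opcode, so its operand bytes are NOT
    covered and may need extending by hand.
    """
    ranges = []
    for addr in sorted(set(executed)):
        if ranges and addr - ranges[-1][1] <= max_len:
            ranges[-1][1] = addr
        else:
            ranges.append([addr, addr])
    return [(start, end) for start, end in ranges]


def to_info_lines(ranges):
    """Format ranges as hypothetical 'code START-END' info-file lines."""
    return [f"code {start:04X}-{end:04X}" for start, end in ranges]


if __name__ == "__main__":
    trace = [0x2000, 0x2002, 0x2005, 0x2100]  # made-up sample addresses
    print("\n".join(to_info_lines(code_ranges(trace))))
```

The gap heuristic can wrongly merge across one to three bytes of data sitting between two executed instructions, so the generated ranges are a starting point rather than a finished info file.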
OK, should be fixed in https://github.com/Arakula/dasmfw/releases/tag/v0.30
Tonight's test shows that it appears to be fixed, many thanks! 😀
Following on from the `forceaddr off` issue, I had a problem with as65 not assembling and that's why I was attempting workarounds with `forceaddr`:
A bit of research shows that the 6809-style `>z009d` isn't used for 6502 assembly. Here are some assembler manual links that describe how they deal with absolute and zeropage addressing.
This is the best description: ACME Assembler
This is also good: KickAssembler
And this one: DASM (PDF page 61/62, manual page 52/53)
And one more that shows it in action:
Generally, it seems there are two ways of managing absolute/zero page addressing:
`.z` or `.a` to the `lda`.
It seems to me that all the assemblers I looked at recognize 1. and newer assemblers recognize 1. & 2. There appear to be other forcing parameters also that some assemblers recognize, but I'm not sure they're really necessary; Kick Assembler has specifically deprecated them all, apart from `.z`/`.a`.
The as65 assembler specifically complains about the `jsr >z009d`, probably because `jsr` always uses a 16-bit address. I get four `jsr` errors in my listing, but there are many other instructions with this address mode, e.g. `asl >m0000`, that do not throw an error. It would appear '>' is valid for addresses, but, from what I can tell, it's for manipulating the address data at assemble time, not selecting address mode. I still have a lot of non-matching code between the original binary and the disassembled/reassembled binary, so I will look into this a bit more and report back.