Data alignment should wait for all consecutive .db directives

dewhisna commented 3 years ago

I'm working on adding AVR support to my Generic Code-Seeking Disassembler and one requirement I have for it is that any output it generates should be able to roundtrip back to the original binary via a designated compatible assembler. For the AVR targets, I've chosen your avra assembler and in general it has been working great for that. However, I have one issue that needs to be addressed.

I understand the need for the .db directive parsing for data in the code segment to be aligned to the processor opcode size, performed here: https://github.com/Ro5bert/avra/blob/de5272f53771e54112245b2aa72aeeb09a19147d/src/directiv.c#L842-L851

However, it should wait for all consecutive .db directives to be parsed before forcing the alignment. There are cases where the logical break of the data (due to things like needing labels for code referencing it or logical breaks at strings, etc) is on an odd-byte boundary and there will be following .db directives that completes the alignment. However, when reassembling, this assembler will see the odd number of bytes and force insert an extra 0x00 byte unnecessarily for each odd line and that then breaks the binary roundtrip back to the original image.

For example, suppose there's the following:

__c_2646:     .db     "|Pn:",0x00
__c_2637:     .db     "|FS:",0x00
__init:       clr     R1
              out     SREG_,R1
              ser     R28
              ldi     R29,0x08

Here, each .db directive is only 5 bytes, which would require alignment if they appeared alone. However, both together creates an even 10 bytes, which doesn't. They need to be broken into two lines, however, in order to get a label assigned to the second group to point to the address for the code to reference. But doing so causes the assembler to generate a Warning : A .DB segment with an odd number of bytes is detected. A zero byte is added. and insert an extra 00 byte for both lines, which then causes the resulting binary image from the assembly to no longer match the original because everything is shifted by these extra bytes (not to mention the extra memory consumed by this which could incorrectly push things off the end of available space).

This assembler should wait to do alignment only upon reaching the next line of code after a series of .db directives and insert the 00 byte then. In this example, the alignment check should should be delayed until the clr instruction and there if the current address isn't even, it should make the alignment fix.

Ro5bert commented 3 years ago

One of the goals of AVRA is to be compatible with AVRASM, and the behaviour you described is consistent with AVRASM. Indeed, the AVRASM manual says this explicitly:

If the expression-list [of a .db directive] contains an odd number of expressions, the last expression will be placed in a program memory word of its own, even if the next line in the assembly code contains a DB directive.

So, I'm not sure this should be fixed.

Without knowing much about what you're doing, perhaps you could work around this by merging consecutive .db directives and defining offsets with .equ (taking into account that labels are word-based, not byte-based).

dewhisna commented 3 years ago

Yes, running the .db field over (merging part of the content of the next field) and inserting an .equ is exactly how I've had to work around it. But it makes the resulting disassembly file somewhat difficult to read and looks a bit ridiculous, and it complicates my disassembler code to have it figure that out and inject the workaround. Here's a snippet from a disassembly output to illustrate its workaround:

__c_2607_lto_priv_39:
              .db     "Check Limits",0x00,'C'
.equ __c_2604_lto_priv_38 =       0x019D
              .db     "heck Door",0x00
__c_2601_lto_priv_37:
              .db     "Disabled",0x00,'E'
.equ __c_2598_lto_priv_36 =       0x01B1
              .db     "nabled",0x00,'C'
.equ __c_2595_lto_priv_35 =       0x01B9
              .db     "aution: Unlocked",0x00,0x27
.equ __c_2592_lto_priv_34 =       0x01CB
              .db     "$H",0x27,'|',0x27,'$X',0x27," to unlock",0x00,'R'
.equ __c_2588_lto_priv_33 =       0x01DF
              .db     "eset to continue",0x00,'['
.equ __c_2585_lto_priv_32 =       0x01F1
              .db     "MSG:",0x00,0x0D

The above roundtrips and works. However, each of these equates should really be a label on the last byte of the line before it. But it can't because of the rules of .db. And this is just a small sample from this one code file. The majority of the labels for the .db directives, not only in this code file, but all files being disassembled, need the workaround.

This example was created by taking the .elf file out of the compiler and feeding it through my disassembler. And if you take this disassembly and run it through AVRA, you then get the original binary content from the .elf file (i.e. it roundtrips from binary -> disassembler -> assembler -> binary).

I agree with your efforts to keep AVRA compatible with AVRASM. But I was hoping you'd consider adding an AVRA-only command-line switch or custom-directive that would allow merging of consecutive .db fields before doing the alignment to eliminate this workaround.

If you are wondering the purpose of my disassembler tools, it's for both reverse-engineering and code-recovery purposes. Now, most AVR projects are open-source projects to begin with, so reverse-engineering AVR code is often of limited importance. But code-recovery can be important.

My disassembler has been used successfully in the past to assist a company that managed to lose the source code to their current-shipping binary image. They had the source code for the previous versions and binary images of the previous versions and the current-shipping firmware image, but they needed to recover the source code they had lost for their current firmware.

The disassembler has a companion tool that I call the "fuzzy function analyzer" that uses a DNA sequence alignment algorithm to cross-analyze all of the disassembled functions from one image with all of the disassembled functions from another image and generate percentage of match, optimal-edit-scripts, symbol hit list analysis, etc. In other words, it's able to figure out when code is the same but only moved to different memory addresses or variables/symbols that are the same but have only shifted in memory due to the insertion of a few other variables/symbols and compare them across relocation.

Being able to take the disassembler output and reassemble it back to the original binary image it started with not only makes an ideal test suite for the disassembler, but it is essential in such code recovery projects when you start reworking things to get back to the original C-source code that it originated from.

It would be nice if the labels for the data could always be placed on the correct .db line without merging lines and injecting the .equ workarounds and not get the extra alignment bytes injected so that the binary content matches.

dewhisna commented 3 years ago

FWIW, I have looked through the AVRA code and I do realize that what I'm asking probably isn't a trivial change, due to how the addresses are tracked internally as word-addresses instead of byte-addresses. To implement this, I think it would require changing that logic to use byte-addresses instead and convert to word-addresses at the point of use, similar to how my disassembler tracks addresses as byte-addresses and does the word/byte conversion going to/from the operands.

So, I don't expect a quick fix or change for this forthcoming, but I did want to get it on your radar for consideration for future enhancements and to make you aware of its need and how AVRA is being used.

Ro5bert commented 3 years ago

Very interesting project!

I agree that .db directives should be handled in the way you describe, but doing so would require labels to be byte-based, which wouldn't merely be an implementation change; it would break compatibility with existing code, because labels can be used as integers in expressions. E.g., code using lpm usually manually converts word addresses to byte addresses. So, labels must remain as word addresses, at least from the user's point of view.

I suppose it's possible that labels could be byte addresses internally and implicitly divided by 2 whenever used in an expression (for backwards compatibility). Then you could access the byte value of a label using a new special assembler function or something. But that seems inelegant to me. What do you think?

I would consider adding an AVRA-specific variant of .db that works how you describe, but the label-on-an-odd-byte issue needs to be resolved first.

dewhisna commented 3 years ago

Thanks! I just did some searching and you are right. AVRASM does indeed use word-addressing for labels in code space. I can't believe they did that. It means there is no good way to hand-write assembly code and make use of packed data and handle knowing if a label should be on an odd-byte. And means you either can't have packed constants in code space or you have to invent some special magic in your code to know how to deal with it for conversions for the lpm instruction -- plus you have to remember to convert addresses for lpm.

I had just, wrongly, assumed that it would have used byte-addressing on all labels, as that would have certainly made much more sense, and been consistent with the gcc compiler and binutils (like readelf and objdump) and such. Then the user would have nothing special to do when writing code for the lpm instruction -- it would just work as-is. And the assembler would automatically handle the word alignment and word-address conversion on the packing of opcodes like jmp and call by just dropping the last bit.

I'm not sure how I feel about having a special assembler function or macro to access the byte-address version of labels. To me that seems backwards. I'll have to think about that one some more. I think I'd lean more towards a command-line argument that switched all labels to be byte-addresses instead. That wouldn't break existing code because they simply wouldn't use that argument when calling the assembler. And new code using the (IMO, more proper) byte-addressed labels wouldn't need to use any special functions/macros to access the byte-addressed labels and can just add the extra argument to the assembler command-line.

So I think a new command-line switch would be my vote.

dewhisna commented 3 years ago

Well, your assembler is certainly in good company. I was curious how other assemblers out there behave. So, I just tried the one in the ASxxxx suite by Alan Baldwin at Kent State University. Years ago when I was working on the 6811 part of this disassembler, I used his as6811 assembler for the roundtrip reassembly. He has asavr in that suite now, which I considered for this project, but I liked that yours conformed to the more "official" AVRASM and has a better .device directive and such.

Anyway, to do the test, I had to do some search-and-replace operations to change things like .dw to .word and .db to .byte, etc., to match that assembler's directives. When I give it the same file that I do AVRA (with an even number of bytes per line), it gives the same exact binary output. But when I break them on odd boundaries and insert labels, it assembles it just fine with no errors or warnings. However, it silently pads the previous line without warning. At least yours gives a nice warning message.

So it seems I'm out numbered and that all of the AVR assemblers behave this way. Maybe I shouldn't rock the boat.

Ro5bert commented 3 years ago

You're certainly right that byte addressing would have been better.

I guess adding some backwards compatible way to deal with byte addresses is still on the table, but discussion of that should take place in a separate issue. By the way, there's a partial rewrite of some AVRA internals underway, so if a feature to work with byte addresses were added, it would probably be released at the same time as that.

dewhisna commented 3 years ago

I was just thinking I would close this issue this morning, but you beat me to it. I agree with your assessment. And in actuality, it doesn't really affect the disassembly phase all that much -- mainly because these all have to be addressed indirectly through things like the lpm statement, which does the addressing through registers (unlike the lds instructions for instance) and that means the disassembler cannot resolve the exact address anyway -- since there's (for all practical purposes) an infinite way to derive the values in the 'Z' register.

The real reason for supporting these odd-byte labels is to assist in determining what code matches other similar code and where the compiler packed things on that odd-byte, which can factor into the percent match. And it would usually only have such labels if the image being disassembled is something like the .elf file with a symbol table. They are lost completely on the more common .hex files or if the image is read from the physical device.

Anyway, thanks for your time to look into this. And if you ever need a good disassembler for AVR, you know where to find one. :-)

Ro5bert / avra

Data alignment should wait for all consecutive .db directives #39