Problems with multi-bank references

baldengineer commented 1 year ago

I am disassembling the Apple IIc ROM Rev3 (345-0445-A), bank 1. Starting at $C800, I am getting a bunch of instructions decoded with a "▼", which leads to a confusing listing.

+00080C C80C: AE 66        ??--di?? .....                 ldx ▼ MOUXL           ;Get mouse info
+00080E C80E: C0           ??--di?? ....>                 cpy ▼ #$AC
+00080F C80F: AC           ??--di?? .....                 ldy ▼ LC000+103       ;As soon as we can
+000810 C810: 67           ??--di?? ....> LC810           nop
+000811 C811: C0           ??--di?? .....                 cpy ▼ #$D8
+000812 C812: D8           ??--di?? .....                 cld                   ;+ No decimal mode please
+000813 C813: 29 10        ??--di?? .....                 and   #$10            ;+ Test break bit
+000815 C815: C9           n?--di?? .....                 cmp ▼ #$10            ;+ C=1 if break. V unchanged
+000816 C816: 10           ??--di?? ....>                 bpl ▼ $C7C5
+000817 C817: AD           ??--di?? .....                 lda ▼ LC000+24
+000818 C818: 18           N?--di?? .....                 clc
+000819 C819: C0           N?--di?c .....                 cpy ▼ #$2D
+00081A C81A: 2D           ??--di?? .....                 and ▼ LC000+28
+00081B C81B: 1C C0        ??--di?? .....                 trb ▼ $29C0
+00081D C81D: 29           ??--di?? .....                 and ▼ #$80
+00081E C81E: 80           ??--di?? ...#.                 bra ▼ LC810

At $C80C, the instruction should be AE 66 C0, which would decode to ldx $C066 (I created the label MOUXL for that address). Likewise, $C80F should decode to ldy $C067, and so on.

I'm confused why this is happening. I've removed analyzer tags, formatting, and even tried starting as "inline data" before setting the code start point.

I must be doing something wrong, but I cannot figure out what.

fadden commented 1 year ago

The "attr" column for $c80e has a ">", which means $c80e (the 3rd byte of the LDX) is a branch target. If you select the line, the References window in the top left will tell you what is referencing that address. Because of the branch, the disassembler sees a second path through the code, one that starts at the last byte of the LDX instruction.

$c80e is interpreted as a two-byte CPY, which eats into the 3-byte LDY at $c80f, creating another mid-instruction opcode. This propagates for a bit, gets a break at the CLD, then starts up again when something branches into $c816.

The downward arrows are there to let you know that there are instructions with opcodes in the middle of them. Some code does this deliberately; there's an example near the bottom of https://6502bench.com/sgtutorial/odds-ends.html .
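To make the overlap concrete, here is a minimal Python sketch, not SourceGen's analyzer. The byte string is reassembled from the hex column of the listing in the first post, and the opcode table is a hypothetical subset holding only what this example needs. It walks the bytes once from $C80C and once from $C80E, then reports where the second walk starts an instruction inside one from the first.

```python
# Hypothetical mini-decoder: only the opcodes needed here, mapped to
# (mnemonic, instruction length in bytes).
OPCODES = {
    0xAE: ("ldx", 3), 0xAC: ("ldy", 3), 0xAD: ("lda", 3), 0x2D: ("and", 3),
    0xC0: ("cpy", 2), 0x29: ("and", 2), 0xC9: ("cmp", 2), 0xD8: ("cld", 1),
}

BASE = 0xC80C
# Bytes $C80C-$C81E, reassembled from the listing's hex column.
ROM = bytes.fromhex("AE66C0AC67C0D82910C910AD18C02D1CC02980")

def walk(entry):
    """Decode straight-line code from entry; return {address: (mnemonic, length)}."""
    found = {}
    addr = entry
    while addr - BASE < len(ROM):
        opcode = ROM[addr - BASE]
        if opcode not in OPCODES:
            break  # opcode outside the toy table: stop this walk
        mnemonic, length = OPCODES[opcode]
        if addr - BASE + length > len(ROM):
            break  # instruction would run past the sample bytes
        found[addr] = (mnemonic, length)
        addr += length
    return found

def embedded_starts(main, other):
    """Addresses where 'other' starts an instruction inside one of main's."""
    hits = []
    for start, (_, length) in main.items():
        for addr in range(start + 1, start + length):
            if addr in other:
                hits.append(addr)
    return sorted(hits)

intended = walk(0xC80C)   # the path the author of the ROM meant
stray = walk(0xC80E)      # the path created by the reference to $C80E
overlaps = embedded_starts(intended, stray)
```

With these bytes, `overlaps` contains only $C80E: the stray walk decodes a CPY there, inside the LDX, and its operand swallows the LDY opcode at $C80F. The stray walk stops at $C810 only because $67 is not in the toy table.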

If you want to send me the project (e-mail or attach here) I may be able to tell you more.

baldengineer commented 1 year ago

Ah okay, I did not know what the > in attributes meant. (And I completely missed the references panel!)

The issue is there are jmps to Bank 2, which the analyzer is linking together.

I haven't figured out how to handle the 2nd bank yet, but at least I understand the overall behavior better now.

Thanks!

fadden commented 1 year ago

I figured it was something like that. I had a similar issue while fiddling with Metroid (https://6502disassembly.com/nes-metroid/)... 8 banks of ROM, 7 mapped to the same address, code in the first bank referencing entry points in 5 out of the other 7.

Some issues can be resolved by setting the address regions appropriately. Sometimes you have to set the operand symbols explicitly, because there are multiple identical addresses. SourceGen treats the address map as a tree and does a depth-first search, but that doesn't disambiguate all situations.

Another example: ProSel's CAT.DOCTOR (https://6502disassembly.com/a2-prosel8/) does a bunch of relocations, so that map got pretty interesting. FWIW, the ASCII-art address map is generated by View > Show Address Map.

baldengineer commented 1 year ago

Thanks! I'll take a look at how those projects are structured. I know I need to figure this out, but it's not my primary focus for doing this exercise. So, I'll come back and probably have more questions. :)

FWIW, here are the project files I have been working on. The "bank 1" project is my third start of this process. I keep learning the things I did wrong midway through. :)

fadden commented 1 year ago

A few notes that may be helpful...

You don't need to put the .sym65 files in the SourceGen RuntimeData directory. You can just put them next to your project file, and use "add symbol files from project" instead of "add symbol files from runtime". (They also shouldn't be copyright faddenSoft, since you wrote them.) This would remove the manual installation step from the download.

The relevant section of the manual (hit F1 and find the "Platform Symbol Files (.sym65)" section, or open RuntimeData/Help/advanced.html) has some additional details on the .sym65 format that may be helpful. For example, addresses and constants are specified differently, so the address resolver doesn't try to use the constants, and you can specify different symbols for read vs. write operations on memory-mapped I/O locations.

I couldn't really play with the project because I didn't find a ROM binary. The html output looks like you're making good progress.

The stretch at $c780 caught my eye because of the 24-bit math:

C780: 8D 28 C0     swrti           ADR     ROMBANK+$BF6865   ;RTI to the other bank

Looks like that got turned into data rather than code (those are alternating STA/JMP). A couple of them are referenced with JSRs; the others might need code start tags on the $8Ds.

Mapping the chunk at $c000-c0ff to a different address (or no address at all) might be necessary if you want the address resolver to find the project/platform symbols for the I/O addresses. SourceGen prioritizes in-file addresses over external addresses.

baldengineer commented 1 year ago

Thanks for the follow-up. Regarding the license, oops! I intended to clean those files up before sharing them (and then forgot.)

I'll re-read the manual section on symbol files. Your instructions make more sense now that I've created and used them a bit.

The issue around $C780 is because I flattened the code in that area to be inline data. It is all stuff that jumps to the second bank. So, by effectively ignoring it, the rest of the project is easier to read. (Side note, for the immediate goal I am trying to accomplish, I just need to see when code jumps to that block. It's just a jump table anyway, so I only care when I see things going to that address range. Those are the jumps I need to patch around.)

I'll keep working to understand how to do address mapping. Thanks again!

fadden commented 1 year ago

Ah...

C760: 4C 0E C8                     ADR     fixlc+$C7463E

That's what caused the problem in the initial report.

fadden commented 1 year ago

I've been thinking about the issue of references to overlapping banks. The problem at hand is that there is a reference to an address (such as $c80e) that exists in more than one place. The difficulty is that the disassembler's code analyzer wants to map that address to a file offset. There are three basic scenarios:

1. The target offset doesn't exist in the file. For your project, $c80e can only be mapped to the offset of the wrong $c80e, because it's the only place that address exists.
2. The target offset does exist in the file. Multi-bank situations are easy to handle when the only references are to locations within the same address region, because the address-to-offset resolver has a notion of scoping rules, and will bind to the closest thing. It's more difficult to handle when the references cross banks, because there's no way to tell the code analyzer which of the locations to use.
3. The target offset exists in multiple places. This was the case in Metroid, where the "game engine" chunk made calls to $80b0, which existed in 5 of the 7 ROM banks mapped into $8000-bfff. (Annoyingly, the bank that the code analyzer decided to use as the target wasn't one that had an entry point at $80b0, leading to a situation like we have here.)
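The ambiguity in these scenarios can be sketched in a few lines of Python. This is a simplified model and not SourceGen's resolver: the `Region` type, the flat region list, and the "prefer the source's own region" rule are all assumptions made for illustration, and the two-bank layout (two 16KB banks both mapped at $c000) is likewise illustrative.

```python
from typing import NamedTuple

class Region(NamedTuple):
    name: str
    offset: int    # file offset of the region's first byte
    length: int
    address: int   # CPU address the first byte is mapped to

def candidates(regions, address):
    """Every (region name, file offset) that the CPU address could refer to."""
    hits = []
    for r in regions:
        if r.address <= address < r.address + r.length:
            hits.append((r.name, r.offset + (address - r.address)))
    return hits

def resolve(regions, address, source_region):
    """Assumed scoping rule: prefer the source's own region, else first match."""
    hits = candidates(regions, address)
    for name, offset in hits:
        if name == source_region:
            return offset
    return hits[0][1] if hits else None

# Only one bank is in the file: the only $c80e available is the wrong one.
one_bank = [Region("bank1", 0x0000, 0x4000, 0xC000)]

# Both banks are in the file: the address alone cannot say which is meant.
two_banks = [
    Region("bank0", 0x0000, 0x4000, 0xC000),
    Region("bank1", 0x4000, 0x4000, 0xC000),
]

only_choice = candidates(one_bank, 0xC80E)
both = candidates(two_banks, 0xC80E)
picked = resolve(two_banks, 0xC80E, "bank1")
```

In the two-bank case `both` has two entries, and the scoping rule picks the copy in the source's own bank, which is exactly the wrong answer for a cross-bank jump. With more banks mapped to the same range, as in Metroid, the candidate list simply grows.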

Ideally it would be possible to add something to the operand that told it which of the various addresses were the correct ones, so that the code analyzer could automatically visit all of them. The operand editor would need to have a list of checkboxes, one per potential target offset. In practice this is probably more confusing and more work than just adding a code start tag at those offsets, and would be difficult to maintain if the address map was updated. The one clear advantage it has is that the References list would be correct.

In theory we could use a symbol specified for the operand as a signal. If the operand is given a symbol that is defined in a different part of the address map, we could start the offset resolution process in that region instead of the instruction's region. This doesn't help with multiple targets though, and I'm not sure how this would affect existing behavior. (Also, we don't normally apply labels until after the code analyzer runs.)

A simpler approach would be to add a "do not follow" checkbox for absolute branch instructions (JMP/JSR). If set, the code analyzer simply doesn't follow the trail. For this project, the box could be checked on the various JMP instructions to eliminate the mid-instruction execution seen in the initial problem report. This isn't ideal, but it's fairly straightforward, and eliminates the annoying multi-path code issue.
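A rough Python sketch of what "do not follow" would mean for a code tracer. The feature is only a proposal at this point in the thread, so the `no_follow` parameter, the toy opcode table, and the tracer itself are hypothetical. The bytes come from the listings above: the JMP at $C760 and the start of the code at $C80C.

```python
# Hypothetical subset of opcodes: (mnemonic, instruction length in bytes).
OPCODES = {
    0x4C: ("jmp", 3), 0xAE: ("ldx", 3), 0xAC: ("ldy", 3),
    0xC0: ("cpy", 2), 0xD8: ("cld", 1),
}

def load(memory, base, hexbytes):
    """Place bytes into a sparse {address: byte} memory image."""
    for i, value in enumerate(bytes.fromhex(hexbytes)):
        memory[base + i] = value

def trace(memory, entries, no_follow=frozenset()):
    """Return every address decoded as the start of an instruction."""
    starts = set()
    work = list(entries)
    while work:
        addr = work.pop()
        while addr not in starts and memory.get(addr) in OPCODES:
            starts.add(addr)
            mnemonic, length = OPCODES[memory[addr]]
            if mnemonic == "jmp":
                # Follow the absolute target unless this JMP is flagged.
                if addr not in no_follow:
                    work.append(memory[addr + 1] | (memory[addr + 2] << 8))
                break  # execution never falls through a JMP
            addr += length
    return starts

memory = {}
load(memory, 0xC760, "4C0EC8")          # jmp $C80E, intended for the other bank
load(memory, 0xC80C, "AE66C0AC67C0D8")  # ldx $C066 / ldy $C067 / cld

followed = trace(memory, [0xC760, 0xC80C])
flagged = trace(memory, [0xC760, 0xC80C], no_follow={0xC760})
```

`followed` includes $C80E, the mid-instruction start from the initial report, while `flagged` does not. The JMP itself is still decoded in both cases; only the trail to its target is dropped.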

I've added a TO DO list item for this.

fadden commented 5 months ago

This might actually work with the "isolated region" concept from issue #139. $c780-c7ff in each bank would be marked as isolated so that it didn't try to resolve symbols in the current bank.

I think there's still value in a "do not follow" checkbox for fixing up individual items, but considering multi-bank NES games like Metroid, I think when it comes to ROM banking there are segments that "reach out" and segments that expect to be reached into.

fadden commented 5 months ago

Here's a quick project using the new address space isolation features to put the entire ROM in a single project. I did a rough setup with the regions on the 32KB ROM file:

rom-region-example.ZIP

Use Navigate > View Address Map to see an overview of the region structure...

Address region map for "03-342-0445-A.bin"

+000000  +- start 'BANK0' [!in] [!out]
+000000  | +- start 'BANK0'
         | |  -NA-  length=256 ($0100)
+0000ff  | +- end
         | 
         |  $c100 - $c6ff  length=1536 ($0600)
+000700  | +- start [!out]
         | |  $c700 - $c7ff  length=256 ($0100)
+0007ff  | +- end
         | 
         |  $c800 - $cfff  length=2048 ($0800)
+001000  | +- start [!in]
         | |  $d000 - $f7ff  length=10240 ($2800)
+0037ff  | +- end
         | 
         |  $f800 - $ffff  length=2048 ($0800)
+003fff  +- end

+004000  +- start 'BANK1' [!in] [!out]
+004000  | +- start 'BANK1'
         | |  -NA-  length=256 ($0100)
+0040ff  | +- end
         | 
         |  $c100 - $c77f  length=1664 ($0680)
+004780  | +- start 'bank_swp_table' [!out]
         | |  $c780 - $c7ff  length=128 ($0080)
+0047ff  | +- end
         | 
         |  $c800 - $dbff  length=5120 ($1400)
+005c00  | +- start 'StartTest' [!in]
         | |  $2000 - $23ff  length=1024 ($0400)
+005fff  | +- end
         | 
+006000  | +- start [!in]
         | |  $e000 - $ffff  length=8192 ($2000)
+007fff  | +- end
         | 
+007fff  +- end

The isolation feature prevents the 16KB banks from being aware of each other, and prevents the $c7xx code from trying to create symbols in the current bank. I put the page at $c000 in non-addressable space, since it's actually memory-mapped I/O. I also wrapped the Applesoft area just so I could slap a "junk bytes" on the entire thing.
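One way to picture the [!in] and [!out] flags is the Python sketch below. It reflects a reading of the map above, not SourceGen's implementation: the assumed rule is that a reference may not leave a region marked [!out] and may not enter a region marked [!in]. The `Region` class and `may_resolve` are hypothetical names, and only four regions from the map are modeled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(eq=False)  # compare regions by identity, not by field values
class Region:
    name: str
    offset: int
    length: int
    address: int
    parent: Optional["Region"] = None
    no_in: bool = False    # shown as [!in] in the map
    no_out: bool = False   # shown as [!out] in the map

def chain(region):
    """The region followed by its ancestors, innermost first."""
    out = []
    while region is not None:
        out.append(region)
        region = region.parent
    return out

def may_resolve(src, dst):
    """Assumed rule: leaving a [!out] region or entering a [!in] region is blocked."""
    src_chain, dst_chain = chain(src), chain(dst)
    exited = [r for r in src_chain if r not in dst_chain]
    entered = [r for r in dst_chain if r not in src_chain]
    return not any(r.no_out for r in exited) and not any(r.no_in for r in entered)

# A subset of the region map above.
bank0 = Region("BANK0", 0x0000, 0x4000, 0xC000, no_in=True, no_out=True)
bank1 = Region("BANK1", 0x4000, 0x4000, 0xC000, no_in=True, no_out=True)
swap0 = Region("c700", 0x0700, 0x0100, 0xC700, parent=bank0, no_out=True)
swap1 = Region("bank_swp_table", 0x4780, 0x0080, 0xC780, parent=bank1, no_out=True)

cross_bank = may_resolve(bank0, bank1)        # banks cannot see each other
table_to_own_bank = may_resolve(swap1, bank1) # swap table does not bind locally
bank_to_table = may_resolve(bank1, swap1)     # JSRs into the table still resolve
```

Under this rule the swap table's JMP targets are left unresolved instead of binding to the current bank, while code in the bank can still reference entries in the table.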

Note that SourceGen commands like Goto (Ctrl+G), when given an address, will jump to the matching address closest to the selected line. So jumping to "c700" will go to either the first or second bank depending on where you start from.
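As a small illustration of "closest to the selected line", assuming closeness is measured by file-offset distance (an assumption; the real rule may follow the region tree instead): in the map above, $c700 lives at file offset +000700 in the first bank and +004700 in the second. The function name `goto_offset` is hypothetical.

```python
def goto_offset(candidate_offsets, selected_offset):
    """Pick the candidate nearest to the current selection (assumed metric)."""
    return min(candidate_offsets, key=lambda off: abs(off - selected_offset))

C700_OFFSETS = [0x0700, 0x4700]  # $c700 in the first and second bank

from_first_bank = goto_offset(C700_OFFSETS, 0x0100)
from_second_bank = goto_offset(C700_OFFSETS, 0x5000)
```

Starting from +000100 this lands on +000700, and starting from +005000 it lands on +004700.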

I think this addresses the problems you were having.

fadden / 6502bench

Problems with multi-bank references #147