Logbook for version 0.1

dirkwhoffmann commented 5 years ago

I am going to use this thread as a logbook in the near future to document the progress towards version 0.1. V0.1 should have about the same functionality as UAE 0.1. We'll then have a (completely useless) emulator that can do nothing but show the Workbench initial screen. To speak with a picture:

This is the current situation:

We have

a CPU (Musashi core),
memory,
two (yet unconnected) CIAs.

So let’s see how far we can get with this. These are the first lines of Kickstart 1.2 which I want to step through:

FC00D2  lea       040000,SP         Set stack pointer to top of first 128K.
FC00D8  move.l    #$020000,D0
FC00DE  subq.l    #1,D0             Delay loop.
FC00E0  bgt.s     FC00DE

        ; If the ROM is also visible at F00000, or if there is another
        ; ROM there, jump there.

FC00E2  lea       FC0000(PC),A0     Load base address of ROM we're in.
FC00E6  lea       F00000,A1         Load (absolute address) F00000.
FC00EC  cmp.l     A1,A0             Are we at F00000?
FC00EE  beq.s     FC00FE            If so, don't execute the following.
FC00F0  lea       FC00FE(PC),A5     This is relative, i.e. always points
                                    12 bytes down from where we are.
FC00F4  cmp.w     #$1111,(A1)       If "1111" not found at F00000, then
FC00F8  bne.s     FC00FE            continue running below, else start
FC00FA  jmp       2(A1)             running at F00002.

        ; Set up port A on the first CIA (8520-A).

FC00FE  move.b    #3,BFE201         Set low two bits for output.
FC0106  move.b    #2,BFE001         Set boot ROM off, power light dim.

Before we can get started, we need to install the Kickstart Rom. This is done in the hardware preferences. By default, the Aros replacement Rom is installed.

It can be replaced by an original Rom via drag & drop, so why stick to the clone if we can have the real stuff 😎:

The Kickstart Rom is usually located in the upper memory area. On startup, the Amiga mirrors it in the lower memory banks to enable the CPU to find the correct start vector. The memory inspector shows the details:

When powering on the Amiga, the CPU loads the start vector from the mirrored Kickstart Rom and jumps to address FC00D2. For testing purposes I let the emulator stop at FC00DE at a predefined breakpoint which can be watched in the CPU panel:

Let’s set another breakpoint at FC00FE by double clicking the corresponding line in the program window:

By pressing the Run button the CPU starts and stops at FC00FE.

Pretty nice so far 🥳, but at this point the Kickstart Rom writes into the CIA registers 🙁. Two CIAs are already present in the current implementation, but they are not yet connected to memory. Therefore I have to stop here. I'll continue this thread once the CIAs are connected. Stay tuned ...
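As a side note on how the CPU ends up at FC00D2: after a reset, a 68000 reads its initial stack pointer from address 0 and its initial program counter from address 4. The following is only a minimal sketch of that lookup, not the emulator's code. The eight header bytes are an assumption about the Kickstart 1.2 image (they are at least consistent with the "1111" check and the jmp 2(A1) in the listing above), and read32, ResetVector and fetchResetVector are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed first eight bytes of the ROM image: the magic word $1111,
// followed by a JMP to $FC00D2 (opcode $4EF9 plus a 32-bit address).
constexpr std::array<uint8_t, 8> kRomHeader = {
    0x11, 0x11, 0x4E, 0xF9, 0x00, 0xFC, 0x00, 0xD2
};

// Reads a big-endian 32-bit value, the byte order used by the 68000.
inline uint32_t read32(const std::array<uint8_t, 8> &mem, std::size_t addr) {
    assert(addr + 4 <= mem.size());
    return (uint32_t(mem[addr]) << 24) | (uint32_t(mem[addr + 1]) << 16) |
           (uint32_t(mem[addr + 2]) << 8) | uint32_t(mem[addr + 3]);
}

struct ResetVector {
    uint32_t ssp; // initial supervisor stack pointer, read from address 0
    uint32_t pc;  // initial program counter, read from address 4
};

// While the ROM is mirrored into low memory, addresses 0..7 show the
// first eight bytes of the ROM, so the vector comes straight from there.
inline ResetVector fetchResetVector(const std::array<uint8_t, 8> &mirroredRom) {
    return { read32(mirroredRom, 0), read32(mirroredRom, 4) };
}
```

Under these assumptions the stack pointer value is meaningless, which fits the fact that the very first Kickstart instruction sets up SP itself.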

mithrendal commented 5 years ago

As far as I have read somewhere, the blitter draws the complete picture (the Workbench initial screen) with multiple single draw operations. I think you will have to implement at least the blitter logic for it, then. That means the first goal is already set pretty high ... an ambitious goal, Dirk!! I totally like this logbook story format ;-) ...

dirkwhoffmann commented 5 years ago

You are right. Although V0.1 won't do much from a user's perspective, a lot of under-the-hood stuff needs to work to make the image appear.

Right now, I'm trying to find out how the DMA time slot allocation is implemented in SAE (the JavaScript UAE clone):

It must be a core piece of the emulator, but I didn't find it yet. What I did find is two event tables: eventtab and eventtab2.

The first one covers four events:

 SAEC_Events_EV_CIA
 SAEC_Events_EV_HSYNC
 SAEC_Events_EV_AUDIO
 EV_MISC

The second one covers two:

 SAEC_Events_EV2_BLITTER
 SAEC_Events_EV2_DISK

I guess the Disk, Blitter and Audio DMA cycles are implemented within the event handlers. Unfortunately, I didn't find the code fragments that implement Sprite and Bitplane DMA. It is quite difficult to crawl through the UAE or SAE code because there are hardly any comments and the code is far from self-explanatory.

mithrendal commented 5 years ago

[image]

I attached a crisper image because in your pic the slots for 68k, blitter and copper are hard to distinguish from the 320 mode bitplane DMA.

It is taken from bloodline's blog, where he explains how he would implement the DMA sequencer (watch out for "Theory time"): http://eab.abime.net/showthread.php?t=90316&page=9

Detailed information about the DMA sequence: http://amigadev.elowar.com/read/ADCD_2.1/Hardware_Manual_guide/node012B.html

dirkwhoffmann commented 5 years ago

Hmm, in SAE there is a lot of stuff going on in functions hsync_handler_pre() and hsync_handler_post(). It seems like SAE is not cycle-accurate, but line-accurate 🤔. I'm a little bit confused right now.

mithrendal commented 5 years ago

I found something in custom.js: the function DMACON(v, hpos) looks like it might be a dispatcher for DMA slots... it looks like it starts copper actions, bitplane_DMA and blitter operations depending on the hpos of a horizontal line, as documented in the hardware reference manual...

What is this readmap and writemap thing? It seems to be some sort of function map depending on hpos?

In WinUAE it is different: in custom.cpp there is a function "static int dma_cycle (void)" which maybe does the DMA sequencing? Or am I wrong...

dirkwhoffmann commented 5 years ago

DMACON and DMACONR seem to be the write and read handlers for the DMACON register, respectively. They only occur in the read and write maps for the OCS registers:

readMap[0x002 >> 1] = DMACONR;
writeMap[0x096 >> 1] = DMACON;
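To illustrate what such maps do, here is a minimal sketch of a register dispatch table indexed by "register offset >> 1". It is not SAE's code: the struct CustomChips, the 256-entry table size, the peek/poke names and the plain store in the write handler are all assumptions made for the example (the real DMACON write handler does more than store the value).

```cpp
#include <array>
#include <cassert>
#include <cstdint>

struct CustomChips {
    uint16_t dmacon = 0;

    using ReadFn  = uint16_t (*)(CustomChips &);
    using WriteFn = void (*)(CustomChips &, uint16_t);

    // One slot per 16-bit register; unmapped slots stay nullptr.
    std::array<ReadFn, 256>  readMap{};
    std::array<WriteFn, 256> writeMap{};

    CustomChips() {
        // Same indexing scheme as in the SAE snippet: offset >> 1.
        readMap[0x002 >> 1]  = [](CustomChips &c) -> uint16_t { return c.dmacon; };
        writeMap[0x096 >> 1] = [](CustomChips &c, uint16_t v) { c.dmacon = v; };
    }

    // Only the register offset (address bits 8..1) selects the handler.
    uint16_t peek(uint32_t addr) {
        ReadFn fn = readMap[(addr & 0x1FE) >> 1];
        return fn ? fn(*this) : 0;
    }
    void poke(uint32_t addr, uint16_t value) {
        WriteFn fn = writeMap[(addr & 0x1FE) >> 1];
        if (fn) fn(*this, value);
    }
};
```

The point of the table is that a handler only runs when the CPU (or Copper) actually touches the register, so DMACON(v, hpos) is a write handler rather than a dispatcher that runs every cycle.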

Thanks a lot for the links to the DMA sequencer forum thread. Might be very useful for us!

dirkwhoffmann commented 5 years ago

I've found this in SAE:

this.events_dmal_hsync = function() {
 ...
        SAER.events.event2_newevent_xx(-1, 7 * SAEC_Events_CYCLE_UNIT, 13, function(v) {
            while (dmal) {
                if (dmal & 3)
                    dmal_emu(dmal_hpos + ((dmal & 2) ? 1 : 0));
                dmal_hpos += 2;
                dmal >>>= 2;
            }
        });
    }

This is called after every HSYNC. It schedules an event that runs dmal_emu in a loop. dmal_emu performs disk or audio DMA, depending on the horizontal position. This means that the CPU cannot interfere because the DMA is emulated in a single chunk. This indicates line-accuracy, but I always thought that UAE is cycle-accurate.
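To make the bit handling in that loop easier to follow, here is a small C++ port that only records the horizontal positions at which dmal_emu would be called. The function name dmalPositions and the idea of returning the positions as a vector are my own additions; what dmal_emu does at each position is not modelled.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Each pair of bits in `dmal` covers two horizontal positions. If either
// bit of the pair is set, one DMA access is emulated; bit 1 of the pair
// selects the odd position.
inline std::vector<int> dmalPositions(uint32_t dmal, int startHpos) {
    std::vector<int> positions;
    int hpos = startHpos;
    while (dmal) {
        if (dmal & 3)
            positions.push_back(hpos + ((dmal & 2) ? 1 : 0));
        hpos += 2;
        dmal >>= 2; // unsigned shift, like >>>= in the JavaScript original
    }
    return positions;
}
```

For example, dmal = 0x9 (binary 1001) starting at hpos 0 yields the positions 0 and 3. The whole loop runs inside a single event, which is why the CPU cannot get in between two of these accesses.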

I'm wondering how bitplane data is read in...

dirkwhoffmann commented 5 years ago

More findings:

An HSYNC event calls hsync_handler_pre(), which calls finish_decisions(), which calls decide_sprites(), which calls record_sprite(), which processes sprite data for all pixels in one row (or fewer pixels if it is called in the middle of a rasterline).

Interestingly, decide_sprites() is also called in the write handler SPRxPTH. This indicates that UAE / SAE uses the following design principle:

If all OCS registers are stable, all OCS activity is emulated in one chunk after each rasterline.
If an OCS register changes, the OCS actions are emulated up to the current horizontal position, the value is written, and the rest is emulated at HSYNC.

At least it seems to work this way for sprites.
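The principle can be sketched in a few lines. This is only an illustration of the "catch up on write, finish at HSYNC" idea with a single made-up colour register; the class LazyLine and its members are hypothetical and have nothing to do with the actual UAE / SAE data structures.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

class LazyLine {
public:
    explicit LazyLine(int width) : pixels(width, 0) {}

    // Register write at horizontal position hpos: first emulate everything
    // up to hpos with the old value, then latch the new value.
    void pokeColor(int hpos, uint8_t value) {
        renderUpTo(hpos);
        color = value;
    }

    // End of the rasterline: emulate whatever has not been emulated yet.
    void hsync() {
        renderUpTo(static_cast<int>(pixels.size()));
        renderedUpTo = 0; // the next line starts from scratch
    }

    const std::vector<uint8_t> &line() const { return pixels; }

private:
    void renderUpTo(int hpos) {
        assert(hpos >= 0 && hpos <= static_cast<int>(pixels.size()));
        for (int x = renderedUpTo; x < hpos; x++) pixels[x] = color;
        if (hpos > renderedUpTo) renderedUpTo = hpos;
    }

    std::vector<uint8_t> pixels;
    int renderedUpTo = 0;
    uint8_t color = 0;
};
```

If no register changes during a line, hsync() renders the whole line in one chunk. If a register is written at hpos 5, pixels 0 to 4 keep the old value and the rest get the new one.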

I'm pretty much in favour of emulating DMA access as it is explained in the Raspberry Pie Forum thread (VICII in VirtualC64 works the same way). It would be slower, but it seems simpler and less error-prone.

mithrendal commented 5 years ago

Raspberry Pie Forum ?
You are referring to English Amiga Board > Coders > Coders. Asm / Hardware > Baremetal Amiga Emulator, aren't you? This is no raspi-forum, it is the "English Amiga Board", a resource with loads of information around the Amiga including WinUAE, FSUAE, Coding, Software, Hardware, etc. ;-)

I totally agree with you, the way bloodline is going is my favourite, too. It is a simpler, cleaner approach and maybe a bit slower. I bet that the decisions stuff came from a time when computers were much slower and could barely handle cycle-exact Amiga emulation, so they had to optimise and paid for that with complicated code. (Of course, that optimisation is still reasonable for JavaScript, which is slow, but I bet it is not necessary anymore for clang-compiled code.)

dirkwhoffmann commented 5 years ago

The CIAs are connected to memory now, so we are able to process a few more instructions. We are here at the moment:

       ; Set up port A on the first CIA (8520-A).

FC00FE  move.b    #3,BFE201         Set low two bits for output.
FC0106  move.b    #2,BFE001         Set boot ROM off, power light dim.

The first move instruction configures pins PA0 and PA1 of CIAA as output and the second move instruction sets PA0 to 0 and PA1 to 1. PA0 controls the Kickstart overlay, and writing a 0 means that the Kickstart should no longer be overlaid. When the two instructions are executed, the memory panel shows that this is indeed the case. We now see the Chip Ram blended in:

PA1 controls the LED. The LED is switched off, but because the LED was already switched off before, we don’t see anything of interest.
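The effect of the two writes can be sketched as follows. This is a simplified model, not the actual CIA code: the struct CiaPortA is hypothetical, and it assumes that pins configured as inputs read back as 1, which would explain why the overlay is active and the LED is off before Kickstart touches the port.

```cpp
#include <cassert>
#include <cstdint>

struct CiaPortA {
    uint8_t ddra = 0x00; // data direction register ($BFE201): 1 = output
    uint8_t pra  = 0x00; // port register ($BFE001): value driven on outputs

    // Output pins show the PRA bit, input pins are assumed to float high.
    uint8_t pins() const {
        return static_cast<uint8_t>((pra & ddra) | ~ddra);
    }
    bool overlay() const   { return (pins() & 0x01) != 0; } // PA0
    bool ledDimmed() const { return (pins() & 0x02) != 0; } // PA1
};
```

With ddra = 3 and pra = 2, pins() returns $FE: PA0 is 0 (overlay off) and PA1 is 1 (power light dim), which matches the description above.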

The next important instructions are:

FC0118  move.w    D0,$9A(A4)        Disable all interrupts.
FC011C  move.w    D0,$9C(A4)        Clear all pending interrupts.
FC0120  move.w    D0,$96(A4)        Disable all DMA.

The first two statements reset the interrupt registers, which can be verified in the Paula panel:

I agree that it’s not that spectacular, because 0 is also the initial value on startup. However, a temporary debug message in the console tells me that the registers have indeed been written to:

Paula: pokeINTENA(7FFF)
Paula: pokeINTREQ(7FFF)

The next instruction produces

Memory: WARNING: pokeCustom16(DFF096, 7FFF): MISSING IMPLEMENTATION

So it’s clear what to implement next: The DMACON register inside Agnus…
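For reference, INTENA, INTREQ and DMACON share the same write convention: bit 15 of the written value decides whether the remaining bits are set or cleared. That is why writing $7FFF (bit 15 = 0) disables everything. A minimal sketch of that rule, with applySetClr being a name I made up:

```cpp
#include <cassert>
#include <cstdint>

// Bit 15 = 1: set all bits that are 1 in the lower 15 bits of `value`.
// Bit 15 = 0: clear all bits that are 1 in the lower 15 bits of `value`.
inline uint16_t applySetClr(uint16_t reg, uint16_t value) {
    const uint16_t mask = value & 0x7FFF;
    return (value & 0x8000) ? static_cast<uint16_t>(reg | mask)
                            : static_cast<uint16_t>(reg & ~mask);
}
```

A pokeDMACON implementation would probably start with exactly this computation and then react to the bits that changed.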

dirkwhoffmann commented 5 years ago

Now I can step through until I reach the Expansion RAM Checker at

FC061A  move.l    A0,A4

Good to have a commented Kickstart around:

    ; $C00000 Expansion RAM Checker
    ; -----------------------------

    ; The following routine checks for the presence of memory
    ; in the $C00000 - $DC0000 area.  This is a nontrivial exercise,
    ; since if there is no memory there, we see images of the custom
    ; chip registers there instead, due to incomplete address decoding.

Incomplete address decoding 😳. Never heard about it 🤭. If I'm right, UAE handles unmapped memory via the "dummy" bank handlers ... Let's see what they do there ...

dirkwhoffmann commented 5 years ago

I've found a Verilog reimplementation of Gary here (Amiga FPGA project):

https://github.com/rkrajnc/minimig-de1/blob/master/rtl/minimig/Gary.v

According to the line

assign sel_reg = cpu_address_in[23:21]==3'b110 ? ~(sel_xram | sel_rtc | sel_ide | sel_gayle) : 1'b0;

"Incomplete address decoding" means that Gary selects the custom registers if the upper three address bits match (and none of the other devices is selected). I've adapted this and the new memory mapping now looks like this (A500 with 512 KB slow mem, some fast mem and an RTC attached):

dirkwhoffmann commented 5 years ago

Now the emulator detects correctly if a Chip Ram extension is present (Slow Ram starting at memory bank C0). If memory is found, it is initialised with zeroes 🥳.
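The selection rule taken from the Verilog line boils down to a comparison of three address bits. Here is a minimal sketch of it; the function name and the single otherDeviceSelected flag (standing in for sel_xram, sel_rtc, sel_ide and sel_gayle) are simplifications of mine.

```cpp
#include <cassert>
#include <cstdint>

// Custom registers are selected whenever address bits 23..21 are 110 and
// no other device claims the address. Without slow RAM, the whole range
// $C00000 - $DFFFFF therefore shows images of the custom registers.
inline bool selectsCustomRegs(uint32_t addr, bool otherDeviceSelected) {
    const bool upperBitsMatch = ((addr >> 21) & 0x7) == 0x6;
    return upperBitsMatch && !otherDeviceSelected;
}
```

With slow RAM present at $C00000, the flag is true for that address and the RAM wins. Without it, a read from $C00000 ends up in the custom chip area, which is what the Kickstart routine has to detect.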

Next step will be:

  ; Having figured out the end address of expansion memory (in A4),
    ; and the value to use for ExecBase (in A6), we now check how much
    ; chip memory we have.  Any memory in the first 2 megabytes of
    ; address space is considered to be chip memory.  Less than 256K
    ; of chip memory is considered a fatal error.

FC0208  lea       0,A0              Start looking at location 0.
FC020C  lea       200000,A1         Don't look past 2 megabytes.
FC0212  lea       FC021A(PC),A5     Set the return address.
FC0216  bra       FC0592            Go check the memory.
FC021A  cmp.l     #$040000,A3       Do we have at least 256K of chip memory?
FC0220  bcs.s     FC0238            Bomb if not.

Unfortunately, the emulator tends to beach ball if the inspector is open while the main window is in the background. This is a Mac-related problem and due to the fact that I'm not really familiar with handling auxiliary windows in OS X. I might look into this first before I continue here ...

dirkwhoffmann commented 5 years ago

After removing a stupid bug in Memory::poke32(), Kickstart has decided that the machine has 256 KB of Chip Ram (which is good, because it has, well, 256 KB of Chip Ram). Then, it recognised that the CPU is a 68000 (by ruling out that it is a 68010 etc.). After that, a lot of memory init stuff is done (setting up exec jump tables etc.). This all looks good (as far as I can judge this at the moment), so I'm finally here:

   ; A historic moment:  We turn the supervisor mode flag off.

FC04BE  and.w     #0,SR             Turn the supervisor bit off.

Wow, a historic moment 🤭. I am a bit scared of what will happen. OK, let's be brave and press the Step button again 😬:

Woohoo, supervisor flag is cleared 😀. After experiencing this "historic moment", I need a break. Stepping through Kickstart is exhausting ...

Just noticed that the "Data" column is wrong in the CPU panel. Need to fix this first ...

mithrendal commented 5 years ago

That's a completely epic moment for all of us 🖖 ! 🤗

That is the documented Kickstart exec by Markus Wandel, isn't it? He wrote those comments on February 3, 1989, so that's clearly a historic moment in February 2019. 🙃

From this point on, we are leaving supervisor mode and running in 68K user mode... certain 68k commands like "stop" or "reset" do not work from this moment on... that makes sense because AmigaOS is a multitasking system ...

dirkwhoffmann commented 5 years ago

I continued my journey through Kickstart. Unfortunately, I wasn't aware of the fact that Markus Wandel "only" documented exec, so at some point, the unavoidable happened: I left exec and entered the undocumented area. This means I'm in outer space now and completely left on my own 👽😬. I kept on stepping and at some point in time, the emulator started to poke values into the Copper registers. So it was about time to work on that.

To make a long story short: I don't have a working Copper yet, but I do have a Copper disassembler 😎. Along the way, I've also invented a new software development approach which I'm gonna call "inverse prioritising". Its core idea is to postpone the most important things as much as possible. I'm so proud of this method that I'm considering publishing a book about it. The only thing that puzzles me is that nobody else had this brilliant idea before 🤔.

Anyways, at the moment, the Copper disassembler looks like this:

While working on the disassembler, I was looking for some standard Copper assembly notation, but I didn't find any. I therefore invented my own. If it turns out that there is some kind of established notation (which I am not aware of, because I spent so much time on the C64 that the Amiga is brand new technology for me), I can easily change that.
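For readers who want to follow along, here is a minimal sketch of how the three Copper instruction types can be told apart: bit 0 of the first word separates MOVE from WAIT / SKIP, and bit 0 of the second word separates WAIT from SKIP. The output notation is just one possible choice and not necessarily the one used in the screenshot; copperOpcode and disassembleCopper are hypothetical names, and the mask bits of the second word are ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

enum class CopOp { Move, Wait, Skip };

inline CopOp copperOpcode(uint16_t ir1, uint16_t ir2) {
    if ((ir1 & 1) == 0) return CopOp::Move;
    return (ir2 & 1) ? CopOp::Skip : CopOp::Wait;
}

inline std::string disassembleCopper(uint16_t ir1, uint16_t ir2) {
    char buf[48] = "";
    const unsigned vpos = ir1 >> 8;    // vertical beam position
    const unsigned hpos = ir1 & 0xFE;  // horizontal beam position
    switch (copperOpcode(ir1, ir2)) {
    case CopOp::Move: // second word = value, first word = register offset
        std::snprintf(buf, sizeof(buf), "MOVE $%04X, $%03X",
                      static_cast<unsigned>(ir2),
                      static_cast<unsigned>(ir1 & 0x1FE));
        break;
    case CopOp::Wait:
        std::snprintf(buf, sizeof(buf), "WAIT ($%02X,$%02X)", vpos, hpos);
        break;
    case CopOp::Skip:
        std::snprintf(buf, sizeof(buf), "SKIP ($%02X,$%02X)", vpos, hpos);
        break;
    }
    return buf;
}
```

For instance, the word pair $FFFF $FFFE comes out as WAIT ($FF,$FE), and $008A $0000 as a MOVE to register offset $08A.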

dirkwhoffmann commented 5 years ago

Although I’m still in the design phase, I have made considerable progress: important design decisions are starting to emerge. The first major decision was to move from a mixed event/polling-based design to a truly event-based design.

A major part of the emulator is the DMA controller. The heart of the DMA controller is the event scheduler which consists of several event slots. From a theoretical point of view, each event slot is a single state machine with timed transitions. Right now, there are 6 slots (meaning we have 6 state machines running in parallel):

Slot 1: CIA A
Slot 2: CIA B
Slot 3: Disk, Audio, Sprite, and Bitplane DMA
Slot 4: Copper
Slot 5: Blitter
Slot 6: Rasterline (HSYNC events)

To give an example, let’s look at slot 3. In each HSYNC event, a slot-3 event is scheduled that triggers at the first horizontal beam position where DMA happens. If Disk DMA is enabled, this will be position 7. Once the event is served, the next DMA event is scheduled. Although this sounds simple to implement, it is not. The challenge here is to find out when the next DMA event happens for a given hpos. This depends on a lot of factors (DMA enable bits, lores / hires mode, vblank area etc.). To implement this efficiently, I decided to use a precomputed DMA event table. Whenever one of the influencing factors changes (e.g., the vblank area is entered), a DMA time slot allocation table is computed which resembles Fig. 6.9 in the Hardware Reference Manual.

Let’s test this out with the current prototype. If we enable Disk DMA, Sprite DMA, Audio DMA for channels 1 and 2 and Bitplane DMA in the DMA inspector panel, the event table looks like this (Denise has three bitplanes enabled and runs in lowres mode):

00000000000000001111111111111111222222222222222233333333333333334444444444444444
0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF
.......D.D.D...A.A...S.S.S.S.S.S.S.S.S.S.S.S.S.S.S.S.......L.L.L...L.L.L...L.L.L
.......I.I.I...1.2...0.0.1.1.2.2.3.3.4.4.5.5.6.6.7.7.......2.3.1...2.3.1...2.3.1

To speed things up, the emulator computes a jump table in addition to the event table. For each hpos value, the jump table indicates the hpos where the next event happens. E.g., at position 0x33, the jump table contains the value 0x3B, which is the position of an L2 event (lowres bitplane 2 fetch).
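A jump table like this can be derived from the event table with a single backward scan. The following is a minimal sketch under a few assumptions of mine: the event table is given as a string in the notation shown above ('.' means no event), only the 0x50 positions shown are covered, and 0 is used as the "no further event in this line" marker.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

constexpr int HPOS_CNT = 0x50; // only the positions shown in the table above

// For every hpos, stores the position of the next event strictly after it.
inline std::array<uint8_t, HPOS_CNT> buildJumpTable(const std::string &events) {
    assert(events.size() == static_cast<std::size_t>(HPOS_CNT));
    std::array<uint8_t, HPOS_CNT> next{};
    uint8_t upcoming = 0; // 0 = nothing left in this line
    for (int h = HPOS_CNT - 1; h >= 0; h--) {
        next[h] = upcoming;
        if (events[h] != '.') upcoming = static_cast<uint8_t>(h);
    }
    return next;
}
```

Because the table only has to be rebuilt when one of the influencing factors changes, serving an event is reduced to a single array lookup.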

I hope I haven't overlooked any theoretical flaws in this design.

dirkwhoffmann commented 5 years ago

Now I’m at a point where the Copper list has been initialized. To examine the list, open the Copper inspector panel:

The first list makes sense to me. The Copper is programmed to restore the initial values of the bitplane pointer and the sprite pointer registers. After that, it waits until a certain beam position is reached and jumps to the second Copper list. The second list does not seem to be initialised yet, because the commands make no sense. Commands in red indicate an illegal command. Illegal commands are MOVE commands accessing custom registers Copper has no access to.

It’ll be interesting to see how Copper executes these commands. To see it, I would have to step to the point where Copper DMA is switched on. Unfortunately, I cannot step there yet, because the emulator stops at FCADBA with an error message:

Memory: WARNING: pokeCustom16(DFF040 [BLTCON0], 0): MISSING IMPLEMENTATION
Amiga: Pause

The Kickstart code writes into one of the Blitter registers, so it seems the right time to start working on this component ...

dirkwhoffmann commented 5 years ago

I’ve read through the Hardware Reference Manual. Bottom line is that the Blitter is easy from a functional point of view, but difficult if exact timing is taken into account. As stated in the HRM, we have to deal with varying time slice patterns (depending on the enabled DMA channels):

I have reviewed a couple of existing Blitter implementations (with UAE the most cryptic one again) and I came to the conclusion that I want to try something new here. I’m going to control my virtual Blitter via emulated micro instructions.

More precisely: When the BLTSIZE register is written to (which starts a blit), the emulator will analyse the current DMA configuration and set up a micro instruction list. After that, the event scheduler will be programmed to trigger Blitter events and each event will then execute a single micro instruction.

Here is an example instruction list for the first Blitter configuration in Table 6-2.

   case 0b1111: { // A0 B0 C0 -- A1 B1 C1 D0 A2 B2 C2 D1 D2

        uint16_t prog[] = {

            FETCH_A,
            FETCH_B | HOLD_A,
            FETCH_C | HOLD_B,
            HOLD_D,

            FETCH_A,
            FETCH_B | HOLD_A,
            FETCH_C | HOLD_B,
            WRITE_D | HOLD_D | LOOPBACK3,

            WRITE_D
        };
        memcpy(microInstr, prog, sizeof(prog));
        break;
    }

The micro instructions allow me to emulate the data flow in the real Blitter quite accurately (The Blitter is designed in form of a traditional pipeline with “hold” register forming an intermediate pipeline stage).

Although this approach sounds promising to me (because of its flexibility), I’m totally unsure if this is the right way to go. Time will tell…

dirkwhoffmann commented 5 years ago

Now, I'm at

FCADC2: move.w #$41, ($58,A4)

This writes a value into BLTSIZE which means that a Blitter operation is about to come 😬.

Let's step over it ...

FATAL ERROR: Unimplemented Blitter configuration

Kickstart is starting the Blitter with the BLTCON = 0 config in Table 6-2. OK, there is no micro code for that config yet... I thought it's stupid to use this configuration and now it's the first one being used 🙈.

dirkwhoffmann commented 5 years ago

OK, now the Blitter has some microcode for its most meaningless mode. I also tweaked the debug output a little, so it's easier to see what's going on internally:

              Master cycles     CPU cycles    DMA cycles    CIA cycles
 Master clock:      59359184       14839796       7419898       1483979
    DMA clock:      59359184       14839796       7419898       1483979
  Frame clock:      59303296       14825824       7412912       1482582
  CIA A clock:           360             90            45             9
  CIA B clock:           360             90            45             9
  Color clock: (61,176) hex: ($3D,$B0) Frame: 208

  CIA A: Event: CIA_WAKEUP      Trigger: disabled
  CIA B: Event: CIA_WAKEUP      Trigger: disabled
    DMA: Event: DMA_DISK        Trigger: 59399616 (5054 DMA cycles away)
 Copper: Event: none            Trigger: disabled
Blitter: Event: BLT_EXECUTE     Trigger: 59359128 (-7 DMA cycles away) 
 Raster: Event: RAS_HSYNC       Trigger: 59359584 (50 DMA cycles away)

Woohoo, for the first time there is a pending event in the Blitter slot 🥳. But wait ... it has been overdue for 7 cycles ... this should never happen 😖.
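The numbers in the debug output are all derived from the master clock. Judging from the table, one CPU cycle equals 4 master cycles, one DMA cycle equals 8 and one CIA cycle equals 40. These ratios are read off the output above, and the helper names are made up for this sketch:

```cpp
#include <cassert>
#include <cstdint>

using Cycle = int64_t; // time stamps are measured in master clock cycles

constexpr Cycle MASTER_PER_CPU = 4;
constexpr Cycle MASTER_PER_DMA = 8;
constexpr Cycle MASTER_PER_CIA = 40;

constexpr Cycle asCpuCycles(Cycle master) { return master / MASTER_PER_CPU; }
constexpr Cycle asDmaCycles(Cycle master) { return master / MASTER_PER_DMA; }
constexpr Cycle asCiaCycles(Cycle master) { return master / MASTER_PER_CIA; }

// Signed distance from `now` to `trigger` in DMA cycles.
// A negative result means the event is overdue.
constexpr Cycle dmaCyclesUntil(Cycle trigger, Cycle now) {
    return (trigger - now) / MASTER_PER_DMA;
}
```

With the master clock at 59359184, the Blitter trigger 59359128 lies 56 master cycles in the past, which is where the "-7 DMA cycles away" comes from.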

dirkwhoffmann commented 5 years ago

A brief update about what’s going on in the OCS family.

Finally, Paula got her own interrupt scheduler. It's a rather sophisticated device that allows her to trigger interrupts in certain cycles, e.g. five cycles from now, with little computational overhead. She hasn't really used it so far because there were simply no IRQ requests. I told her to be patient a little while longer as this is going to change soon.

Denise became jealous because she is the component with the fewest lines of code yet. Because her sister got this super cool interrupt scheduler, she now insists on getting the pixel engine I promised her some time ago. I told her we had to debug Copper first, because a pixel engine without a Copper is pretty useless. Somehow I felt she wasn't really listening.

Agnus is quite happy with his event scheduler. He says that planning events is much more fun than polling regularly. Unfortunately, it's still not an easy task for him. He continues to plan events with invalid time stamps and the like, but I am pretty confident that he will improve that over time. He also likes to be the one in charge of the bus. At first he had the idea to exclude his little sisters from the bus and keep all cycles for himself. I tried to convince him that this was not possible. We have to follow the rules, namely the DMA time slot allocation as stated in the hardware reference manual. I'm not sure he really understood what I meant, because he still does strange things from time to time.

Besides my struggle with the OCS chips, I continued stepping through kickstart to a point where the real trouble begins 😬:

As you can see, Kickstart has enabled all kinds of DMA now. The rest of the story can be told rather quickly. When Copper saw his DMA flag set, he ran off like crazy 🤪, scheduled some weird events and crashed the whole thing 🙈. Well, as I said above: I need to debug Copper first 😖.

dirkwhoffmann commented 5 years ago

After fixing some bugs, it’s time to give Copper another chance. The fun starts when Copper's DMA flag is set:

Copper: (0,5): COP_REQUEST_DMA
Copper: (0,7): COP_MOVE: coppc = 422 copins2 = 0
DMAController: pokeBPL0PTH(0)

Looks good so far ... the first MOVE command has been executed 😎.

Copper: (0,79): COP_WAIT_OR_SKIP: coppc = 46A copins2 = FFFE

That’s the

WAIT* ($0C,$00)

command. Let’s check what kind of effect that had ...

Primary event table:
Slot: Copper     Event: COP_FETCH       Trigger: 100061600 (2645 DMA cycles away)

Good news here, Copper went idle. Now the question is if he's going to wake up exactly at the specified beam position 🤔…

Copper: (12,0): COP_FETCH: coppc = 46C copins1 = 8A

😃 Yeah, it continues at (12,0).
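The "2645 DMA cycles away" in the event table can be cross-checked by hand. The sketch below assumes 227 DMA cycles per rasterline, a value I inferred from the numbers in the log rather than took from the emulator source; Beam and dmaCyclesBetween are hypothetical names.

```cpp
#include <cassert>

constexpr int DMA_CYCLES_PER_LINE = 227; // assumption, see text

struct Beam {
    int v; // vertical position (rasterline)
    int h; // horizontal position (DMA cycle within the line)
};

// Number of DMA cycles it takes the beam to travel from `from` to `to`
// within the same frame.
constexpr int dmaCyclesBetween(Beam from, Beam to) {
    return (to.v - from.v) * DMA_CYCLES_PER_LINE + (to.h - from.h);
}
```

From (0,79), where the WAIT was processed, to the wake-up position (12,0) it is 12 * 227 - 79 = 2645 cycles, which matches the trigger distance shown above.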

The next command is the MOVE command writing into the Strobe register. This is going to redirect us to the second Copper list.

Copper: (12,2): COP_MOVE: coppc = 46E copins2 = 0. 
Copper: pokeCOPJMP2

The second list consists of a single command. It’s a WAIT statement that never triggers and therefore disables the Copper.

WAIT* ($FF,$FE)

Let's keep our fingers crossed 🤞...

Copper: (12,4): COP_FETCH: coppc = 474 copins1 = FFFF
Copper: (12,6): COP_WAIT_OR_SKIP: coppc = 476 copins2 = FFFE

So, let’s check our event list. The Copper slot should be disabled by now:

Slot: Copper     Event: WAIT_OR_SKIP    Trigger: never

Pretty cool 🥳. Copper successfully processed his first list.

dirkwhoffmann commented 5 years ago

Time for a brief update. After providing each custom chip with some basic functionality (Agnus schedules events, Denise does DMA, Paula triggers interrupts), the OCS kids seem to be happy with what they have (except Denise, who is still angry because she didn’t get a pixel engine yet). The problem is that the custom chips run in an endless loop now. I expected them to draw the hand & disk picture eventually, but they don't seem to care about what I want 🙁.

Because endless loops are hard to debug, I decided to work on some missing stuff with the hope that one of these things is the cause of the infinite loop.

One of these things is drive identification. Hence, my current goal is to let the internal drives identify themselves correctly as 3.5” DD drives. Fortunately, the identification happens in documented Kickstart land:

; DRT_AMIGA      EQU $00000000     ; standard 3.5" DD Amiga disk
; DRT_37422D2S   EQU $55555555     ; 5.25"
; DRT_150RPM     EQU $AAAAAAAA     ; 3.5" HD drive with HD disk in
; DRT_EMPTY      EQU $FFFFFFFF     ; empty drive

   ; note that values returned by drive are negation of what is saved
   ; by disk.resource

FC48F4 move.l    #0,$30(A2)        ; we always have df0 of Amiga type
FC48FC moveq     #2,D2
FC48FE move.b    #$10,D3           ; SEL1
FC4902 lea       $34(A2),A3

FC4906 bsr.l     $FC491C           ; check the driveid for the other 3 units
FC490A lsl.b     #1,D3             ; next unit
FC490C dbra      D2,$FC4906
FC4910 move.l    A2,A1
FC4912 jsr       -$01E6(A6)        ; AddResource
FC4916 movem.l   (SP)+,D2/D3/A2/A3/A6
FC491A rts

FC491C not.b     D3                ; inverts select bit
FC491E lea       $BFD100,A0        ; CIA-B prb
FC4924 move.b    #$7F,D0           ; prepares motor on
FC4928 move.b    D0,(A0)
FC492A and.b     D3,D0             ; motor on for selected drive
FC492C move.b    D0,(A0)
FC492E move.b    #$FF,(A0)         ; deselect drive
FC4932 move.b    D3,(A0)           ; motor off
FC4934 move.b    #$FF,(A0)         ; deselect drive - this resets drive shift id port
FC4938 moveq     #$1F,D1           ; loop 32x
FC493A moveq     #0,D0
FC493C lsl.l     #1,D0
FC493E move.b    D3,(A0)           ; select drive
FC4940 btst      #5,$BFE001        ; check the /RDY bit
FC4948 beq.s     $FC494E
FC494A bset      #0,D0             ; if not set particular bit
FC494E move.b    #$FF,(A0)         ; deselect drive
FC4952 dbra      D1,$FC493C        ; process all 32 bits
FC4956 move.l    D0,(A3)+          ; store result in DR_UNITID
FC4958 not.b     D3                ; turn back select bit
FC495A rts

This is also the place where we could tell Kickstart we had an HD drive 😎. $AAAAAAAA is the secret passphrase.
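The serial protocol in the listing can be modelled with a shift register on the drive side and a 32-iteration loop on the host side. This is a deliberately simplified sketch: DriveIdShifter and readDriveId are hypothetical names, the select / deselect and motor handshake is collapsed into reset() and nextBit(), and the active-low polarity of the /RDY line (and the negation mentioned in the comment above) is ignored.

```cpp
#include <cassert>
#include <cstdint>

// Drive side: a 32-bit shift register that is reloaded when the ID port is
// reset and presents one bit per select pulse, most significant bit first.
class DriveIdShifter {
public:
    explicit DriveIdShifter(uint32_t driveId) : id(driveId), shift(driveId) {}

    void reset() { shift = id; }

    bool nextBit() {
        const bool bit = (shift & 0x80000000u) != 0;
        shift <<= 1;
        return bit;
    }

private:
    uint32_t id;
    uint32_t shift;
};

// Host side, mirroring the loop at FC4938 - FC4956: 32 iterations, each one
// shifts D0 left by one and sets bit 0 depending on the sampled line.
inline uint32_t readDriveId(DriveIdShifter &drive) {
    drive.reset();
    uint32_t d0 = 0;
    for (int i = 0; i < 32; i++) {
        d0 <<= 1;
        if (drive.nextBit()) d0 |= 1;
    }
    return d0;
}
```

In this simplified model, a drive loaded with $AAAAAAAA would be read back as $AAAAAAAA by the host loop.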

dirkwhoffmann commented 5 years ago

OK, I can now transfer any 32-bit drive identification key over the RDY line serially. This is nice, but pretty useless at the moment. Why? Because Kickstart knows that df0 is always a standard drive and therefore skips the serial transmission step for it:

FC48F4 move.l    #0,$30(A2)        ; we always have df0 of Amiga type

The mechanism only becomes important, when external drives come into play.

Seems like I have to come up with another idea to tackle my infinite loop problem 🤔.

dirkwhoffmann commented 5 years ago

Some news about the hand & disk screen hunt.

My goal is to reach memory location FC570E. This is where the BLTSIZE register is written to with non-trivial values and the emulator is supposed to blow up there (it’s supposed to blow up, because there is no Blitter micro code for non-trivial blits, but that’s another story and has been done on purpose).

By stepping back manually through the Omega CPU trace log, I was able to identify the following memory location sequence. This is the result:

fe89b0 : Bitplane DMA switched off
fe89c6
fe89da
fe8a06
fe8b88
fe8bbe
fe8c3e
fe8c46
fe8c70
fe8cca
fe8cd0
fe8cd6
fe8ce4
fe8ce8
fc55c8
fc570e : Bltsize is written to

Between those addresses, a lot of sub routine stuff is going on.

The good news is that vAmiga already reaches FE89B0. This is where Bitplane DMA is switched off. Hence, it remains to check where in this sequence vAmiga gets lost.

mithrendal commented 5 years ago

I am just doing the same back stepping in omega

spotted first blitsize at

fc5654: lea     $dff000.l, A0
...
fc570e: move.w  D0, ($58,A0)  <--- first write to blit size at fc570e

which corresponds perfectly to the address Dirk spotted

a short window out of Omega's full instruction log, shortly before the blitter action, follows here

...
fe8d68: bra     fe8cf8
fe8cf8: moveq   #$0, D3
...
fe8d62: movea.l A3, A1
fe8d64: jsr     (-$f6,A6)    <-- CPU jsr to $2108 that means a6=$2108+$f6

2108: jmp     $fc55c8.l  <------  PC is at address $2108 and CPU jumps to $fc55c8 (no jsr!)

fc55c8: tst.l   (A1)
fc55ca: bne     fc6834
fc55ce: movem.l D2-D7/A2-A3, -(A7)
fc55d2: movem.w ($24,A1), D2-D3
fc55d8: movem.w D0-D1, ($24,A1)
...
fc5654: lea     $dff000.l, A0       <--- CPU loads custom chip register base into A0, so expect something impressive to go on here
fc565a: movea.w D2, A3
fc565c: move.w  ($22,A1), D6
fc5660: addq.w  #1, ($aa,A6)
fc5664: beq     fc566a
fc566a: btst    #$6, ($2,A0)
fc5670: btst    #$6, ($2,A0)
fc5676: beq     fc567c
fc567c: move.w  D1, ($62,A0)
fc5680: move.w  D2, D1
fc5682: sub.w   D0, D1
fc5684: move.w  D1, ($64,A0)
fc5688: moveq   #-$1, D1
fc568a: move.l  D1, ($44,A0)
fc568e: move.w  #$8000, ($74,A0)
fc5694: move.w  (A2), ($60,A0)
fc5698: move.b  ($1f,A1), D1
fc569c: swap    D1
fc569e: clr.w   D1
fc56a0: asr.l   #4, D1
fc56a2: or.w    D1, D7
fc56a4: sub.b   D0, ($1f,A1)
fc56a8: move.w  D7, D5
fc56aa: addq.w  #1, D0
fc56ac: asl.w   #6, D0
fc56ae: addq.w  #2, D0
fc56b0: move.w  D4, D2
fc56b2: swap    D4
fc56b4: asr.l   #4, D4
fc56b6: ori.w   #$b00, D4
fc56ba: clr.w   D1
fc56bc: bclr    #$0, ($21,A1)
fc56c2: bne     fc56cc
fc56c4: cmpi.b  #$2, ($1c,A1)
fc56ca: beq     fc572a
fc56cc: move.b  ($5,A2), D2
fc56d0: lea     ($8,A2), A2
fc56d4: move.l  (A2)+, D7
fc56d6: btst    D1, ($18,A1)
fc56da: beq     fc5712
fc56dc: swap    D5
fc56de: move.w  D4, D5
fc56e0: move.b  ($28,A1,D1.w), D5
fc56e4: swap    D5
fc56e6: add.l   D3, D7
fc56e8: btst    #$6, ($2,A0)
fc56ee: btst    #$6, ($2,A0)
fc56f4: beq     fc56fa
fc56fa: move.l  D5, ($40,A0)
fc56fe: move.w  A3, ($52,A0)
fc5702: move.l  D7, ($48,A0)
fc5706: move.l  D7, ($54,A0)
fc570a: move.w  D6, ($72,A0)
fc570e: move.w  D0, ($58,A0)  <--- first write to blit size at fc570e
fc5712: addq.b  #1, D1
-----> the Omega Blitter draws a line <---

Dirk said vAmiga is currently at this address:

fe89b0: move.w #$100, $dff096.l --> this is Omega's 2462237th instruction since start

vAmiga's CPU still has to process 26129 instructions, which is only 1% of all instructions processed so far...

fc570e: move.w D0, ($58,A0) <--- first write to blit size, which is Omega's 2488366th instruction since start

vAmiga has already taken 99% of the route to the hand and disk drawing 😀 ...
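
The 1% figure follows directly from the two instruction counts quoted above. A quick check, with constant names invented for this sketch:

```python
# Instruction counts quoted above (Omega's counters since start).
AT_FE89B0 = 2_462_237       # Omega executes fe89b0, where vAmiga currently is
FIRST_BLTSIZE = 2_488_366   # Omega's first write to blit size at fc570e

remaining = FIRST_BLTSIZE - AT_FE89B0   # instructions still to go
share = remaining / FIRST_BLTSIZE       # fraction of the whole route

print(remaining)
print(f"{share:.2%}")
```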

mithrendal commented 5 years ago

Kickstart v1.2 full instruction trace log up to the hand drawing code (executed by Omega)...

i.e. from the very first instruction until instruction 3129644 where the hand and disk image is drawn...

68k_kick12.log.zip

dirkwhoffmann commented 5 years ago

Here’s the thing. After vAmiga reaches fe89b0, it eventually executes fc0716 (and so does Omega).

The first comparison is false, so it does not branch. This means that the jsr (-$13e,A6) is taken (same in Omega). After returning, it jumps to the comparison statement again. In Omega, the comparison is now true, but in vAmiga it’s still false. The second jsr (-$13e,A6) never returns.

There is more than one function with offset -$13e:

ReadPixel (graphics.library)
UnGetC (dos.library)
Wait (exec.library)

This bug is a nightmare!

mithrendal commented 5 years ago

Dirk, it is exec.library. Look at the content of A6 and, for reference, start SysInfo in Omega to see the library base addresses. From there you will see that $676, the value of A6, is ExecBase.
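
A minimal sketch of why A6 settles the question: `jsr (offset,A6)` lands in the jump table just below the library base, so the same offset means a different function for every base. The 6-byte entry size matches the `jmp $fc55c8.l` seen at $2108. The helper names and the `LIB_BASES` table are made up for illustration; only the exec.library base ($676) and the three functions at -$13e come from this thread.

```python
VECTOR_SIZE = 6  # one "jmp <abs.l>" entry: 2-byte opcode + 4-byte target


def vector_address(lib_base: int, offset: int) -> int:
    """Address reached by `jsr (offset,A6)` when A6 holds lib_base."""
    return (lib_base + offset) & 0xFFFFFF  # the 68000 has a 24-bit address bus


def vector_index(offset: int) -> int:
    """1-based position of the jump table entry below the library base."""
    assert offset < 0 and -offset % VECTOR_SIZE == 0
    return -offset // VECTOR_SIZE


# Functions listed in this thread for offset -$13e, keyed by library.
AT_13E = {
    "exec.library": "Wait",
    "graphics.library": "ReadPixel",
    "dos.library": "UnGetC",
}

# Hypothetical lookup table: only the exec.library entry is known here.
LIB_BASES = {0x676: "exec.library"}


def resolve_13e(a6: int) -> str:
    """Name the function behind `jsr (-$13e,A6)` for a known library base."""
    lib = LIB_BASES.get(a6, "unknown library")
    return f"{lib}: {AT_13E.get(lib, '?')}"


print(hex(vector_address(0x676, -0x13E)), resolve_13e(0x676))
print(hex(vector_address(0x21FE, -0xF6)))  # the jsr at fe8d64, A6 = $2108 + $f6
```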

dirkwhoffmann commented 5 years ago

I dumped the registers after each instruction. Omega calls exec.wait() with:

E fc071e: 4eae fec2           : jsr     (-$13e,A6)
    D0 = 80000000 D1 = 1F D2 = 0 D3 = 0 D4 = 0 D5 = 0 D6 = FFFFFFFF D7 = 0
    A0 = 1916 A1 = 18E6 A2 = 18E6 A3 = FE8B3A A4 = 5B88 A5 = 18BA A6 = 676 A7 = 18B2

Unfortunately, vAmiga's registers are completely different at the first call to exec.wait():

Hence, the real cause of the issue must have happened in one of the trillion lines executed before 😟.
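
Comparing such dumps by eye gets tedious quickly. Below is a rough sketch of how two register dumps could be diffed automatically. The Omega values are the ones shown above; the second dump is invented purely to exercise the code and is not vAmiga's real state. `parse_regs` and `diff_regs` are hypothetical helper names.

```python
import re

# Matches entries such as "D0 = 80000000" or "A6 = 676" (hex values).
REG = re.compile(r"\b([AD][0-7]) = ([0-9A-Fa-f]+)")


def parse_regs(dump: str) -> dict[str, int]:
    """Turn a textual register dump into {register name: value}."""
    return {name: int(value, 16) for name, value in REG.findall(dump)}


def diff_regs(a: dict[str, int], b: dict[str, int]) -> list[str]:
    """Names of all registers whose values differ between two dumps."""
    return sorted(name for name in a if a[name] != b.get(name))


OMEGA_DUMP = """
    D0 = 80000000 D1 = 1F D2 = 0 D3 = 0 D4 = 0 D5 = 0 D6 = FFFFFFFF D7 = 0
    A0 = 1916 A1 = 18E6 A2 = 18E6 A3 = FE8B3A A4 = 5B88 A5 = 18BA A6 = 676 A7 = 18B2
"""

omega = parse_regs(OMEGA_DUMP)

# Made-up second dump (NOT real vAmiga data): two registers changed.
other = {**omega, "D1": 0x0, "A1": 0x1234}

print(diff_regs(omega, other))
```

Running the same comparison after every instruction of both traces would report the first instruction at which the two emulators diverge.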

mithrendal commented 5 years ago

But in your first picture, the parameter d0 = 80000000 in vAmiga had the same value as in Omega.

Look here:

Was it the second call then?

Be cool. ;-) Maybe the state in Omega is also not as correct as it would be on a real Amiga.

dirkwhoffmann commented 5 years ago

Yes, the first picture shows the second call.

Seems like we manoeuvred ourselves into a dead end here. Only 26129 CPU instructions short of the finish line 🙁.

mithrendal commented 5 years ago

yes so close ... 🥺

Why are the values of the data and address registers so different ? Strange...

dirkwhoffmann commented 5 years ago

Time for another update...

As you already know, I have been hunting this little bastard (aka the "hand & disk is not drawn bug") for weeks now...

I was already close to surrendering when things got personal ... something between me and him ...

To make a long story short: Perseverance pays off

Now you want to know who this little bastard is: It's the Blitter busy flag in the DMACON register.

The Blitter clears this flag at the moment it starts to flush the pipeline. In my implementation, flushing is initiated by the LOOPBACK micro command. Unfortunately, the first (strange) blit operation, which has all DMA disabled, has no LOOPBACK command, so the Blitter busy flag never got cleared.
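
To illustrate the failure mode, here is a toy model, not the actual vAmiga code. Apart from LOOPBACK, all micro command names are invented, and the "clear at the end of the program" switch is just one conceivable remedy, not necessarily the fix that went into the emulator. The busy test mirrors the `btst #$6, ($2,A0)` polling in the trace: bit 6 of the byte at $dff002 is bit 14 (Blitter busy) of DMACONR.

```python
BBUSY = 1 << 14  # Blitter busy bit in DMACONR ($dff002)


class ToyBlitter:
    def __init__(self, clear_on_end: bool):
        self.dmaconr = 0
        self.clear_on_end = clear_on_end  # hypothetical safety net

    def run(self, program: list[str]) -> None:
        self.dmaconr |= BBUSY               # the blit starts: busy flag set
        for op in program:
            if op == "LOOPBACK":            # pipeline flush begins here
                self.dmaconr &= ~BBUSY
        if self.clear_on_end:               # also clear when the program ends
            self.dmaconr &= ~BBUSY

    def busy(self) -> bool:
        """What `btst #$6, ($2,A0)` tests: bit 6 of DMACONR's high byte."""
        high_byte = (self.dmaconr >> 8) & 0xFF
        return bool(high_byte & (1 << 6))


# Invented micro programs: a normal blit and one with all DMA disabled.
WITH_DMA = ["FETCH_A", "HOLD_D", "LOOPBACK", "WRITE_D"]
NO_DMA = ["NOP", "NOP"]

buggy = ToyBlitter(clear_on_end=False)
buggy.run(NO_DMA)
print(buggy.busy())   # the CPU's polling loop would spin forever

fixed = ToyBlitter(clear_on_end=True)
fixed.run(NO_DMA)
print(fixed.busy())
```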

dirkwhoffmann commented 5 years ago

Yeah, it's definitely a disk. The emulator is ready, let's ship it 😂.

mithrendal commented 5 years ago

Wow, it is beautiful. The Blitter inside Agnus has done this, right? Look at the cleanly drawn edges of the floppy disk. Look at the colors. That is inspiring... Green, red and blue... Ooh no, the colors are wrong, the OCS Kids used the wrong colors ... and why did they suddenly stop drawing? Looks like they are quarrelling again...?

dirkwhoffmann commented 5 years ago

The strange colours are my fault. Because Denise just started her drawing lessons, I decided to withhold the original palette from her. For practicing, I gave her only four basic pencils: a black one, a red one, a green one, and a blue one.

However, I need to have a serious word with Copper. In the middle of each frame, he constantly takes away all her pencils. So mean ☹️.

dirkwhoffmann commented 5 years ago

This is vAmiga WE (Warhol Edition):

Interestingly, all text items are broken (most likely some Blitter issues).

dirkwhoffmann commented 5 years ago

Denise just figured out the fake pen thing. Before she goes mad, I better give her the real ones 😬:

dirkwhoffmann commented 5 years ago

😎

I have to admit that I cheated a little bit. I've shamelessly copied over the line Blitter stuff from the Omega emulator 🙄. The copy Blitter stuff is original vAmiga though.

The next step will be to get the texture dimensions right. The emulator is still using the original texture drawing stuff from VirtualC64.

dirkwhoffmann commented 5 years ago

A brief update:

Firstly, the screen buffer size has been changed to 768 x 288. Secondly, the bitplane DMA has been decoupled from the drawing code. There are separate events for bitplane DMA and pixel synthesis now. This makes the design very flexible, although the exact timing is still wrong for sure.
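
A rough sketch of what "separate events" can look like: a cycle-ordered queue in which bitplane DMA and pixel synthesis are scheduled independently. The class, the event names and the cycle numbers are all invented for illustration and say nothing about the real timing.

```python
import heapq


class Scheduler:
    """Processes events in order of their trigger cycle."""

    def __init__(self):
        self._queue = []   # heap of (cycle, sequence number, event kind)
        self._seq = 0      # tie-breaker keeps insertion order stable
        self.log = []      # (cycle, kind) of every event processed so far

    def schedule(self, cycle: int, kind: str) -> None:
        heapq.heappush(self._queue, (cycle, self._seq, kind))
        self._seq += 1

    def run_until(self, cycle: int) -> None:
        """Process all events that are due at or before the given cycle."""
        while self._queue and self._queue[0][0] <= cycle:
            due, _, kind = heapq.heappop(self._queue)
            self.log.append((due, kind))


sched = Scheduler()
sched.schedule(16, "BPL_DMA")      # fetch the next bitplane word
sched.schedule(12, "PIXEL_SYNTH")  # turn already fetched data into pixels
sched.schedule(8, "BPL_DMA")

sched.run_until(12)
print(sched.log)
```

Because the two event types share nothing but the queue, the timing of either one can be corrected later without touching the other.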

I've done a brief comparison of screen geometries:

Left is Omega, middle is vAmiga, right is a UAE clone (presumably PAL).

Don't get confused with the vAmiga picture. For debugging, the emulator is currently displaying the whole 1024 x 512 GPU texture. The blue area is unused texture area. The orange area contains a debug pattern (yellow and red stripes). This area is writable by the emulator, but hasn't been written to.

As you can see, Omega has a smaller lower border which is most likely due to NTSC emulation. The vAmiga geometry (PAL) looks roughly the same as the picture to the right, so I think I am on the right track...

dirkwhoffmann commented 5 years ago

The first draft of the GPU pipeline architecture has been completed and implemented. Details are here:

https://github.com/dirkwhoffmann/vAmiga/wiki/GPU

Using the new pipeline, the current output looks like this:

I've also managed to port the 2x upscaler from VirtualC64 to vAmiga. Using 2x upscaling, the picture is indeed a lot smoother:

I don't plan to support 4x upscaling at the moment (as in VirtualC64), because it would require a very large internal texture size of 4096 x 4096. For the C64, 2048 x 2048 was sufficient.
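
The texture sizes mentioned so far can be reproduced with simple arithmetic. The sketch below assumes that each texture dimension is padded to the next power of two and that upscaled textures are square. Both rules are assumptions that happen to match the numbers in this thread, not documented behaviour; the function names are made up.

```python
def next_pow2(n: int) -> int:
    """Smallest power of two that is >= n."""
    p = 1
    while p < n:
        p *= 2
    return p


def texture_size(width: int, height: int, scale: int = 1, square: bool = False):
    """Texture dimensions needed for a (width x height) buffer scaled up."""
    w = next_pow2(width * scale)
    h = next_pow2(height * scale)
    if square:                      # assumed: upscaled textures are square
        w = h = max(w, h)
    return w, h


print(texture_size(768, 288))                        # emulator texture
print(texture_size(768, 288, scale=2, square=True))  # 2x upscaled
print(texture_size(768, 288, scale=4, square=True))  # 4x upscaled
```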

There is still a long way to go to V0.1, because the whole thing is still pretty unstable.

dirkwhoffmann commented 5 years ago

I just reworked the graphics pipeline (because the 2x upscaler had a bug) and noticed that 4096 x 4096 textures don't seem to be an issue for modern GPUs. Hence, 4x upscaling will be supported. Here is the result:

Original Amiga texture:

2x upscaling (EPX algorithm):

4x upscaling (xBr algorithm):

dirkwhoffmann commented 5 years ago

Before I continue, I need to enrich the emulator with more debugging capabilities. Pending events can now be watched in the new "Events" inspector panel:

dirkwhoffmann commented 5 years ago

Hmmm, when enabling all graphics effects (i.e., Gaussian blur), GPU performance on my (not so old) MacBook Pro goes down to 40 fps. Seems like a final texture size of 4096 x 4096 stresses the GPU too much. Maybe it's better to go with a final texture size of 2048 x 2048 (which requires the 4x upscaler to be removed 😢).

dirkwhoffmann commented 5 years ago

Now that I have thought about it a little longer, we can still achieve 4x upscaling with a 2048 x 2048 texture, at least in lores mode. In lores mode, each pixel has size 4 x 4, so we can apply an upscaling algorithm "inside" the original texture. In hires mode, we can upscale at least vertically, because the even and odd lines are the same. The only mode that can only be upscaled 2x is hires+interlaced, but this mode is rarely used anyway.
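
For reference, this is the kind of rule such an in-texture upscaler would apply. The sketch is a plain CPU version of the commonly published EPX/Scale2x rule on a small grid of colour indices, with neighbours clamped at the image border (a choice made here for brevity). It is not the GPU shader used in the emulator.

```python
def epx2x(img):
    """Upscale a 2D grid of colour values by a factor of 2 (EPX/Scale2x rule)."""
    h, w = len(img), len(img[0])
    out = [[0] * (2 * w) for _ in range(2 * h)]

    def at(y, x):
        # Clamp coordinates so border pixels use themselves as neighbours.
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    for y in range(h):
        for x in range(w):
            p = img[y][x]
            a, b = at(y - 1, x), at(y, x + 1)   # up, right
            c, d = at(y, x - 1), at(y + 1, x)   # left, down
            # Each output corner copies a neighbour only if the two adjacent
            # neighbours agree and the opposite ones do not.
            e1 = a if c == a and c != d and a != b else p   # top left
            e2 = b if a == b and a != c and b != d else p   # top right
            e3 = c if d == c and d != b and c != a else p   # bottom left
            e4 = d if b == d and b != a and d != c else p   # bottom right
            out[2 * y][2 * x], out[2 * y][2 * x + 1] = e1, e2
            out[2 * y + 1][2 * x], out[2 * y + 1][2 * x + 1] = e3, e4
    return out


# A single dark pixel in the corner of a bright area: its inner corner
# gets rounded off instead of staying a hard 2 x 2 block.
for row in epx2x([[0, 1],
                  [1, 1]]):
    print(row)
```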

dirkwhoffmann commented 5 years ago

Back in the game at 60 fps. 2x upscaling, Gaussian blur, Trinitron dot mask + electron beam misalignment 😎:

Now, as EPX and xBr both do 2x scaling, they can be compared directly (first = original, second = EPX upscaled, third = xBr upscaled)

Hmm, when looking up close at the xBr image, it looks like there is a bug in the xBr implementation. The line contains strange jagged edges. If this is a bug, it's also present in VirtualC64 🤔.

dirkwhoffmann commented 5 years ago

It's a bug. I've just converted a JavaScript xBr implementation to Metal to compare the results. The upper picture shows how it is supposed to look, and the lower picture is the current GPU implementation.

I am going to investigate this first, because it also affects VirtualC64. (I cannot simply replace the old implementation, because the JavaScript port is not GPU optimised and thus comparatively slow.)

dirkwhoffmann commented 5 years ago

I've experimented a little with a two-phase upscaling pipeline. The first upscaler works "inside" the emulator texture to enhance lores images. The second upscaler is the already implemented one. The result (EPX in-texture upscaling + xBr) looks pretty promising:

The problem here is that picture quality decreases in hires mode, so we have to take care of usability. Maybe I can enhance the upscaler to detect lores fragments automatically and only apply the first upscaling step to the lores parts.
