Custom CPU implementation

dirkwhoffmann commented 4 years ago

Many of the recently reported bugs seem to be related to bus and interrupt timing. To improve the situation, I favour the idea of integrating a custom CPU implementation into vAmiga. To get this project done in a decent time frame, I will take a reference implementation approach based on two already existing cores: Musashi and portable68000. These cores are going to serve as my functional reference and temporal reference, respectively.

This is my roadmap:

Task 1: Write a CPU which is functionally equivalent to portable68000.
Task 2: Add a disassembler (portable68000 has none).
Task 3: Add cycle counting.
Task 4: Integrate the new core in vAmiga.

Task 4 will require some smart recording logic, because I cannot simply run both cores in a row (the first CPU will alter memory and cause side effects). To cope with that, the second core must run in a fake environment that intercepts all memory calls and compares them to what the first CPU did.

These are my corresponding milestones:

Milestone 1: Pass all unit tests of portable68000 functionally.
Milestone 2: Match Musashi’s disassembler output.
Milestone 3: Pass all unit tests of portable68000 temporally.
Milestone 4: Run the new core side by side with Musashi with matching output for each and every executed command.

Once all four milestones have been reached, the new core can take over and will hopefully bring vAmiga to the next level.

Milestones reached so far: None 🤭
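The recording logic mentioned for Task 4 could be sketched roughly as follows. This is only an illustration under my own assumptions: the names `Access`, `Sandbox`, `recordAccess` and `verifyAccess` are hypothetical and not taken from vAmiga, and cycle stamps are left out to keep the sketch short.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Kinds of memory accesses the sandbox records (illustrative names).
enum class AccessType { Peek8, Peek16, Poke8, Poke16 };

struct Access {
    AccessType type;
    uint32_t   addr;
    uint16_t   value;

    bool operator==(const Access &other) const {
        return type == other.type && addr == other.addr && value == other.value;
    }
};

// Hypothetical sandbox: records what the first core did, then checks the
// accesses of the second core against that record, one by one.
class Sandbox {
    std::vector<Access> record;   // accesses performed by the first core
    std::size_t next = 0;         // index of the next access to verify
    bool mismatch = false;

public:
    // Called while the first core runs on the real memory.
    void recordAccess(const Access &a) { record.push_back(a); }

    // Called while the second core runs; nothing touches real memory here.
    // Returns false as soon as an access deviates from the record.
    bool verifyAccess(const Access &a) {
        if (next >= record.size() || !(record[next] == a)) mismatch = true;
        next++;
        return !mismatch;
    }

    // True if every recorded access was replayed and none deviated.
    bool allMatched() const { return !mismatch && next == record.size(); }
};
```

The first core would call `recordAccess` from its memory interface, and the second core would call `verifyAccess` instead of touching memory, so side effects happen only once.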

dirkwhoffmann commented 4 years ago

Here is something puzzling:

When executing ori.b #$0, $8010.w, Musashi writes the result back to $ff8010 whereas Moira writes back to $8010. The discrepancy got trapped by my sandbox:

Instruction: ori.b   #$0, $8010.w

ACCESS 0 DOESN'T MATCH:
i:  0  Type: Poke8   Addr:   8010  Cycle: 18  Value:    0  

ACCESS RECORD:
i:  0  Type: Poke8   Addr: ff8010  Cycle:  0  Value:    0

Here is the corresponding Musashi code:

static void m68k_op_ori_8_aw(void)
{
    uint src = OPER_I_8();
    uint ea = EA_AW_8();

    uint res = MASK_OUT_ABOVE_8(src | m68ki_read_8(ea));

    m68ki_write_8(ea, res);

    FLAG_N = NFLAG_8(res);
    FLAG_Z = res;
    FLAG_C = CFLAG_CLEAR;
    FLAG_V = VFLAG_CLEAR;
}

EA_AW_8 does sign-extension which means that $8010 becomes $FFFF8010. Because the 68000 has a 24-bit address bus, this is cropped to $FF8010.
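A minimal sketch of that address computation, assuming nothing more than the sign extension and the 24-bit address bus described above. The function name `absShortBusAddress` is made up for this example and is not part of Musashi or Moira.

```cpp
#include <cstdint>

// Address that an absolute short operand (xxx).W puts on the 68000's
// 24-bit address bus.
inline uint32_t absShortBusAddress(uint16_t extensionWord)
{
    // Sign-extend the 16-bit extension word to 32 bits ...
    uint32_t ea = extensionWord;
    if (extensionWord & 0x8000) ea |= 0xFFFF0000;

    // ... and keep only the 24 address lines of the 68000.
    return ea & 0x00FFFFFF;
}
```

With this, `absShortBusAddress(0x8010)` yields `0xFF8010`, the address seen in the access record, while `0x7FFF` stays `0x7FFF`.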

What is going on here? 🤔 When I use a word operand (####.w), is it really treated as a signed number? I can't really believe this...

dirkwhoffmann commented 4 years ago

Indeed, the 68000 assembler issues an error message. Apparently, word addresses are treated as signed numbers 🤭:

error 2033 in line 43 of "ori1.s": absolute short address out of range
>   ori.b #$C5,$8010.w

mithrendal commented 4 years ago

Also, lea $8010.w,a1 gives an "absolute short address out of range" error, whereas lea $7fff.w,a1 is accepted....

It is apparently not specific to ori...

Somehow ori reminds me of Tolkien's dwarf names. Don't know why. Let's see what the dwarf names were ... Dwalin, Balin, Kili, Fili, Dori, Nori, Ori, Oin, Gloin, Bifur, Bofur, Bombur and Thorin.

Wow, there is even a fandom page for Ori !!! Did not know that !

https://lotr.fandom.com/wiki/Ori

Ori was born in the late Third Age ... and has died 🙈 in TA2994 (does that mean TolkienAge?)

Completely fictional fact: Much later, in modern days, they invented a microprocessor instruction and named it in his honor....

dirkwhoffmann commented 4 years ago

Dwalin, Balin, Kili, Fili, Dori, Nori, Ori, Oin, Gloin, Bifur, Bofur, Bombur and Thorin.

Oh, I see, it's this guy Moira has trouble with. Anyway, I think she is too young to have a friend (nope, no co-processors yet, Moira!).

Surprisingly, I can't find ORI's brother ANDI 😅. His opcode pattern is 0000 0010 xxxx xxxx.

and has died 🙈 in TA2994

Hmmm, sounds like a number of a CPU trap to me, but maybe I am just coding too much these days 🤓.

What's new? Musashi is playing happily with Moira in her new sandbox. As expected, she still can't count properly 🙄, but I'm pretty confident she's going to improve over time...

Instruction: btst    D0, $80008010.l
Instruction: btst    D0, (-$8000,PC); ($ffff9002)
Instruction: btst    D0, (PC,A0.w)
Instruction: btst    D0, #$0
Instruction: bchg    D0, D0

MISMATCH FOUND (opcode $140 out of $FFFF):

Instruction: bchg    D0, D0

    Musashi: PC: 1002 Elapsed cycles:  8
      Moira: PC: 1002 Elapsed cycles:  6

dirkwhoffmann commented 4 years ago

Problems ... and more problems ... 🙈

According to the M68000 User’s Manual, Ninth Edition, BCHG D0,D0 takes 12 cycles? No?

In Musashi, however, the cycle count is hard coded to 8:

{m68k_op_bchg_32_r_d         , 0xf1f8, 0x0140, {  8,   8,   4,   4}},

In portable68000 and Denise, the cycle count varies depending on the bit number:

template<uint8_t Mode> auto M68000::cyclesBit(uint8_t bit) -> void {
    uint8_t cycles = 0;

    switch(Mode) {
        case Btst: cycles = 2; break;
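        case Bclr: cycles = 2; // no break: falls through and adds the bit-dependent cycles below
        case Bset:
        case Bchg:
            // Bit numbers above 15 cost two more cycles than bit numbers 0 to 15
            cycles += bit > 15 ? 4 : 2;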
            break;
    }
    ctx->sync( cycles );
}

Looks like total anarchy here 🥺.

mithrendal commented 4 years ago

We could make a program which runs millions of bchg d0,d0 instructions, run it on the A500 MMSE and measure the time to get the exact number, no?

dirkwhoffmann commented 4 years ago

We could make a program

Yes, I think we need to write a test-case and run it on the MMSE 😎.

There is no need to run a million BCHGs though. We can let the VSYNC interrupt handler start the execution, run BCHGs until the raster beam reaches the middle of the screen and change the background color.

I already tried to do that, but I screwed it up ... obviously 🤭

dirkwhoffmann commented 4 years ago

It really can't be that difficult to write such a test-case 😖

mithrendal commented 4 years ago

There is no need to run a million BCHGs though. We can let the VSYNC interrupt handler start the execution, run BCHGs until the raster beam reaches the middle of the screen and change the background color.

Yes, that is even better... we see the result immediately 😃.

Assumption: 226 DMA cycles are available in a horizontal scan line. 1 DMA cycle is 2 CPU cycles. bchg d0,d0 is 12 CPU cycles = 6 DMA cycles.

Then the CPU should be able to process 37 bchg d0,d0 instructions in a line (226 / 6, rounded down).

The plan, if I understand you correctly, is to start at the vertical blank and let the CPU execute the instruction 37 x 100 = 3700 times. At the end of all the instructions, we draw a color. If we then see the color at scan line 100, the cycle length of 12 for bchg d0,d0 was correct...
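The arithmetic behind this estimate can be written down as a small sketch. It only uses the assumptions stated above (226 DMA cycles per line, 2 CPU cycles per DMA cycle) and ignores any DMA activity that would steal cycles from the CPU. The constant and function names are made up for this example.

```cpp
// Assumptions from the estimate above, not measured values.
constexpr int DMA_CYCLES_PER_LINE      = 226;
constexpr int CPU_CYCLES_PER_DMA_CYCLE = 2;
constexpr int CPU_CYCLES_PER_LINE      = DMA_CYCLES_PER_LINE * CPU_CYCLES_PER_DMA_CYCLE;

// Number of complete instructions that fit into one scan line.
constexpr int instructionsPerLine(int cyclesPerInstruction)
{
    return CPU_CYCLES_PER_LINE / cyclesPerInstruction;
}

// Number of complete scan lines covered by a run of identical instructions.
constexpr int fullLinesSpanned(int count, int cyclesPerInstruction)
{
    return count * cyclesPerInstruction / CPU_CYCLES_PER_LINE;
}
```

`instructionsPerLine(12)` gives the 37 from above. Calling `fullLinesSpanned` with the different candidate cycle counts (12, 8, 6) shows how much the height of the colored band should differ between them.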

mithrendal commented 4 years ago

[image]

This is a test of my program in FS-UAE: red from scanline 0 to 64, green while 100*37 bchg d0,d0 instructions execute, and blue for the rest of the scanlines.

Impossible that green spans 100 lines 🙈...

Oh I see, I tested it on an A1200 configuration.

Here is the program again on an A500:

[image]

Better, it looks like a lot more lines ... but are these really 100, as it should be for bchg d0,d0 with 12 cycles? It looks like it takes fewer than 12 cycles. I have to test on the A1000 ...

mithrendal commented 4 years ago

[image]

The height of the 100 yellow lines proves that green is not 100 lines high... ahem, in FS-UAE.

dirkwhoffmann commented 4 years ago

You've managed to write a working test case. So cool 😎.

Mine is still buggy 😕:

BTW, you don't need to count scanlines. Simply substitute a command for which we know how many cycles it needs and compare the images. I guess one of the dwarf instructions will do: ORI or ANDI 😅

mithrendal commented 4 years ago

Simply substitute a command for which we know how many cycles it needs and compare the images.

OK, let's execute ori instead of bchg and see how the CPU of FS-UAE times them...

[image] [image]

Left picture: 390 times ori #0,d0, execution time in green

Right picture: 390 times bchg d0,d0, execution time in green

BTW: I made a mistake. I did not execute 37x100 instructions as I mentioned before, but 39x100.

So we know that FS-UAE (probably WinUAE as well) emulates the execution time of bchg d0,d0 with exactly twice the cycles of ori #0,d0.

I still have to create an ADF from it and throw that onto the A1000 ...

The program is here: timing_test.s.zip

mithrendal commented 4 years ago

The new combined test is here as an ADF (bchg_ori_test.adf.zip) and as source code (bchg_ori.s.zip).

It produces 100 yellow lines, 370 bchg d0,d0 executions in darker green, and 370 ori #0,d0 executions in lighter green.

[image]

(picture FS-UAE setting high compatible CPU 68000)

Could you throw the adf onto the A500 MMSE and see what the correct timings are?

I found the sister of Ori !! Apparently Lea played in a completely different film genre though... [image]

dirkwhoffmann commented 4 years ago

I found the sister of Ori !!

😳 All of a sudden, I am losing interest in the princess being held prisoner in Defender of the Crown.

But wait, wasn't she the sister of Luke? 🤔 There is no LUKE instruction though. Just a LINK instruction. Maybe LINK Skywalker sounded so stupid that they changed his name for the movie. This could also be the reason why they did another Star Wars movie. They finally reveal his real name? No?

Could you throw the adf onto the A500 MMSE and see what the correct timings are?

I'll do in a minute...

In the meantime, I also managed to fix my test case, so we have two now. My test utilises the Copper to trigger interrupts and I am performing the tests in the interrupt handlers. I have set up 6 interrupt handlers (priority 1 to 6), so I can run multiple timing tests in parallel. Here is the result in UAE:

Colors:

Blue: Copper wakes up to trigger the IRQ
Red: CPU enters the IRQ routine and sets up test case data
Yellow: The actual timing test

Test lines:

1: Running 12 NOPs, accounting for 48 cycles in total
2: Running 16 NOPs, accounting for 64 cycles in total
3: Running 8 BCHGs with shift value $00
4: Running 8 BCHGs with shift value $10
5, 6: Same as 3 and 4 with another destination register

Conclusion (for UAE):

The shift value does affect timing
For shift value $00, BCHG consumes 6 cycles
For shift value $10, BCHG consumes 8 cycles
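The UAE result above can be written down as a tiny timing function. This is only a sketch of the emulator's observed behaviour, not of real hardware. The function name is made up, and the threshold between bit 15 and bit 16 is an assumption borrowed from the portable68000 snippet quoted earlier, since only the values $00 and $10 were measured. The modulo 32 reflects that bit numbers wrap around for a data register destination.

```cpp
#include <cstdint>

// Cycle count of BCHG Dn,Dm as observed in UAE in the test above.
// Assumption: everything up to bit 15 behaves like $00 and everything from
// bit 16 on behaves like $10 (only those two values were actually measured).
inline int bchgRegisterCyclesUAE(uint32_t bitNumber)
{
    uint32_t bit = bitNumber & 31;   // data register destination: modulo 32
    return bit > 15 ? 8 : 6;
}
```

If the real machine agrees, this matches the bit-dependent scheme of portable68000 rather than Musashi's fixed 8 cycles.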

I'm curious what the real machine will do. The bookmakers are now open. Please place your bets...

dirkwhoffmann commented 4 years ago

Here is a tricky one:

$4784: chk.w   D3, D3

ACCESS 2 DOESN'T MATCH:
i:  2  Type: Poke16  Addr: 7ffa  Cycle: 22  Value: 2700  

ACCESS RECORD:
i:  0  Type: Poke16  Addr: 7ffc  Cycle:   0  Value:    0  
i:  1  Type: Poke16  Addr: 7ffe  Cycle:   0  Value: 1002  
i:  2  Type: Poke16  Addr: 7ffa  Cycle:   0  Value: 2708

The mismatch is caused by the N bit in the status register. Musashi sets it to 1 before pushing the status register to the stack and Moira leaves it at 0.

OK, let's RTFM:

🤨 In our case, [Dn] < 0 and [Dn] > [<ea>] are both true, so the manual doesn't help. Note: In hardware design, "undefined" is usually another word for "we don't care" or "we don't know".

How can we figure out which one is correct? The command initiates exception processing which means that the next command in my program is not executed. We need to write an exception handler that verifies the N flag for us 😬. Has anybody written such a thing before? No? 🙄

dirkwhoffmann commented 4 years ago

We need to write an exception handler that verifies the N flag for us

OK, trap handlers are as easy as interrupts... stay tuned 😎

dirkwhoffmann commented 4 years ago

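Here is my exception handler:

chkHandler:
    bmi     chkHandler2         ; N flag set? Then take the red branch
    move.w  #$0F0,$DFF180       ; N clear: turn the background green
    rte
chkHandler2:
    move.w  #$F00,$DFF180       ; N set: turn the background red
    rte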

UAE:

vAmiga (Musashi):

And the winner is ... 😴

dirkwhoffmann commented 4 years ago

And the winner is ... Musashi 👏
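The outcome can be summarized in a small sketch of the CHK.W decision. It checks the negative case first, so N ends up set when both conditions hold, which is the behaviour reported for Musashi above. The function name and the way the status register is passed in are assumptions made for this example; only bit 3 (the N flag, the difference between $2700 and $2708 in the access record) is modelled.

```cpp
#include <cstdint>

constexpr uint16_t FLAG_N_MASK = 0x0008;   // N flag, bit 3 of the status register

// Sketch of CHK.W: returns true if the CHK exception has to be taken and
// updates the N flag in sr accordingly. Other flags are left untouched.
inline bool chkWord(int16_t dn, int16_t bound, uint16_t &sr)
{
    if (dn < 0) {            // checked first: N = 1 wins if both cases apply
        sr |= FLAG_N_MASK;
        return true;
    }
    if (dn > bound) {        // upper bound violated: N = 0
        sr &= static_cast<uint16_t>(~FLAG_N_MASK);
        return true;
    }
    return false;            // value in range: no exception
}
```

Starting from a status register of `0x2700`, a negative register value leads to `0x2708`, the value that is pushed to the stack in the access record above.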

dirkwhoffmann commented 4 years ago

I was curious to see if vAmiga and Moira happen to like each other. Unfortunately, not so much yet 😕.

So, what is going on here? We are right at the beginning of the Kickstart Boot Rom (the same place where we've been exactly a year ago 😲):

        ; Set up the Exception Vector Table.  Vectors 2 through 47
        ; (Bus Error through TRAP #15) are all set to the initial
        ; exception handler.  If any exception occurs now, the screen
        ; will turn yellow, the power light will flash, and the computer
        ; will be reset.

FC0136  move.w    #8,A0             Start at address 8 (vector #2).
FC013A  move.w    #$2D,D1           Do 46 vectors.
FC013E  lea       FC05B4(PC),A1     Address of initial exception handler.
FC0142  move.l    A1,(A0)+          Set one vector
FC0144  dbra      D1,FC0142(PC)     Loop back.
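For readers who don't speak 68k assembly, here is a rough C++ model of what this loop does. It is only a sketch: the table size of 64 entries and the function name are my own choices. The handler address is taken from the listing, and the loop shape mirrors dbra, which runs the body once more than the initial counter value.

```cpp
#include <array>
#include <cstdint>

constexpr uint32_t INITIAL_HANDLER = 0xFC05B4;   // address from the listing

// Models the first 64 exception vectors as an array of 32-bit entries.
inline std::array<uint32_t, 64> setUpVectors()
{
    std::array<uint32_t, 64> vectors{};   // all entries start out as zero

    uint32_t a0 = 8;      // move.w #8,A0   : address of vector #2
    int d1 = 0x2D;        // move.w #$2D,D1 : loop counter (45)

    do {
        vectors[a0 / 4] = INITIAL_HANDLER;   // move.l A1,(A0)+
        a0 += 4;
    } while (d1-- != 0);  // dbra: repeat until the counter has passed zero

    return vectors;
}
```

The body runs 46 times, so vectors 2 through 47 point to the handler afterwards and everything else is untouched.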

Seeing the screen turn yellow means that some exception has happened that should not have happened. At first, I was disappointed, but if I think about it, this is quite good. It means that Moira can already process exceptions 🥳 and she is not color blind (she wrote into the correct memory cell to change the background color). So the question is .... what kind of exception is going on here? 🤔 No, it's not interrupts, I've already checked that... 🤨

dirkwhoffmann commented 4 years ago

I have started to convert the test programs created by cputester into ADFs. First instruction (in alphabetical order) is ABCD:

vAmiga with Musashi 🙈:

vAmiga with Moira 😎:

dirkwhoffmann commented 4 years ago

The time has come to use the big wrecking ball. With the latest check-in, Musashi is gone from the dev branch. I do feel a little sorry about it 😢, because I really liked that core and without its existence, I wouldn't have started the vAmiga project at all.

There is still a lot to do, because big portions of the old wrapper code need to be integrated into Moira (breakpoint support, instruction logging, etc.). To keep things simple, I plan to remove conditional breakpoints, because I never use them myself and a whole lot of code is needed to implement them. (A conditional breakpoint halts the CPU only when a certain condition holds, such as D0 == 42.)

dirkwhoffmann commented 4 years ago

To understand the problem more deeply, I am trying to learn what vAmiga's Agnus controller does in the current implementation. I spotted the code partly in agnus.cpp and memory.cpp, but I cannot easily see the behaviour.

I have reimplemented bus sharing with Moira in hand. The code is much much cleaner now:

Here is the run loop (the outermost loop of the emulator thread):

    do {

        // Emulate the next CPU instruction
        cpu.execute();

        // Check if special action needs to be taken
        if (runLoopCtrl) {
            ...
        }
    } while (1);

Here is function Moira::sync()

void
CPU::sync(int cycles)
{
    // Advance the CPU clock
    clock += cycles;

    // Emulate Agnus up to the same cycle
    agnus.executeUntil(CPU_CYCLES(clock));
}

Here is Agnus::executeUntilBusIsFree()

void
Agnus::executeUntilBusIsFree()
{
    DMACycle delay = 0;

    // Return immediately if the bus is free
    if (busOwner[pos.h] == BUS_NONE) return;

    // Execute Agnus until the bus is free
    do {
        execute();
        delay++;
    } while (busOwner[pos.h] != BUS_NONE);

    // Add wait states to the CPU
    cpu.addWaitStates(AS_CPU_CYCLES(DMA_CYCLES(delay)));
}

I have to admit that the code is completely untested yet 🤭. For now, I'm really happy that the code architecture has become so simple by replacing Musashi with Moira.
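To make the idea easier to play with, here is a self-contained toy version of the bus arbitration loop. It is not the vAmiga code: `ToyAgnus`, the two-valued `BusOwner` enum and the returned wait state count are simplifications invented for this sketch. It assumes that one DMA cycle equals two CPU cycles and that at least one slot in the line is free.

```cpp
#include <cstddef>
#include <vector>

enum class BusOwner { None, Dma };

struct ToyAgnus {
    std::vector<BusOwner> busOwner;   // bus owner for each slot of the line
    std::size_t h = 0;                // current horizontal position

    // Advance the toy Agnus by one DMA cycle.
    void execute() { h = (h + 1) % busOwner.size(); }

    // Run until the bus is free and return the resulting CPU wait states.
    // Assumes at least one free slot, otherwise this loop would never end.
    int executeUntilBusIsFree() {
        int delay = 0;                        // elapsed DMA cycles
        while (busOwner[h] != BusOwner::None) {
            execute();
            delay++;
        }
        return 2 * delay;                     // 1 DMA cycle = 2 CPU cycles
    }
};
```

If the bus is blocked for two slots, the CPU picks up four wait states. If the bus is free right away, the call returns zero and the position does not move.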

dirkwhoffmann / vAmiga

Custom CPU implementation #251