Multi-CPU support - Githubissues

iamgreaser commented 8 years ago

This is probably not a goal of the project but it sorta works already and there's not much that's required to make it work properly.

Basically, these things need to be done:

Step 1: Give CPUs a bus ID

Pretty self-explanatory. No actual API needs to be provided, just make sure there's at least one byte available.

Step 2: Make it possible for a CPU to know its own ID

Most logical approach: Add an I/O port to the bus controller to query the current component's bus ID.

Step 3: Add support for synchronisation primitives

Well, there's basically two things that need to be done here: one which covers just about every sync primitive and is nice and easy to implement and such an approach is only necessary for if we ever run stuff in different threads, and one which covers the sync primitive I and maybe ds84182 will be using and is a bit tricky.

Fun fact: Neither of these methods are used on the Z80.

Step 3a: Support for atomic locking

Only required if you actually run several CPUs on a thread, but x86, 680x0 and ARMv3+ at the very least will benefit from this.

If CPUs ever end up running on multiple threads, it may be necessary to be able to lock an address for atomic operations.

lockBus() and unlockBus(), perhaps? The other option is to use synchronized(bus.getLock()) { ... } which is a bit ugly on the encapsulation side but less likely to explode in your face, and because it's safer to actually use I prefer it.

It's probably OK to not require a lock on read, just grab the lock on write. The greatest portion of your data accesses are reads, so this shouldn't be a major performance hit, especially if your CPU implements a cache. If anything needs a lock to merely read from an address, it's probably an I/O device.

Anyway, this covers CAS, TAS, and various other locked atomic ops.

Step 3b: Support for LL/SC

This is what's used in several RISC CPUs (Alpha, ARMv6+, MIPS II+, PowerPC).

This will need an API. If this turns out to be wrong I'll be willing to jump in and fix it but here we go:

On write, check the address against a mask (suitable default when not using LL/SC: llbase=1, llmask=0, lladdr=don't care, llcpuid=-1). If ((addr&llmask)^llbase)==0, then reset to default values. You may need further checks than this if you are supporting unaligned reads/writes on the bus - (((addr+size_in_bytes-1)&llmask)^llbase)==0 would make sense too.
On LL (tentative sig: long readLL(int cpuid, long addr, long mask)), set llcpuid to cpuid, set llmask to mask, set llbase to (addr&~llmask), set lladdr to addr.
On SC (tentative sig: boolean writeSC(int cpuid, long addr, long data)), ensure that cpuid==llcpuid. If this is true, then write and return true. Otherwise, don't write, just reset to default values and return false. Of course, your CPU is ultimately responsible for making sure that the SC address + width was the same as the LL address + width.

If you'd rather I implemented this, then please feel free to just create the sigs in a suitable spot and I'll fill them in. Suitable stubs for now: readLL just goes to read, and writeSC just returns false.

As much as it would be easier to just lock the bus when LL is used and unlock it when SC is used, this would mean that anyone could deadlock the game server, so this shortcut will not be an option.

iamgreaser commented 8 years ago

I've had a couple more thoughts for this.

Firstly: Should EEPROMs be inserted into CPUs rather than read over the main bus? This should make booting a lot cleaner.

Elaborating: Z80 boots at 0x0000, MIPS boots at 0xFFFFFFFFBFC00000, 8086 boots at 0xFFFF0, 6502 boots at the address pointed by 0xFFFC, 68000 boots at the address pointed by 0x000004, ARM boots at 0x00000000(!)... whatever the hell you use, you do need to make sure it boots, and it's nice to be able to handle it per-CPU rather than risking having to polyglot things.

And now for the second thought: Should each CPU be given a share of each per-tick timeslice, instead of all fighting over it? This should reduce the stress on servers if people want to make a multi-CPU system.

That is, if you have e.g. a 3.58MHz Z80, and a 7.16MHz 68000, then assuming the bus isn't released early, the Z80 gets 89.5k cycles, and the 68000 gets 179k cycles per tick, instead of the usual 179k and 358k respectively.

I'm thinking these bus-release operations should also be available:

Wait Until Tick End: CPU gives up its cycles for the given tick and distributes them among the other CPUs.
Yield: CPU gets rescheduled to the end of the tick and gives up half of its remaining cycles, or all remaining cycles if there are fewer than, say, 1000*(active_other_cpus+1) cycles remaining.
Halt: For now, behave like Wait Until Tick End; however, this is what you send when you want to make it clear that you can be woken by an interrupt.

Alternatively they could all be called Stop, Halt, and Aufhoer, but that would just be taking the piss out of those CPUs that have both Halt and Stop as opcodes.

fnuecke commented 8 years ago

Multiple CPUs on one bus are indeed something I at least didn't want to make impossible! That's the reason power state/ticking is handled via the bus controller, instead of the CPU.

Right now they'd still all be updated by the one worker thread updating the bus. So - at least for now - locking would not be required. If we actually fully want to support this though, running them on separate workers would probably be nice (to not require cycle sharing, as you suggested).

EEPROM

~~Here's what I thought of, but kind of avoided implementing for now :P~~

~~EEPROMs go into the EEPROM reader, primarily to avoid having multiple block types with shared functionality (i.e. multiple CPU blocks that all have an EEPROM inventory and access it).~~
~~The reader has a configurable target address, to which it will copy the contents of the EEPROM in it upon boot.~~

~~Upside, flexible, downside, user has to manually set the address and know the address for the CPU that's being used.~~ For the "boots at address pointed by" cases (which I didn't know before :P), well. Either they default to zero, which would be a little tight for 0x4, or those CPUs have to have that start address configurable and write it to the pointer address upon boot.

~~I'm not fixated on that approach, yet.~~ Having stuff "auto-configured" by having the EEPROM in the CPU block would be nice. In particular as the address can then be a combined one in the "boot at address pointed by" cases. Hmm. Yeah, I guess that'd be better. Will make implementing a CPU for addons a bit harder, but eh :P

sharing cycles

At least if/while they'd run on one thread, that'd make a lot of sense, agreed.

iamgreaser commented 8 years ago

Right now they'd still all be updated by the one worker thread updating the bus.

In that case, 3a is already done.

If 3b is implemented by simply making sure that CPUs which use LL/SC drop the lock when their processing slice time is finished, then that could help to ensure bus accesses remain fast. It would, however, result in a few retries more than would be ideal from the in-game software perspective, but it doesn't take many cycles to retry LL/SC.

Running each CPU on a separate worker would make things very complicated, and would also make it hell for anyone running a server to ensure 20tps. Consider that the MIPS3 core, running at 2MHz, currently uses about 80% of a Sandy Bridge i5-2450M @ 2.5GHz (turboboost freq is 3.1GHz) on the Sun JRE, and about 50% of a Skylake i5-6500 @ 3.2GHz (turboboost freq is 3.6GHz) on OpenJDK. Code is rendering a mandelbrot, runs in usermode, and uses a writeback cache, so the bottleneck is definitely not the memory interface.

Will make implementing a CPU for addons a bit harder, but eh :P

private int read8(long paddr) {
    if((paddr>>12) == (bootRomVector ? 0x0 : 0xF)) {
        return eeprom.read(paddr&0xFFF);
    } else {
        return memory.read(paddr);
    }
}

Seems easy enough.

fnuecke commented 8 years ago

cpu load

Hrm, shared cycles/throttling does sound like the more sane thing then, yeah. Allright then.

addons

I was more thinking of the block having to have an inventory, having to filter for EEPROMs, accessing the data on the EEPROM. Once that's done, absolutely, copying it into memory on boot is easy. It's not terribly hard to get there, it's just a few more hoops to jump through.

iamgreaser commented 8 years ago

Copy-on-boot is kinda asking for trouble in a multi-CPU setup. Best to bypass the BC for a given address range.

For instance, with MIPS you'd map the EEPROM directly on the 1FCxxxxx range (if((paddr>>20)==0x1FCL)). With Z80 you'd probably map it to 0xxx on reset and make it possible to, using a CPU-private I/O port, unmap this region or maybe remap it to a more suitable position (Fxxx for instance - you don't have to make the remapping arbitrarily-selectable).

standard inventory check that will most likely be identical for every CPU

Abstract class, anyone?

fnuecke commented 8 years ago

Allright, so I didn't have mapping addresses around at runtime in mind, really, at least not via software. But after some mulling it over, here's how I think I'll approach it: have the bus controller have multiple valid configurations, and allow switching between those (manually or via software).

abstract class

Until someone already has another abstract base class for all their tile entities. Like almost everyone does ;) (And I don't want to force addons/integration into using my component system.)

fnuecke / Circuity

Multi-CPU support #17

Step 1: Give CPUs a bus ID

Step 2: Make it possible for a CPU to know its own ID

Step 3: Add support for synchronisation primitives

Step 3a: Support for atomic locking

Step 3b: Support for LL/SC