flipacholas / Architecture-of-consoles

Technical articles about console architecture
https://www.copetti.org/writings/consoles/
Creative Commons Attribution 4.0 International

Some commentary on the 3DS and ARM #234

Closed joha4270 closed 1 year ago

joha4270 commented 1 year ago

ARM’s licensing model happens to be favourable to Nintendo as they have always offered synthesisable designs

It's really ARM's business model, isn't it? ARM does not really sell physical processors, just IP cores and licensable designs.

The new multi-core instructions consist of Store and Load opcodes with special care for synchronisation.

Are we talking about atomic loads/stores, aka LDREX and STREX, here? I do realize they are called STR & LDR, but "Store" in code makes my brain go to the humble store, and having it refer to anything else makes my brain hurt. (There are also the memory barriers, which I think were added in ARMv6, but those are a whole different kettle of fish and I'm not qualified to talk intelligently about them.)
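To make the difference from a plain store concrete, here is a toy Python model of the load-exclusive/store-exclusive idea. It is only a sketch: the `ExclusiveMonitor` class and the `atomic_increment` helper are made-up names, a single memory word stands in for real memory, and the real local/global monitors have more rules than this. The one behaviour it tries to capture is that the exclusive store fails (returns 1) if someone else wrote to the location after the exclusive load, so the sequence has to be retried.

```python
class ExclusiveMonitor:
    """Toy model of one memory word guarded by an exclusive monitor."""

    def __init__(self, value=0):
        self.value = value
        self.reservations = set()  # cores that currently hold a reservation

    def ldrex(self, core):
        # Load the value and tag the location as reserved for this core.
        self.reservations.add(core)
        return self.value

    def strex(self, core, new_value):
        # Store only if this core's reservation survived.
        # Returns 0 on success and 1 on failure, like the STREX status result.
        if core not in self.reservations:
            return 1
        self.value = new_value
        self.reservations.clear()  # a successful store invalidates everyone
        return 0

    def plain_store(self, new_value):
        # Simplification: an ordinary store clears every reservation.
        self.value = new_value
        self.reservations.clear()


def atomic_increment(mem, core, interfere=None):
    """Retry loop in the shape of the usual LDREX/STREX sequence."""
    attempts = 0
    while True:
        attempts += 1
        old = mem.ldrex(core)
        if interfere is not None and attempts == 1:
            interfere(mem)  # another core writes between the load and the store
        if mem.strex(core, old + 1) == 0:
            return attempts
```

Calling `atomic_increment(mem, core=0)` with nothing interfering succeeds on the first attempt; passing an `interfere` callback that does a `plain_store` forces one retry, which is the "special care for synchronisation" part.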

ARM’s CPUs speak many ‘languages’. In the case of an ARM11-based core, you are provided with:

One here is not like the others. VFP uses the so-called "coprocessor instructions", but a coprocessor in ARM terminology is an extension point, not a separate unit.

As another note, there are also 32-bit Thumb instructions, but the point is indeed the 16-bit ones.

  • armel : unoptimized, compatible with ARMv4T onwards.

I don't think it's fair to describe it as "unoptimized". It has less hardware to work with and is thus inevitably slower for some things. I can outpace Usain Bolt on my bike, but that does not make him slow; it just means I have more hardware to make me go fast.

And well, the x86 never got quite as fragmented. This probably relates to how ARM cores can be customized a lot more with optional features and selectable performance-efficiency-cost tradeoffs, so there are a lot of different ARM cores out there. You don't usually see the same kind fragmentation in x86.

That doesn't mean there aren't a lot of different CPUs with different instructions in x86/x64, and you can tune for a specific one. GCC (and other compilers) has a huge number of possible profiles that can run faster on specific machines at the cost of not running at all on some others.

The AXI bus

Calling AXI "specialized" is something of a reach. It's used by ARM, some RISC-V cores, FPGAs and FPGA IP. It's probably the most popular way for a processor to talk with the world. That's the source of some of my problems with this part: it mixes up "AXI the protocol" and "this very specific implementation using AXI". It also just gets some details wrong.

a bus topology.

AXI is just an interface. It specifies nothing about the topology. Nothing prevents making a token ring that speaks AXI. Some fancy FPGAs have AXI networks.

there will be a master-slave hierarchy imposed to maintain order

It's not to maintain order. It's a classification based on whether the interface shows initiative (CPU, DMA) or not (RAM, hardware, DMA). Every AXI transaction starts at a master and ends at a slave.

AXI uses a dedicated block

AXI does not. This implementation possibly does.

In doing so, AXI overcomes the limitations of high-bandwidth components sharing the same bus

As above. You can certainly make a fully connected AXI interconnect supporting arbitrary connections from all slaves (from where an AXI master sends commands to the interconnect) to all masters (where transactions then go on to another slave), if you're willing to pay the cost in silicon. (You often are, since CPUs are latency-sensitive, but again, that's an implementation of AXI, not the spec.)

And finally, for the picture: there is no such thing as a slave bus or a master bus. There is a slave interface connected to a master interface, but the link between them is just an interface. I haven't tracked down the DMA, but it probably has both a master and a slave interface, both connected to the interconnect: the slave interface for receiving instructions from the CPU and the master interface for executing them.
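A rough way to picture the roles, as a Python toy rather than anything resembling real AXI signalling: masters start transactions, slaves only answer them, and the interconnect just decodes the address. Everything here is hypothetical, including the class names, the address map and the DMA register layout (the real DMA-330 is programmed differently). The point is only that the same DMA block sits on both sides: it is a slave when the CPU writes its registers and a master when it performs the copy.

```python
class Ram:
    """Slave only: answers reads and writes, never starts a transaction."""
    def __init__(self):
        self.cells = {}
    def read(self, offset):
        return self.cells.get(offset, 0)
    def write(self, offset, value):
        self.cells[offset] = value

class Interconnect:
    """Routes each transaction from a master to the slave that owns the address."""
    def __init__(self):
        self.regions = []  # (base, size, slave), a made-up address map
    def attach_slave(self, base, size, slave):
        self.regions.append((base, size, slave))
    def _decode(self, addr):
        for base, size, slave in self.regions:
            if base <= addr < base + size:
                return slave, addr - base
        raise ValueError(f"no slave mapped at {addr:#x}")
    def read(self, addr):
        slave, offset = self._decode(addr)
        return slave.read(offset)
    def write(self, addr, value):
        slave, offset = self._decode(addr)
        slave.write(offset, value)

class Dma:
    """Both roles: a slave side (registers) and a master side (bus access)."""
    SRC, DST, LEN, START = 0, 4, 8, 12  # hypothetical register offsets
    def __init__(self, bus):
        self.bus = bus
        self.regs = {self.SRC: 0, self.DST: 0, self.LEN: 0}
    def read(self, offset):             # slave interface
        return self.regs.get(offset, 0)
    def write(self, offset, value):     # slave interface
        if offset == self.START:
            self._run()
        else:
            self.regs[offset] = value
    def _run(self):                     # master interface
        for i in range(self.regs[self.LEN]):
            word = self.bus.read(self.regs[self.SRC] + i)
            self.bus.write(self.regs[self.DST] + i, word)

def demo():
    bus = Interconnect()
    dma = Dma(bus)
    bus.attach_slave(0x0000, 0x1000, Ram())
    bus.attach_slave(0x8000, 0x10, dma)
    # The "CPU" acts as a master: it fills RAM, then programs the DMA.
    for i, value in enumerate([11, 22, 33]):
        bus.write(0x100 + i, value)
    bus.write(0x8000 + Dma.SRC, 0x100)
    bus.write(0x8000 + Dma.DST, 0x200)
    bus.write(0x8000 + Dma.LEN, 3)
    bus.write(0x8000 + Dma.START, 1)
    return [bus.read(0x200 + i) for i in range(3)]
```

Nothing in `Interconnect` cares about topology; it could be a crossbar, a shared bus or anything else behind the same two methods, which is the distinction between the protocol and one implementation of it.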

Well, technically neither master nor slave interfaces exist. ARM has decided to replace the "Master/Slave" terminology with "Manager/Subordinate". I'll be keeping the same terminology as the article.

Faster transfer rates compared to either CPU.

Are you sure about that? I can't discard the possibility that the DMAs are faster, but the real point is likely just that the CPU is free to focus on something else (say, animating a splash screen) while the DMA uses any spare bus capacity in the background.

Also, I'm not sure 8 channels counts as an advantage and not just a feature.

programmers base their algorithms on the multi-threading model:...

You're disallowed from using just a single thread? This section seems a little confused. As long as you can read and write to the instruction pointer (even indirectly) you can do multithreading. Are you possibly trying to say that the 3DS OS provides kernel threads and a scheduler?

and the intellectual property.

Are you possibly getting confused by IP cores? Despite the name, buying an IP core is buying a license, which is not the same as buying the intellectual property. You even write "license" in the next paragraphs.

ARM later improved this technique by supplying a dedicated register called mpidr, found in ARMv7 CPUs.

mpidr is a part of CP15.

And I'm not convinced that it isn't the register CPU ID above actually refers to.

flipacholas commented 1 year ago

Hi,

I can't discard the possibility that the DMAs are faster, but the real point is likely just that the CPU is free to focus on something else (say, animating a splash screen) while the DMA uses any spare bus capacity in the background.

This is a multi-core CPU, so I imagine there will be concurrency already in place, so my point is about identifying the distinct capabilities of the DMA units.

You're disallowed from using just a single thread?

No, I'm not sure where you got that idea. The point of that paragraph is to explain that programming a multi-core CPU is no longer an obscure task, thanks to high-level languages and multi-threading models.

mpidr is a part of CP15.

But that was still introduced with the ARMv7?

I'm going over the rest, thanks!

joha4270 commented 1 year ago

This is a multi-core CPU, so I imagine there will be concurrency already in place, so my point is about identifying the distinct capabilities of the DMA units.

I just don't think "faster" is one of its "distinct capabilities". The CPU has priority on the bus (well, I don't know this, but it's latency-sensitive in a way the DMA isn't, so anything else would be a very "interesting" design), so unless the CPU can't even use half the bus bandwidth (a faster bus would use more power, so I'm going to call that unlikely too), the DMA is always going to be at a disadvantage unless the CPU is doing literally nothing.
There are scenarios where the DMA is faster (an IO peripheral with a small buffer, or none, is the classic example). There are certainly possible designs where the DMA would be faster for bulk copies too. But I would expect to see that in something like a chip for a network switch, not a low-power gaming console. Most games do not need bulk data transfers often. The CoreLink DMA-330 DMA Controller Technical Reference Manual, Revision: r1p2, does not contain the word "faster", so I think you need a slightly more specific source for the claim that "the DMA is faster at copying memory than the CPU" than what you have.
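To illustrate the argument (not to measure anything), here is a small Python sketch of fixed-priority arbitration. The assumption that the CPU always wins the bus is exactly the guess stated above, not a documented fact about the 3DS, and `share_bus` is a made-up helper. Each list entry is one bus cycle; the DMA only moves a word on cycles the CPU leaves free.

```python
def share_bus(cpu_wants_bus, dma_words):
    """Fixed-priority arbitration: the CPU wins every cycle it asks for.

    cpu_wants_bus: one boolean per bus cycle, True if the CPU requests it.
    dma_words: number of words the DMA has to move.
    Returns (cpu cycles used, dma words moved, cycle the DMA finished on or None).
    """
    cpu_done = dma_done = 0
    finish_cycle = None
    for cycle, cpu_request in enumerate(cpu_wants_bus, start=1):
        if cpu_request:
            cpu_done += 1          # assumed: CPU has priority
        elif dma_done < dma_words:
            dma_done += 1          # DMA uses the spare cycle
            if dma_done == dma_words:
                finish_cycle = cycle
    return cpu_done, dma_done, finish_cycle


busy = share_bus([True, True, True, False] * 4, dma_words=4)
idle = share_bus([False] * 16, dma_words=4)
```

With the CPU busy on three cycles out of four, the four-word transfer takes all 16 cycles; with an idle CPU it takes 4. In this model the DMA's benefit is that the CPU's own work is not slowed down, not that the copy itself is quicker.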

You're disallowed from using just a single thread?

No, I'm not sure where you got that idea.

It was a rhetorical question. You spend a lot of words describing that this water is wet, so of course I wonder: is this water somehow special, or is it just needlessly specific?

The point of that paragraph is to explain that programming a multi-core CPU is no longer an obscure task, thanks to high-level languages and multi-threading models.

Multithreading APIs showed up in most operating systems soon after there were multiple cores to play with. Multitasking has overheads, both for the developer and at runtime, so with 1.3 cores to play with, I doubt it was particularly heavily used. Most likely, the vast majority of games had the vast majority of their game logic in a single thread, with a few auxiliary threads.

The point I'm circling around is that I think you should kill that entire paragraph since I think it adds nothing but confusion.

mpidr is a part of CP15.

But that was still introduced with the ARMv7?

From an admittedly brief search, I cannot find anything in the CPUID registers that describes which core it's executing on, but mpidr is specifically described as allowing you to distinguish which core you're running on. The ID_AFR0 register does contain 16 implementation-defined bits which Nintendo could have used, but from over here, that sounds like a decidedly odd way of doing it.
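For reference, this is roughly what "distinguishing which core you're running on" looks like with the ARMv7 MPIDR, which is read through CP15 with `MRC p15, 0, <Rt>, c0, c0, 5`. The Python sketch below only decodes a value that has already been read; `decode_mpidr` is a made-up helper and the sample value is invented, not taken from a 3DS.

```python
def decode_mpidr(value):
    """Split a 32-bit ARMv7 MPIDR value into its affinity fields and flags."""
    return {
        "aff0": value & 0xFF,                       # bits [7:0], typically the core number
        "aff1": (value >> 8) & 0xFF,                # bits [15:8], typically the cluster number
        "aff2": (value >> 16) & 0xFF,               # bits [23:16]
        "uniprocessor": bool((value >> 30) & 1),    # U bit, bit 30
        "mp_extensions": bool((value >> 31) & 1),   # bit 31, set with the multiprocessing extensions
    }


fields = decode_mpidr(0x80000102)  # invented sample value
core_number = fields["aff0"]
```

How the affinity levels map onto physical cores is up to the implementation, so treating `aff0` as "the core number" is the common convention rather than a guarantee.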

flipacholas commented 1 year ago

I just don't think "faster" is one of its "distinct capabilities".

I've amended the block to be more precise about the advantages of DMA, leaving the word 'faster' as a broader adjective that applies to the system as a whole.

It was a rhetorical question. You spend a lot of words describing that this water is wet, so of course I wonder: is this water somehow special, or is it just needlessly specific?

The good thing about dividing the article into small sections is that it allows readers to skip paragraphs if they prefer to do so. My only goal is for readers to focus on the topics they enjoy reading, and everyone has different interests.

The point I'm circling around is that I think you should kill that entire paragraph since I think it adds nothing but confusion.

It's a short paragraph connected to previous articles. Unless the sentences are factually wrong, I prefer to leave it for those interested in that area.

The ID_AFR0 register does contain 16 implementation defined bits which Nintendo could have used, but from over here, that sounds like a decidedly odd way of doing it.

My first finding was that this information would be found on the c0 register, but after sharing the draft with the 3DS hacking/homebrew community, I was told CPUID was used instead.

joha4270 commented 1 year ago

The good thing about dividing the article into small sections is that it allows readers to skip paragraphs if they prefer to do so. My only goal is for readers to focus on the topics they enjoy reading, and everyone has different interests.

I don't think that attitude leads to a better article, but you're the author.

flipacholas commented 1 year ago

In any case, I appreciate your efforts in helping improve the article. I've finished reviewing all your comments. Thank you!