evansm7 / pico-mac

Run the popular umac emulator right on your Pi Pico!
Add PSRAM to allow Mac Plus emulation? #7

Open ajacocks opened 4 months ago

ajacocks commented 4 months ago

Obviously, adding more RAM to the emulated Macintosh isn't really possible on the Pico as is, since it only has ~256k of native RAM. However, adding a PSRAM module is a path that a lot of Pico-based projects have taken.

Have you considered that? Here is a C library for the access to such RAM: https://github.com/polpo/rp2040-psram

Ian Scott used that, I believe, in the PicoGUS, and here is the schematic: https://github.com/polpo/picogus/blob/main/hw/PicoGUS-schematic.pdf

evansm7 commented 4 months ago

Hello! I have, yes. It’s something I’m thinking about, though for the “Pico micro mac” the purpose was very much to see what just the bare 2040 can do, and making it easy to build from spare parts lying around. (There are many ways you can make a much better emulator if memory isn’t an issue…)

On the technical side, adding PSRAM hardware is easy, but using it in such a way that it’s not slow isn’t trivial. For example, every 68k instruction fetch and memory access now gets about 30x more expensive; the total cost of interpreting an instruction will go up by a lesser amount, but I believe it’ll be too slow to do naively. That leads to more complicated schemes, such as caching hot external RAM locations in internal SRAM.

Anyway: it’d be fun to try, and at some point I’ll make a PCB up, but it’s kind of a different project if you see what I mean.

geerlingguy commented 3 months ago

The Pico 2 that launched today has 520 KiB of RAM now, and a lot more compute power (even FPU on the Cortex-M33 cores)... would that have enough overhead, or still need more?

evansm7 commented 3 months ago

Nah, still need more. (It’s not enough for a Mac 512K, though I’ll have a play with doing a Mac 400K or so :) I don’t know what that would enable in terms of new apps/OS though, and biting the bullet and going for 4MB would be a radical improvement.) The most useful part of the 2350 is the XIP/cache interface to PSRAM – that would avoid the horrible tricks I alluded to above. Hoping to prototype something this week…

geerlingguy commented 3 months ago

I'm just surprised Raspberry Pi didn't consider that they'd need more headroom to emulate Mac 512K inside the chip, ha!

doliveira4 commented 3 months ago

Looking forward to reading about that RP2350-based prototype! ;)

jamesfmackenzie commented 3 months ago

Very excited to hear more on this one! 😎

evansm7 commented 3 months ago

Very much just a proof of concept, but it’s alive! This has PSRAM (on a Pimoroni PGA2350), and SD, with original video. It runs 7.5.5 as a 4MB Plus-ish machine; most things seem to work (MacPaint hehe, and even Shufflepuck!).

The video DMA needs work as it’s dropping CPU performance in the current approach – it’s not as fast as the plain internal RAM 128K. Then I need to clean up the build so as to target 2040, 2350, 2350+PSRAM, and SD-or-internal-flash for each of them.

So please be patient! It’ll get there. I also intend to get DVI going, design a board etc. Lots to do, but I'm sharing an early milestone as y'all seemed interested! :):)

geerlingguy commented 3 months ago

Oh wow, this is awesome! Following very closely—I hope to do a pico-mac build with the 2040 soon, and it'd be cool to have a design that can go straight on to Mac OS 7.x with the 2350!

jamesfmackenzie commented 3 months ago

Agree with @geerlingguy - awesome progress and very excited to hear more as it develops! 😎

doliveira4 commented 3 months ago

Nice work!

A thought:

  • Have you considered a (dynamic) cache of the most used Mac RAM / PSRAM in SRAM?

Cheers & do keep us posted ;)

RonsCompVids commented 3 months ago

As long as we can realize the dream of running MacPaint, it's gonna be great! ;)

evansm7 commented 2 months ago

:D OK, I split out the SD support and just pushed that -- and cleaned up some stuff in umac, which can now patch the ROM memory sizing/memTop routines so that you can use weird (non power-of-two, unsupported) memory sizes.

Upshot is (wildly off topic) the RP2040 can run as a "Mac 208K", and with the R/W SD boot volume it can run MacPaint :D

evansm7 commented 2 months ago

Hello,

A thought:

  • Have you considered a (dynamic) cache of the most used Mac RAM / PSRAM in SRAM?

I have, and this is a really interesting approach. It's what I had in mind early in this thread when I was making "it's complicated" noises, for PSRAM/external RAM on an RP2040. On that part, with nothing memory-mapped, you have to do intentional block read/write stuff, and both knowing when to do that and protecting regular accesses from the latency become the key goals. It's a software-implemented cache, and hard to make really fast. I was thinking of something a little like a software TLB, except with a really small "page size" (ca. 32 bytes, for a cache line). So every access would pay a small overhead (e.g. a few instructions to mask address bits, load a cache tag, compare it, and then access the (SRAM) block if it's a hit). Or, if it's a miss, DMA-stream a new block in, and so on. For the RP2350 and the XIP PSRAM, I don't think it's going to be worth it (the HW already presents the memory mapped into an address range, which is just too handy to ignore). But it's interesting.... (thinking about whether there's a sweet spot of block size versus the time it takes to transfer it)
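To make the "software TLB with 32-byte lines" idea concrete, here is a minimal host-side sketch in C of a direct-mapped, write-back software cache. It is not code from pico-mac or umac: the names (`swcache_*`, `psram_model`), the 64-line cache size and the 64 KB backing array are all assumptions for illustration, and a plain `memcpy` stands in for the block transfer that would really go over SPI/DMA to external RAM.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_SHIFT 5u                  /* 32-byte lines, as suggested above */
#define LINE_SIZE  (1u << LINE_SHIFT)
#define NUM_LINES  64u                 /* hypothetical: 2 KB of SRAM used as cache */
#define TAG_EMPTY  UINT32_MAX

/* Stand-in for external PSRAM (hypothetical; real hardware needs a block transfer). */
uint8_t  psram_model[64u * 1024u];
unsigned swcache_hits, swcache_misses;

static uint8_t  lines[NUM_LINES][LINE_SIZE];
static uint32_t tags[NUM_LINES];
static uint8_t  dirty[NUM_LINES];

void swcache_init(void) {
    for (uint32_t i = 0; i < NUM_LINES; i++) {
        tags[i] = TAG_EMPTY;
        dirty[i] = 0;
    }
    swcache_hits = swcache_misses = 0;
}

/* Returns the index of the cache line holding addr, filling it on a miss. */
static uint32_t swcache_lookup(uint32_t addr) {
    assert(addr < sizeof psram_model);
    uint32_t tag = addr >> LINE_SHIFT;  /* mask off the offset bits */
    uint32_t idx = tag % NUM_LINES;     /* direct-mapped slot */
    if (tags[idx] == tag) {             /* tag compare: the hit path */
        swcache_hits++;
        return idx;
    }
    swcache_misses++;
    if (tags[idx] != TAG_EMPTY && dirty[idx])   /* write back the evicted line */
        memcpy(&psram_model[tags[idx] << LINE_SHIFT], lines[idx], LINE_SIZE);
    memcpy(lines[idx], &psram_model[tag << LINE_SHIFT], LINE_SIZE);
    tags[idx] = tag;
    dirty[idx] = 0;
    return idx;
}

uint8_t swcache_read8(uint32_t addr) {
    uint32_t idx = swcache_lookup(addr);
    return lines[idx][addr & (LINE_SIZE - 1u)];
}

void swcache_write8(uint32_t addr, uint8_t value) {
    uint32_t idx = swcache_lookup(addr);
    lines[idx][addr & (LINE_SIZE - 1u)] = value;
    dirty[idx] = 1;
}
```

Even in this simplified form, every emulated access pays for a shift, a modulo, a tag load and a compare before touching data, which is the per-access overhead described above. The block-size trade-off shows up in `LINE_SHIFT`: larger lines mean fewer misses on sequential access but a longer transfer on each miss.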

evansm7 commented 2 months ago

I done experimentz:

This was to examine the concern of putting the frame buffer in PSRAM and then reading it via DMA through the cache, which might displace "useful stuff" from the cache. Ideally mostly read-only data like video wouldn't get allocated (annoyingly the RP2350 doesn't have any kind of "read, lookup in cache, no allocate" or write-through coherent XIP window), and the only way to do that is to read it with uncached methods (either the XIP UC region, or XIP streaming mode).

As a baseline, with video DMA coming from PSRAM (via cached XIP region, data is guaranteed correct), on a 200MHz RP2350 I was getting about 1.53 MIPS at a 7.5.5 desktop at idle, about 1.345 in Missile Command (MC) paused, 1.335 unpaused. (I did two as the paused version is much more stable to measure.)

Moving to plain uncached video DMA (from the XIP uncached region) but with IRQ-driven cache maintenance required to push out any cached frame buffer updates from the CPU, it got about 1.50 MIPS at desktop idle (!), 1.34 MC paused, 1.395 MC unpaused. A little slower in some, 4% faster in others.

I also tried ... for ages ... to get XIP streaming stuff working reliably with PSRAM. That gave 1.60 at idle, 1.395 MC paused, 1.40 MC unpaused. A little better still, but from RP2350 DS sec 4.4.3:

The XIP subsystem performs these reads in the background in a best-effort fashion. To minimise impact on code executed from flash whilst the stream is ongoing, the streaming hardware has lower priority access to the QMI than regular XIP accesses.

I believe this is causing the large amount of shimmering/visual artefacts I'm seeing. The current build has a lot of code in flash, and uses PSRAM reasonably heavily, and so perhaps little old XIP streaming mode doesn't really get a look in. This mode was really quite unusable for video. (I may be holding it wrong: there is at least one bug since the top line or two of video are chopped off, as though the XIP streaming data is starting 128 bytes later than requested, and maybe there's something else going on. I added an extra DMA from PSRAM to a bounce buffer in SRAM (and then when that sucked, added 2 more to try to smooth it out, no luck).)

Uncached stuff needs cache maintenance: I was hoping for all of this to be entirely "automatic" from the DMA system, and tried triggering XIP cache maintenance accesses from DMA, which failed (actually did weird things to memory). This was a long shot -- I was trying to DMA to XIP_MAINTENANCE_BASE with an unaligned word address (addr[2:0]=3 for clean, and size=32b because I wanted the stride to be as large as possible). I really hope I'm doing this wrong too and that there's a way to clean a range "in hardware", but there doesn't seem to be one from what I see so far.
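For reference, the software version of "clean a range" amounts to one maintenance write per cache line. The sketch below is a host-runnable model in C, not the code on the branch: `clean_range`, `maint_write_fn` and `record_op` are hypothetical names, the 8-byte line size is an assumption about the RP2350 XIP cache, and the opcode 3 is taken from the addr[2:0]=3 mentioned above. On hardware, the callback would be a write to XIP_MAINTENANCE_BASE plus the computed offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 8u   /* assumption: XIP cache line size */
#define OP_CLEAN         3u   /* addr[2:0] = 3 for clean, per the comment above */

/* Hypothetical hook standing in for a write to XIP_MAINTENANCE_BASE + offset. */
typedef void (*maint_write_fn)(uint32_t maint_offset);

/* Host-side stub that just remembers the last offset written. */
uint32_t recorded_last;
void record_op(uint32_t maint_offset) { recorded_last = maint_offset; }

/* Issues one clean operation per cache line overlapping [start, start + len).
 * Returns the number of operations issued. */
size_t clean_range(uint32_t start, uint32_t len, maint_write_fn write_op) {
    assert(write_op != NULL);
    if (len == 0)
        return 0;
    uint32_t first = start & ~(CACHE_LINE_BYTES - 1u);
    uint32_t last  = (start + len - 1u) & ~(CACHE_LINE_BYTES - 1u);
    size_t count = 0;
    for (uint32_t off = first; ; off += CACHE_LINE_BYTES) {
        write_op(off | OP_CLEAN);   /* low three bits select the operation */
        count++;
        if (off == last)
            break;
    }
    return count;
}
```

Under these assumptions, a 512x342 1-bit frame buffer (21,888 bytes) needs 2,736 such writes per full clean, which gives a feel for why doing this from an IRQ on every frame is not free.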

So I ended up just enabling the "XIP uncached" version; I'm not totally convinced it really helps for this application, but I want to use (preferably streaming, uncached) video data from PSRAM in other applications, and for higher bandwidths it'll get worse. Specifically for the Mac (where its frame buffer will fit in internal SRAM) I may dig out and retry the hacks to partition its memory into discontiguous chunks, i.e. bulk in PSRAM and the top with FB in SRAM, and avoid these DMA issues that way.
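The "discontiguous chunks" idea boils down to an address decode on every emulated RAM access. Here is a minimal sketch in C under stated assumptions: the names (`mac_ram_ptr`, `psram_model`, `sram_top`) are hypothetical, the 32 KB top chunk is a guess at a size large enough to hold the frame buffer, and a plain array stands in for the memory-mapped PSRAM window.

```c
#include <assert.h>
#include <stdint.h>

#define MAC_RAM_SIZE   (4u * 1024u * 1024u)  /* 4MB machine, as in this thread */
#define TOP_CHUNK_SIZE (32u * 1024u)         /* hypothetical: top chunk holding the FB */
#define SPLIT_ADDR     (MAC_RAM_SIZE - TOP_CHUNK_SIZE)

/* Stand-ins for the two physical memories (hypothetical names). */
uint8_t psram_model[SPLIT_ADDR];   /* bulk of Mac RAM, would live in PSRAM */
uint8_t sram_top[TOP_CHUNK_SIZE];  /* top of Mac RAM incl. frame buffer, in SRAM */

/* Translates an emulated Mac RAM address into a host pointer. */
uint8_t *mac_ram_ptr(uint32_t mac_addr) {
    assert(mac_addr < MAC_RAM_SIZE);
    if (mac_addr >= SPLIT_ADDR)
        return &sram_top[mac_addr - SPLIT_ADDR];
    return &psram_model[mac_addr];
}
```

With a split like this, video DMA would read only from `sram_top`, so it never goes through the XIP cache at all. The cost is an extra compare and branch on each emulated access, which would need measuring against the numbers above.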

But that said, performance is easily faster than a real Mac Plus, so I don't want to rathole here much more.

I've pushed my WIP branch here, https://github.com/evansm7/pico-mac/tree/wip-psram with:

...I'm using something like -DPICO_BOARD=pimoroni_pga2350 -DUSE_PSRAM=true -DUSE_PSRAM_CS=PIMORONI_PGA2350_PSRAM_CS_PIN -DUSE_SD=true -DSD_TX=35 -DSD_RX=32 -DSD_CS=33 -DSD_SCK=34 -DSD_MHZ=10 for cmake options. (Pretty rushed soldering!)

Anyway. That's video. Other thing I wanted to note is by moving the Mac's memory into PSRAM, the internal SRAM is largely empty, so I tried:

This uses a total of 457KB (quite some free still!). I haven't single-stepped but this should mean that the entire CPU execution loop is out of SRAM, and fast. This then got up to 1.89 MIPS at desktop idle, 1.805 MC paused, 1.81 unpaused -- so a nice 23% boost. (I haven't checked this in -- it's removing the const on the Musashi m68ki_static_instruction_jump_table declarations and putting M68K_FAST_FUNC on everything in m68kops.c)
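As a quick check on the numbers quoted in this comment, the small C helper below computes relative changes between the MIPS figures. The figures themselves come from the text above; the struct layout and the names (`percent_change`, `mips_result`, `results`) are just for illustration.

```c
#include <assert.h>

/* MIPS figures as reported in this comment. */
struct mips_result {
    const char *config;
    double idle, mc_paused, mc_unpaused;
};

const struct mips_result results[] = {
    { "cached XIP video DMA (baseline)", 1.53, 1.345, 1.335 },
    { "uncached XIP + IRQ maintenance",  1.50, 1.340, 1.395 },
    { "XIP streaming",                   1.60, 1.395, 1.400 },
    { "emulator core in SRAM",           1.89, 1.805, 1.810 },
};

/* Relative change from 'before' to 'after', in percent. */
double percent_change(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

Against the cached-XIP baseline, `percent_change(results[0].idle, results[3].idle)` comes out at roughly 23.5%, matching the quoted boost at desktop idle. The same comparison for Missile Command gives about 34% paused and about 36% unpaused, so the gain from running the core out of SRAM looks larger under load.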

Now gotta go do 100 other things... :-) If anyone tries any of this code LMK how you get on.

TimRustige commented 2 months ago

Hi Matt, I've made and ordered a PCB to use with the PGA2350 branch, then noticed you changed the video pins on your protoboard build photo. Can you tell me which source file contains the base video pin setting, so I can change it back to match the older builds before compiling? I did look at video.c and the PIO files but can't see it. Thanks, Tim.

evansm7 commented 2 months ago

Hi Tim, the only reason I did that was because the original pins (now) overlap the HSTX on the 2350, and I wanted to leave them free so I could experiment with DVI later on. If you’re not using that on your board then the original pins should work fine. The base setting is in video.h IIRC (or did I wire it out to CMake? I don’t think I did that yet).

Gadgetjack99 commented 1 week ago

Could you post a UF2 with SD and 208K enabled? I am having trouble compiling on my end.