MiSTer-devel / NES_MiSTer

GNU General Public License v3.0

Stand-alone PPU adaptation #337

Open loglow opened 1 year ago

loglow commented 1 year ago

Hello!

I'm looking to adapt just the PPU code (ppu.sv and anything that it depends on) from this project to run on its own. It would run on a suitable FPGA chip with enough physical pins (>40) to replicate the behavior of an original PPU closely. The behavior of an RGB-output PPU (e.g., RP2C03) would be the primary target. The behavior of a composite-output PPU (e.g., RP2C02) could be a nice secondary target if it would be reasonably straightforward to do so.

I don't have much experience with FPGA development or with Verilog / SystemVerilog. I do have hardware development experience and would be able to handle all the physical design aspects of this project. The end result would be an assembled purpose-built PCB to be used as a stand-in PPU replacement.

I do have some funding available for someone who can assist with this project. The results of this effort will be entirely open-source. Any new, modified, or derivative code would of course be released under the GPL as required, and all board design files will be released CC BY-SA as part of the existing TinyNES project.

If this sounds interesting and/or you're willing to assist, feel free to discuss here, or you can also contact me directly at dan@tall-dog.com and we can talk further about how the funding for such work might proceed.

Thanks, and take care!

Dan

Kitrinx commented 1 year ago

The PPU as it stands in the NES project is not suitable as a pin-accurate replacement for a real PPU. The FPGA implementation largely abstracts away the need for the ALE pin (address latch enable) and the multiplexing of the associated address pins, and it omits composite generation entirely in favor of RGB output. There is a branch in my repo that makes the PPU more closely match the output of the real PPU in terms of timing and clock behavior, but the ALE pin is untested there because it's still unused, and that branch still does not have composite generation.

I think with enough work you could make it function the way you want, but it's going to require writing some code and a lot of testing; it won't just be a drop-in task.
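For readers unfamiliar with the ALE scheme mentioned above: on the real chip the low eight PPU address bits share pins with the data bus, and an external transparent latch holds them once ALE drops. The following is a minimal behavioral sketch in Python, not code from this repo. The names `TransparentLatch` and `ppu_read` are hypothetical, and the model ignores real timing entirely.

```python
class TransparentLatch:
    """8-bit transparent latch (74x373-style), as used outside the PPU."""

    def __init__(self):
        self.q = 0

    def update(self, d, enable):
        # While enable (ALE) is high the output follows the input;
        # once it drops, the last value is held.
        if enable:
            self.q = d & 0xFF
        return self.q


def ppu_read(memory, address, latch):
    """Model one two-phase PPU read from a 14-bit address (hypothetical helper)."""
    high = (address >> 8) & 0x3F   # upper address bits: driven for the whole access
    low = address & 0xFF           # lower bits: on the shared AD pins only while ALE is high

    # Phase 1: ALE high, the shared AD bus carries the low address byte.
    latch.update(low, enable=True)

    # Phase 2: ALE low, the AD bus is released so memory can drive data.
    # Whatever appears on the bus now must not disturb the latched address.
    held_low = latch.update(0xFF, enable=False)

    return memory[(high << 8) | held_low]


vram = bytearray(0x4000)
vram[0x2345] = 0x5A
print(hex(ppu_read(vram, 0x2345, TransparentLatch())))
```

A pin-accurate replacement has to reproduce this two-phase behavior on real pins, which is the part described above as untested.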

loglow commented 1 year ago

Thanks, @Kitrinx, for the reply and analysis, much appreciated!

The (secondary) requirement of composite video output can be deferred, at least for the time being. The priority would be RGB output on pins 14, 15, and 16 (like an RP2C03) along with a composite sync signal output on pin 21. So yes, basically a drop-in replacement for an RP2C03 or similar chip.

I didn't expect this to be a no-brainer task, which is also one of the reasons that I think it would be an enormous amount of work to do myself, being a beginner to FPGAs and Verilog. That's also the main reason why I'm prepared to pay someone for the work involved in getting the functionality to that point. I should mention that I would also cover the cost of any parts or supplies in such an arrangement.

@Kitrinx, is this something that interests you, and if so, do you think you have the skill level and availability to accomplish it?

From my perspective, NES_MiSTer looks like the open-source project that's already closest to this goal, which is why I thought it made sense to approach the folks here about it.

Kitrinx commented 1 year ago

I can't really commit to any kind of time frame, but I'm willing to assist in your efforts. There are probably some hardware comparisons against a real PPU that have to happen to get the timing of some things just right, particularly with the ALE stuff, and I'm not set up to do that here, so that would be up to you.

loglow commented 1 year ago

@Kitrinx, awesome, that sounds great!

Here are some of the basic things on my mind at the moment regarding this:

Looking forward to hearing your thoughts!

Kitrinx commented 1 year ago

I've used Visual 2C02 extensively in working on the PPU, and its 2A03 counterpart for the APU. They are helpful for a lot of things but excruciatingly slow, and can't be assumed to be accurate on an analog level, i.e., the timing of a rising edge, etc.

I can't comment on how much space the PPU would take up, as ALMs and LEs on other FPGAs aren't really 1:1, and there are features in the PPU that don't need to be there in an external implementation, like extra sprites and save states. It does need enough spare block RAM for 2x the OAM RAM and the palette RAM. I think almost all FPGAs will have that available.

As for the palette, the differences between PAL and NTSC palettes aren't really striking, per se. PAL will generally have less skew to its chroma angles at higher luma values, but the same is true of early-model Famicoms. If there's one thing I've learned, it's that nobody will ever be happy with only a single palette. You can use MiSTer's built-in selections as a good guide to what some of the most popular ones are, though.

The PPU has a few options that need consideration:

  1. NTSC, PAL, or Dendy behavior
  2. Famicom or NES reset behavior
  3. Extra sprites (this will likely break the behavior of a lot of cart chips, but work for some things)
  4. Dejittering. I dunno what your plans are around this, but it would be desirable for a lot of modern devices.
  5. The behavior of the 2A03 isn't too difficult. Why not just put the whole damned thing on an FPGA if you're going to do this anyway? Why just the PPU?
  6. ???
  7. MiSTer?

loglow commented 1 year ago

@Kitrinx, thanks for the reply.

I do see the usefulness and limitations of the visual simulations that you described.

How does one go about determining FPGA sizing requirements? Likewise, is there a different class or family of FPGAs (other than ICE40LP) that you would be inclined to recommend? It would make sense to remove any irrelevant code such as save state support, etc.

It's true that it's impossible to make everyone happy with palettes! We could limit it to only original RGB PPU palettes since I believe simpler is better in this case, at least for now.

> NTSC, PAL, or Dendy behavior

I believe all RGB PPUs such as the 2C03, 2C04, and 2C05 use NTSC timings. So this would be NTSC.

> Famicom or NES reset behavior

How would this impact the functioning of the PPU itself?

> Extra sprites

I'm inclined to omit this option. Without an overclocked CPU, is there a compelling reason for this?

> Dejittering

Can you elaborate on this? Is this something that would be done in the chip's code?

> The behavior of the 2A03 isn't too difficult. Why not just put the whole damned thing on an FPGA if you're going to do this anyway? Why just the PPU?

There are three main use cases I can think of:

  1. A drop-in replacement of RGB PPU chips on original arcade hardware such as the PlayChoice-10 and the Vs System main boards and daughter boards.
  2. A less expensive option for RGB modding existing NES consoles than using an RGB PPU or the proprietary NESRGB.
  3. An inexpensive and available RGB output option for the TinyNES.

RGB PPUs are expensive and difficult to purchase, often costing $150-$200 or more, if you can even find them in the first place. As such, having an alternative drop-in replacement is appealing. While the same drop-in replacement functionality could potentially be convenient for the 2A03, 2C02, 2A07, and 2C07, it is specifically the scarcity and cost of RGB PPUs that make this idea attractive.

As a final thought regarding palettes and arcade board compatibility in particular, it may be nice to have configuration for the following. This could either be in code only, requiring them to be set before programming the FPGA, or it could be something external such as a bank of tiny DIP switches on the board.

These goals are secondary for now, as I think the primary goal should be functional 2C03 behavior with zero necessary configuration.

Kitrinx commented 1 year ago

Generally, to figure out the size you have to compile it. You could compile it in Quartus and check there, but as I mentioned, it's not going to be a representation of what the ICE board needs.

TinyNES, why bother? I don't even think that uses a real CPU, does it? So it has the bad pulse channels and all that going on; probably not worth spending money on from my PoV. Using MiSTer would get you more accuracy than that thing.

Upgrade kits to existing NES might be nice. Desoldering the PPU is quite a drag.

Regarding the various options: Famicom and NES PPU behavior differs on reset. On Famicoms, when you press the reset button, the PPU doesn't reset; on the NES it does. Additionally, later PPUs hold reset for one full frame after the reset button is released. It only impacts a few early game releases like Donkey Kong, which won't start properly with the 1-frame reset, and RNG generation for speedrunners/TAS, really.

Extra sprites don't have anything to do with an overclocked CPU. The PPU queries the cart hardware twice instead of once per fetch cycle, which lets it get data for up to 8 additional sprites. Sometimes this won't work if the mapper is sensitive to this, but sometimes it will if it's a stateless mapper or it doesn't worry about counting address lines in that way.

Dejittering is the practice of taking the uneven frame length (the NES draws one fewer pixel every other frame when rendering) and pausing the clocks for one cycle to make the frames even in length. Modern televisions HATE uneven frame lengths and often won't work at all with them. There are hardware mods for original hardware to address this on the NES. The PPU can't do this by itself, as it requires the CPU to pause as well, but it can cooperate or make it easier.
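The uneven frame length described above can be put into numbers. This is a small Python sketch of the arithmetic only, assuming the commonly documented NTSC figures (341 dots per scanline, 262 scanlines, PPU clock equal to the master clock divided by 4, and the short frame falling on odd frames). The function name `frame_dots` and the `dejitter` flag are illustrative, not anything from ppu.sv.

```python
DOTS_PER_LINE = 341          # assumed NTSC PPU dots per scanline
LINES_PER_FRAME = 262        # assumed NTSC scanlines per frame
PPU_CLOCK_HZ = 21_477_270 / 4


def frame_dots(frame_index, rendering=True, dejitter=False):
    """Length of one frame in PPU dot times.

    With rendering enabled, every other frame is one dot short.
    Dejittering pauses the clocks for one cycle on that frame, so in
    wall-clock terms every frame becomes the same length again.
    """
    full = DOTS_PER_LINE * LINES_PER_FRAME
    is_short = rendering and frame_index % 2 == 1 and not dejitter
    return full - 1 if is_short else full


def average_frame_rate(dejitter=False):
    # Average over one even/odd frame pair.
    pair = frame_dots(0, dejitter=dejitter) + frame_dots(1, dejitter=dejitter)
    return PPU_CLOCK_HZ / (pair / 2)


print(frame_dots(0), frame_dots(1), frame_dots(1, dejitter=True))
print(round(average_frame_rate(), 4), round(average_frame_rate(dejitter=True), 4))
```

The one-dot difference is tiny, but it changes the frame period from one frame to the next, and that alternation is what some displays reject.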

loglow commented 1 year ago

@Kitrinx, I appreciate the explanations. Thanks!

> Generally, to figure out the size you have to compile it. You could compile it in Quartus and check there, but as I mentioned, it's not going to be a representation of what the ICE board needs.

I'll set up an ICE40 toolchain in order to synthesize the current ppu.sv so I can see how many logic cells it would need. This will tell us if the ICE40 is a feasible chip family or not. Is it safe to assume that a standalone modification would require roughly the same amount of FPGA logic as the current one?

> TinyNES, why bother? I don't even think that uses a real CPU, does it? So it has the bad pulse channels and all that going on; probably not worth spending money on from my PoV. Using MiSTer would get you more accuracy than that thing.

The TinyNES has two 40-pin DIP sockets for the CPU and PPU. Most units ship with genuine 2A03 and 2C02 chips in the sockets. Clone chips are a cheaper option for folks too. Any chips that want either a 21.477270 MHz (NTSC) or 26.601712 MHz (PAL) master clock are supported (which is most of them) since both clock sources are available. All the RGB PPUs are supported too, including 2C05s, and all the hardware is open-source. I'm the creator of the TinyNES btw ;)
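As a quick illustration of what those two master clocks turn into, here is a Python sketch that derives the CPU and PPU clocks. The divider values (NTSC: CPU /12, PPU /4; PAL: CPU /16, PPU /5) are the commonly documented ones rather than something stated in this thread, and `derived_clocks` is a hypothetical helper name.

```python
MASTER_HZ = {"NTSC": 21_477_270, "PAL": 26_601_712}

# Assumed master-clock dividers for the standard chips.
DIVIDERS = {
    "NTSC": {"cpu": 12, "ppu": 4},
    "PAL": {"cpu": 16, "ppu": 5},
}


def derived_clocks(region):
    """Return CPU and PPU clock rates in Hz plus PPU dots per CPU cycle."""
    master = MASTER_HZ[region]
    div = DIVIDERS[region]
    return {
        "cpu": master / div["cpu"],
        "ppu": master / div["ppu"],
        "dots_per_cpu_cycle": div["cpu"] / div["ppu"],
    }


for region in MASTER_HZ:
    clocks = derived_clocks(region)
    print(region, clocks["cpu"], clocks["ppu"], clocks["dots_per_cpu_cycle"])
```

An FPGA replacement would typically take the master clock on the same pin as the original chip and generate the dot clock internally in the same ratio.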

> Upgrade kits to existing NES might be nice. Desoldering the PPU is quite a drag.

I agree that inexpensive RGB upgrades could be awesome. Desoldering these chips is actually very easy and non-destructive if the right tools are used. It takes about two minutes to remove a PPU with a Hakko FR-301.

> Regarding the various options: Famicom and NES PPU behavior differs on reset. On Famicoms, when you press the reset button, the PPU doesn't reset; on the NES it does. Additionally, later PPUs hold reset for one full frame after the reset button is released. It only impacts a few early game releases like Donkey Kong, which won't start properly with the 1-frame reset, and RNG generation for speedrunners/TAS, really.

Ah, okay, now I know what you mean. If a PPU would only be more compatible without the extra 1 frame on reset, then my vote would be to omit that particular behavior. Is there any compelling reason not to?

> Extra sprites don't have anything to do with an overclocked CPU. The PPU queries the cart hardware twice instead of once per fetch cycle, which lets it get data for up to 8 additional sprites. Sometimes this won't work if the mapper is sensitive to this, but sometimes it will if it's a stateless mapper or it doesn't worry about counting address lines in that way.

I understand now. Extra sprite querying should certainly not be the default behavior. I could see it being an option, but I don't think it's an important one especially since no original systems do it. I'd be inclined to omit it for that reason.

> Dejittering is the practice of taking the uneven frame length (the NES draws one fewer pixel every other frame when rendering) and pausing the clocks for one cycle to make the frames even in length. Modern televisions HATE uneven frame lengths and often won't work at all with them. There are hardware mods for original hardware to address this on the NES. The PPU can't do this by itself, as it requires the CPU to pause as well, but it can cooperate or make it easier.

What changes to the PPU would facilitate this, and would they introduce compatibility issues or be especially complicated? If not, then I see no reason why it shouldn't cooperate in this respect.

I've begun sketching out the hardware, and I'll continue with fine layout and routing once we've determined if the ICE40 family is a valid FPGA target or not.

Kitrinx commented 1 year ago

If TinyNES uses a real 2A03 it's probably okay then and a worthy goal to add some modern usability perks to it.

To judge the size you'd want to compile it with the toolchain, as you mentioned. I've never used the ICE one, but usually at the end they give you a report on the size of things and the fit in some way.

You'd want to use this code here: https://github.com/Kitrinx/NES_MiSTer/blob/ppu3/rtl/ppu.sv

It's my branch where I refactored the PPU using a lot of things from Visual 2C02 to make it work much more closely to the timing and asynchronous behavior of the real chip, and it shouldn't be too far from working in a real-hardware scenario, and also much more efficient size-wise.

The extra stuff is something one can hammer out later with dip switches or something.

loglow commented 1 year ago

Hey @Kitrinx,

I'm having some trouble synthesizing ppu.sv as I'm getting a number of errors. I've omitted the full file paths for brevity, and I've also pruned similar/repeated error messages. Note that the last message is only a warning.

@E: CG342 :"ppu.sv":699:1:699:6|Expecting target variable, found oam_db -- possible misspelling

@E: CG342 :"ppu.sv":1047:13:1047:20|Expecting target variable, found temp_y_l -- possible misspelling

@E: CG425 :"ppu.sv":1372:1:1372:9|Assignment target new_color must be of type reg, genvar, or logic
(this error is repeated 12 more times, for each use of new_color)

@E: CS180 :"ppu.sv":1592:2:1592:6|Assignment target w2000 must be a register or integer
(this error is repeated 9 more times for w2001, r2002, w2003, w2004, r2004, w2005, w2006, r2007, w2007)

@E: CG425 :"ppu.sv":1594:22:1594:26|Assignment target w2000 must be of type reg, genvar, or logic
(this error is repeated 9 more times for w2001, r2002, w2003, w2004, r2004, w2005, w2006, r2007, w2007)

@E: CS180 :"ppu.sv":1604:12:1604:16|Assignment target w2000 must be a register or integer
(this error is repeated 9 more times for w2001, r2002, w2003, w2004, r2004, w2005, w2006, r2007, w2007)

@W: CG1249 :"ppu.sv":1872:5:1872:17|Redeclaration of implicit signal load_pattern2

This is with Lattice iCEcube2 (2020.12.27943) & Synplify Pro (L-2016.09L+ice40).

Any idea what I'm doing wrong here?
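Since the log above was pruned by hand, a short script can do the same grouping automatically. This is a Python sketch that tallies messages by severity and code, assuming the line format shown in the excerpts. Reading the four numbers after the file name as start line, start column, end line, end column is a guess from those excerpts, and `summarize` is a hypothetical helper.

```python
import re
from collections import Counter

# Matches lines such as:
# @E: CG342 :"ppu.sv":699:1:699:6|Expecting target variable, ...
LOG_LINE = re.compile(
    r'^@(?P<severity>[EW]): (?P<code>\w+) :"(?P<file>[^"]+)":'
    r'(?P<line>\d+):(?P<col>\d+):\d+:\d+\|(?P<message>.*)$'
)


def summarize(log_text):
    """Count messages per (severity, code) and remember the first of each."""
    counts = Counter()
    first_seen = {}
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if not match:
            continue  # skip anything that is not a located message
        key = (match["severity"], match["code"])
        counts[key] += 1
        first_seen.setdefault(key, (int(match["line"]), match["message"]))
    return counts, first_seen


sample = """\
@E: CG342 :"ppu.sv":699:1:699:6|Expecting target variable, found oam_db -- possible misspelling
@E: CG342 :"ppu.sv":1047:13:1047:20|Expecting target variable, found temp_y_l -- possible misspelling
@E: CG425 :"ppu.sv":1372:1:1372:9|Assignment target new_color must be of type reg, genvar, or logic
@W: CG1249 :"ppu.sv":1872:5:1872:17|Redeclaration of implicit signal load_pattern2
"""

counts, first_seen = summarize(sample)
for (severity, code), n in sorted(counts.items()):
    line, message = first_seen[(severity, code)]
    print(f"{severity} {code}: {n}x, first at line {line}: {message}")
```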

Kitrinx commented 1 year ago

I cannot tell from those comments as the line numbers don't seem to line up with the code in my ppu3 branch, but if I had to guess I'd say it was some aspect of systemverilog that the toolchain didn't like.

loglow commented 1 year ago

@Kitrinx,

Maybe I'm missing something, but I'm not seeing any mismatch in line numbers? (Screenshots from GitHub)

@E: CG342 :"ppu.sv":699:1:699:6|Expecting target variable, found oam_db -- possible misspelling

[screenshot: s1]

@E: CG342 :"ppu.sv":1047:13:1047:20|Expecting target variable, found temp_y_l -- possible misspelling

[screenshot: s2]

@E: CG425 :"ppu.sv":1372:1:1372:9|Assignment target new_color must be of type reg, genvar, or logic

[screenshot: s3]

@E: CS180 :"ppu.sv":1592:2:1592:6|Assignment target w2000 must be a register or integer

[screenshot: s4]

loglow commented 1 year ago

As a sanity check, I was able to synthesize https://github.com/strigeus/fpganes/blob/master/src/ppu.v without issue.

So that's good! 😀

Here's the resource report:

Resource Usage Report for PPU 
Mapping to part: ice40lp8kcm121
Cell usage:
GND             15 uses
SB_CARRY        131 uses
SB_DFF          39 uses
SB_DFFE         776 uses
SB_DFFSR        25 uses
SB_DFFSS        1 use
SB_GB           2 uses
SB_RAM256x16    1 use
VCC             15 uses
SB_LUT4         1312 uses
I/O ports: 101
I/O primitives: 101
SB_GB_IO       1 use
SB_IO          100 uses
I/O Register bits:                  0
Register bits not including I/Os:   841 (10%)
RAM/ROM usage summary
Block Rams : 1 of 32 (3%)
Total load per clock:
   PPU|clk: 1
@S |Mapping Summary:
Total  LUTs: 1312 (17%)
Distribution of All Consumed LUTs = LUT4 
Distribution of All Consumed Luts 1312 = 1312 

So, it looks like that particular code wouldn't quite fit on an ICE40LP1K, but it would fit easily on an ICE40LP8K.
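The fit conclusion above can be checked with a couple of lines of arithmetic. This Python sketch compares the reported LUT and register counts against device capacity, assuming the datasheet logic-cell counts of 1,280 for the iCE40LP1K and 7,680 for the iCE40LP8K, with each logic cell providing one LUT4 and one flip-flop. The helper name `utilization` is illustrative.

```python
# Assumed logic-cell capacities (one LUT4 plus one flip-flop per cell).
LOGIC_CELLS = {"iCE40LP1K": 1280, "iCE40LP8K": 7680}

# Figures taken from the resource report above.
LUTS_USED = 1312
REGISTER_BITS = 841


def utilization(used, device):
    """Fraction of the device's logic cells consumed by `used` resources."""
    return used / LOGIC_CELLS[device]


for device in LOGIC_CELLS:
    luts = utilization(LUTS_USED, device)
    regs = utilization(REGISTER_BITS, device)
    verdict = "fits" if max(luts, regs) <= 1.0 else "does not fit"
    print(f"{device}: LUTs {luts:.1%}, registers {regs:.1%} -> {verdict}")
```

This only checks raw capacity. Place-and-route can fail well below 100% utilization, and the 101 I/O ports in the report reflect the module's port list rather than the pins a 40-pin replacement would actually need.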

loglow commented 1 year ago

I hope I didn't scare you off @Kitrinx ! Happy to provide anything I can to help facilitate this :)

Kitrinx commented 1 year ago

Sorry, I have not been scared off, just occupied elsewhere. I will take a look for you.

loglow commented 1 year ago

No problem @Kitrinx, I understand. Thanks for the note.

I spent some more time working on getting your files to synthesize, and I had some success. My goal for the moment was to change absolutely as little as possible in order to get things to work.

Here's what I did:

Now the error that I get is: @E: CH100 : | Encountered multiple top-level candidates in design; compilation stopped.

I'm not sure how to proceed with that. Any ideas?

As an aside, I also separately spent some time going through ppu.sv and removing everything related to a) save states and b) extra sprites. I was actually able to synthesize the result at some point, but I didn't know if doing this editing would be helpful or useful to you or not. In any case, it would be reasonably easy for me to do that again. Let me know if that's something that would be helpful, and I can provide that.

Hope all is well!

loglow commented 1 year ago

Just a bit of further clarification.

After some more fiddling, I was able to get the ppu.sv in this repo to synthesize, but not your ppu.sv file. The difference appears to be your use of dpram, which in turn depends on altera_mf and altera_mf_components. I did find copies of these two files elsewhere, but I'm still having trouble getting Synplify to recognize them as libraries. I'm certainly in over my head.

Anyway, that's where I'm at for the moment.

Kitrinx commented 1 year ago

I can help you more with it soon. I'm finishing some work to fix some input devices for the Atari 7800 that have been an issue for a while, then I will put my full attention on the NES PPU.

Kitrinx commented 1 year ago

If you'd like to work more directly, I'm also available on the MiSTer FPGA Discord.

cowrevenge commented 1 year ago

Hi, just reading through this.

What's stopping the idea of NES to HDMI with an FPGA? I know one existed, but it's long out of stock.

    library IEEE;
    use IEEE.STD_LOGIC_1164.ALL;

    entity NES_to_HDMI is
        Port (
            NES_video : in  STD_LOGIC_VECTOR(7 downto 0);
            hdmi_out  : out STD_LOGIC_VECTOR(31 downto 0)
        );
    end NES_to_HDMI;

    architecture Behavioral of NES_to_HDMI is
    begin
        hdmi_out <= NES_video & NES_video & NES_video & NES_video;
    end Behavioral;

MP2E commented 1 month ago

Hi @loglow, sorry to bump an old issue. There is a new standalone PPU implementation for FPGA that can be found here:

https://github.com/andkorzh/RP2C02-7-

Check the YouTube video in the readme, it shows it working in place of a real PPU on a Famicom.

I wonder if this would be of any use to the NES MiSTer core?

loglow commented 1 month ago

> Hi @loglow, sorry to bump an old issue. There is a new standalone PPU implementation for FPGA ...

Thanks. Looks awesome. I appreciate the bump @MP2E !

Kitrinx commented 1 month ago

While outwardly it might seem like using a direct netlist would be a good idea, in practice I believe it would not. It doesn't allow for clock pausing, and the code is almost entirely unworkable by a human. What this means in practice is that things like extra sprites, save states, frame size evening to keep modern TVs from freaking out, some of the analog RAM decay emulation, etc. would all be nearly impossible using this implementation, and at this moment I don't see any direct improvements that would come from it.

MP2E commented 1 month ago

That makes sense, thanks for explaining