elect-gombe / esp32_mmd

MMD on the ESP32

Is there an option for a DMA driven QuadSPI or 8-bit parallel solution for better framerate? #1

Closed RetroZvoc closed 5 years ago

RetroZvoc commented 5 years ago

Hello. I'm considering using a DMA-driven QuadSPI communication with an LCD display or a video driver at 80MHz or an 8-bit parallel bus on the sequentially arranged GPIO pins so that the performance doesn't suffer and so that a very powerful 3D game engine akin to Spyro 2 for PS1 or Driver for PS1 could be made.

This is my idea: http://www.vsdsp-forum.com/phpbb/viewtopic.php?f=14&t=2353&p=12464#p12449

The display in question would be a parallel ILI9341 and the video driver in question would be VS23S040. If the 80MHz is too fast for QuadSPI or if QuadSPI isn't available for that display and that video driver, I would like to consider TinyFPGA that converts QuadSPI communication to 8-bit parallel communication.

Could you try one of those approaches and tell me what the maximum framerate would be? Best regards, (Codename) RetroZvoc a.k.a. Zvoc47

elect-gombe commented 5 years ago

Hi, @Zvoc47 Here's my analysis of my 3D rendering engine. Current framerate (2560 triangles, 320x240) is 24fps.

transfer cost

I know there is a performance issue caused by communication with the LCD, but the QuadSPI module is used for the flash and cannot be shared. While communicating over the QuadSPI bus, the chip cannot read any data from the QSPI flash, which stores the program code and texture images, so rendering might take even more time. Before doing any optimization, the data rate must be considered. The current communication method is SPI. The SPI clock is around 40MHz (10MHz when used for both input and output, 26MHz for output only; 40MHz is an overclock, but it still works in my experiments and in other environments), so the bit rate is about 40Mbit/s. The screen is 320x240 at 16 bits per pixel, which is 32.6fps in theory. But because rendering and computing (3D vertex transforms, IK, FK, etc.) take longer than transferring the rendered image, the actual frame rate is around 24fps. This is the SCLK signal during one frame; you can see some transmissions are delayed while rendering is busy. image

If you change only the communication method, you won't gain as much fps as you might think, because the current program already transfers rendered images to the LCD independently through DMA-driven SPI while drawing the next partial image. So if you want to optimize the transfer method, the rendering method should be optimized too. Anyway, one idea to accelerate the communication is to change from SPI to a parallel bus through I2S. I2S in 8-bit bus mode with DMA gives the best bit rate for talking to the LCD: the clock is about 10MHz, i.e. 80Mbit/s (320x240, 16-bit color, 60fps in theory). QSPI is the fastest peripheral, but it is already used to connect the storage and PSRAM.
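The bandwidth arithmetic above can be checked with a quick back-of-the-envelope calculation. A minimal sketch using the figures from this thread; `peak_fps` is a hypothetical helper, and real throughput is lower once command overhead and rendering stalls are included (which is why the thread rounds the I2S case to 60fps):

```cpp
// Theoretical peak frame rate for a 320x240, 16bpp framebuffer pushed
// over a bus, ignoring command overhead and rendering stalls.
double peak_fps(double bus_bits_per_sec) {
    const double bits_per_frame = 320.0 * 240.0 * 16.0; // 1,228,800 bits
    return bus_bits_per_sec / bits_per_frame;
}
// peak_fps(40e6)     -> ~32.6 fps  (40MHz SPI, 1 bit per clock)
// peak_fps(10e6 * 8) -> ~65.1 fps  (~10MHz I2S LCD mode, 8 bits per clock)
```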

other optimizations

Here's my idea.

Saving rendering time is not easy; I have tried many times to reduce it. Anyway, this model is made of 2560 polygons, and reducing the polygon count is a good way to improve the frame rate. Currently the polygon-rendering specs are below those of the NDS, and the PS1 and other platforms are much more powerful than the esp32 with its generic CPU.

conclusion

Oh, you are a student? I'm a student at a Japanese college too :+1:

RetroZvoc commented 5 years ago

This is quite a lot of information! I'm amazed with how much effort you have put into this.

Now, when I was considering the QuadSPI functionality, I was reading about it at this link: https://docs.espressif.com/projects/esp-idf/en/latest/api-reference/peripherals/spi_master.html

The link said that ESP32 has 4 SPI masters. The first two are used for the Flash and RAM respectively. The HSPI and VSPI can be used for what I was talking about. HSPI master could be used by a DMA channel to send 4 bits per cycle (which is what I called QuadSPI) at 80MHz to an FPGA chip that would convert that signal into a parallel 8-bit signal for the ILI9341 display. The VSPI could also communicate with the FPGA on another bus same way, but that would be used to send matrix data for the FPGA to quickly calculate. That way, when you have 3D data for processing, you just set all that data into a buffer and then at the same time receive the results into another buffer. During that time, both CPU cores can do other things such as serving the interrupt routines for the Bluetooth speakers, Bluetooth joysticks, WiFi stuff, sound/music engine (XM, MOD, etc.).

My way of rendering the polygons was to have 3D bezier curves projected into 2D bezier curves along with their shading coordinates, points of inflection, etc., all by using the FPGA's matrix processing power. Then those 2D bezier curves would be analyzed scanline by scanline to see where they start and end and which ones are visible. That data would be sent to a renderer task, which would then render, scanline by scanline, all of those 2D bezier curves along with their colors, textures, and shadings, all precalculated by the FPGA. All of that would go into the scanline buffers.

If you look closely at a shaded ball or a shaded polygon by observing its shades along a horizontal line, you'll see it looks like a quadratic or cubic function curve. Each scanline of a polygon would be rendered using a quadratic/cubic equation giving a shading curve (y=ax^2+bx+c, where a, b, c are the scanline's parameters, x is the current x pixel, and y is the output). The FPGA would calculate those a, b, c parameters, and the shading curve output (y) would select which shade is used. So if you're rendering an orange ball, you'd have color16_t orange_ball[256] holding 256 different shades of the ball and you'd just copy the result into the buffer:

```c
*(renderer_buffer) = orange_ball[a*x*x + b*x + c];
renderer_buffer++;
x++;
```
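A self-contained sketch of that scanline fill, with hypothetical names (`fill_scanline`, `shades`) and one addition: range clamping, which the one-liner above omits and which would otherwise read out of bounds for coefficients the FPGA didn't validate:

```cpp
#include <cstddef>
#include <cstdint>

typedef uint16_t color16_t; // RGB565 pixel

// Fill one scanline using the quadratic shading curve s = a*x^2 + b*x + c
// as an index into a 256-entry shade table. Clamping is an addition not
// present in the thread's one-liner.
void fill_scanline(color16_t *dst, size_t n,
                   float a, float b, float c,
                   const color16_t shades[256]) {
    for (size_t x = 0; x < n; ++x) {
        int idx = (int)(a * x * x + b * x + c);
        if (idx < 0)   idx = 0;
        if (idx > 255) idx = 255;
        dst[x] = shades[idx];
    }
}
```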

There could also be a possibility of a just-in-time compilation of these loops by the FPGA so that the ESP32 doesn't have to do these calculations, but that it just executes the code that the FPGA spits out via the QuadSPI communication.

And if you look at many cartoony or Anime characters, you'll see that many shapes repeat. You'll see cylinders and cones that could be used for arms and fingers. All of those primitive 3D shapes are actually curves which don't have to end up as ugly triangles; they could be turned into nice 2D bezier curves. All of these bezier curves can be sorted into an array for the renderer task to just render. In the same way, the renderer task could reject the curves that aren't visible to prevent redundant rendering. And when generating a level's 3D perspective, only the polygons that are seen should be rendered. This, along with a level-of-detail feature, is how Spyro 2 renders huge levels from any XYZ position and direction vector.

Now, you mentioned I2S. What is this I2S anyway? I thought I2S was the audio circuit for the headphones and the microphone. How could that possibly be used? The only use I had seen was for PAL TV output, which bitluni explained in his video. That would be the cheapest solution, but I'd need a good chip to convert that analog signal into an LCD signal so that my console can be portable.

I'm not sure if you get the idea, but this has been my plan for quite a while; I just couldn't find a good display for it, and I'm broke, so I need to be careful about where I spend my money. Just saying in advance: I'm planning to use this engine for my device so that I can go to Kickstarter, get money to meet developers in real life, and build my retro game console entrepreneurship properly. Since you're very advanced in this stuff, I would love it if you could join me.

And yes, I'm a student. I study Informatics at Zagreb University of Applied Sciences with an Electronic Business direction. I chose that because it was closest to game development which is what I want to do. I'd love to collaborate with you. :) What do you say?

elect-gombe commented 5 years ago

Hi @Zvoc47

QSPI

I didn't realize that the esp32 has two or more SPI peripherals with a QSPI mode. Maybe we could use QSPI as a 4-bit DAC and output a color NTSC or PAL signal, but that is difficult and the memory footprint would probably be high. QSPI communication latency is also long: an FPGA is good at matrix calculation, but the transfer cost, latency, and overhead are too high for realtime rendering. It might be a good idea to use the FPGA as a GPU, but the task-division policy is very important.

I2S

I2S is indeed a sound interface, but the esp32's I2S peripheral has another mode, LCD mode, that can be used to drive an LCD. LCD mode has many functions such as camera input, ADC, DAC, and LCD output, including an 8-bit parallel mode. Here is an LCD example: https://www.esp32.com/viewtopic.php?t=1743 (twitter video). It's super smooth, isn't it?

3D techniques

Modern 3D graphics need texture mapping, and a shading table cannot be used there because textures have a lot of colors, about 60000; the table would become too huge. I don't really know how to extend 2D curve techniques to 3D, but hidden-surface removal is one of the keys in 3D CG, so you must think about how to remove hidden surfaces. Polygons with linear interpolation are the most frequently used technique. Even Blender, which supports ray tracing, does not fully support curves; it draws only polygons. 3D curves are converted into polygons, and low-polygon models often have subdivision-surface techniques applied; this is commonly used even in film. See Bézier surfaces in computer graphics.

collaborate

A PS1 emulation project...?

The PS1 featured GPU-accelerated graphics; there is not enough power in the esp32 or a cheap FPGA for that. Can you tell me the details of your project? Which tasks run on the FPGA, and which on the esp32?

RetroZvoc commented 5 years ago

In the VSDSP forum, I mentioned VLSI Solution's VS23S040 video chip, which can be used as a display/graphics driver. There's also another chip that goes with it whose job is to convert analog RGB or composite video (I'm not sure which) into an LCD signal. That way we could have two different outputs: TV and LCD.

But now that you mention the I2S interface, I'm surprised. I thought it could only be used for two DACs generating NTSC and PAL like bitluni did here: https://www.youtube.com/watch?v=-JXuwwXQh8c . Now you've given me new ideas! Too bad I left an 8-bit parallel LCD at home; I had it with my NODEMCU-ESP32S board. I had it connected and was trying to write my own driver, but I rage-quit because so many things failed to work. The display in question is this one: https://www.chipoteka.hr/artikl/132080/zaslon-za-arduino-28-touch-za-arduino-unomega-8090229161 (the page is in my language, sorry), and I'm not sure which driver it uses. It came with some MCUFRIEND_mkv library that was full of lazy bodges to autodetect what display it is, and it doesn't even contain a working Arduino Uno example for the parallel interface! Also, the touch panel pins are shared with the LCD data pins, which I realized by accident while testing the display. Could you please identify this display for me and link me a Git project for it? How would we go about this with the I2S interface?

Regarding the 3D techniques and the texture mapping, I was thinking that the ESP32 would fill a DMA buffer with input variables for matrix calculations and other functions and send it to the FPGA chip while simultaneously receiving the results into another DMA buffer. That way there's no need to calculate things manually. I think that's how the Sega Saturn does it: https://www.youtube.com/watch?v=n8plen8cLro This video explains a DSP doing 3D operations. We could put the 3D animation data, for example this MikuMikuDance character, into the FPGA and then receive the output coordinates to render at. If the FPGA could take a 4-bit 80MHz communication, and every number was a 24-bit fixed-point fraction, a number could effectively be sent at 13.33MHz. Divide that by the number of body parts, body-part polygons, and bending coordinates, then divide by 60, and you get the number of transformations that can be processed at 60FPS.
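That 13.33MHz figure checks out arithmetically (80MHz × 4 bits = 320Mbit/s, divided by 24 bits per number). A minimal sketch of the link budget; the helper names are illustrative, not from any real API:

```cpp
// Fixed-point numbers per second over the proposed link (4-bit bus at
// 80MHz, 24-bit numbers), and the resulting per-frame budget at 60fps.
double numbers_per_second(double clock_hz, int bus_bits, int number_bits) {
    return clock_hz * bus_bits / number_bits; // 80e6*4/24 ≈ 13.33 million/s
}
double numbers_per_frame(double clock_hz, int bus_bits, int number_bits) {
    return numbers_per_second(clock_hz, bus_bits, number_bits) / 60.0;
}
```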

Now, regarding the collaboration: I wasn't thinking about emulating the PS1, but about making a console that's as graphically fancy as a PS1, GBA, and SNES, but with the computer-ish capabilities of those good old Sony Ericsson feature phones with VideoDJ, J2ME apps, Opera Mini, MiniCommander, PaintCAD, VibeJive, eBuddy, etc. Basically a hybrid of a game console, multimedia player, mobile phone, and pocket computer.

There are many things to do with the ESP32. It could be possible to have two 16kB RAM banks for multitasking: one would be used for the active process while the other is saved to the serial RAM as the new one loads. That way one bank is being copied at 80MHz while the other is executing, which makes it possible to have processes like a MIDI player, 3D engine, game engine, a VM for game scripts, etc. running simultaneously. I just hope that won't halt the CPU. If it does, I'm willing to use the serial RAM manually, as long as it can write bytes to one page while reading bytes from another, and as long as the ESP32 can write them out of a 16kB page while reading into the same 16kB page. That means approximately 2440 processes could be transferred in one second, which is 40 per frame at 60FPS. If we have two such 16kB RAM bank pairs, one per ESP32 CPU core, we could have a pipelined game-engine-ish OS running in the console, with each core processing 20 different processes per frame. Such processes could be the GUI, 3D rendering, 2D rendering, game physics, game engine, VMs, game scripts, a MIDI/MOD/XM tracker, etc., while the rest of the ESP32's internal RAM holds graphics, palettes, music tracker patterns, MIDI commands, character animations, and so on.
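The "approximately 2440" figure works out if the serial RAM link also moves 4 bits per clock at 80MHz (320Mbit/s). A quick check, with an illustrative helper name:

```cpp
// 16KB process banks per second over a quad (4-bit) 80MHz serial-RAM
// link: bits/s divided by bits per bank.
double banks_per_second(double clock_hz, int bus_bits, int bank_bytes) {
    return clock_hz * bus_bits / (bank_bytes * 8.0);
}
// banks_per_second(80e6, 4, 16 * 1024) -> ~2441 banks/s,
// i.e. ~40 per frame at 60fps, matching the figure quoted above.
```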

As FreeRTOS allows for real-time operations, we could calculate the timing precisely and arrange when each piece of data is sent and when each process runs. For example, a character is walking because the game's code set the walking animation. The walking animation and the 3D coordinates are sent over DMA to the FPGA, which calculates them while the ESP32 handles a different task, such as stepping an XM music module to play a little more music; then the 3D rendering engine is loaded from serial RAM just as the FPGA sends back the latest calculations for it to process. The 3D engine then tells the FPGA to project those coordinates onto a 2D plane and to tell the upcoming 2D engine the curvatures and other parameters for texture interpolation and such. Then another DMA > FPGA > DMA round runs while a network manager is loaded to sync WiFi and Bluetooth gaming. When that round finishes, the new results from the FPGA are ready for the freshly loaded 2D rendering engine to order all polygons and GFX elements into a frame structure before the real-time scanline buffer engine renders the next frame.

As all of that would be too complicated for regular people, we'd need a VM, interpreter, and JIT like MicroPython so that gamers and programmers can make their own games and apps for the console. I imagine many such processes being very lightweight, because many of the standard functions (strings, dates, etc.) would already be in the ESP32's firmware and always loaded in RAM without bank-switching. That way the system-level stuff would be left to advanced devs while beginners make their MicroPython games.

elect-gombe commented 5 years ago

@Zvoc47

vertex calculation and other tasks

Actually, vertex calculation is not heavy. 2D tasks such as rasterisation are much heavier than vertex calculation; that is why many old GPUs contain more pixel shaders than vertex shaders. Rasterisation is not a lightweight task: projection actually uses less than 10% of CPU resources, and the rest goes to rendering and IK/FK.

spi flash and cache

By the way, the esp32 maps its external flash into a 4MB address range with a 16KB cache, so no effort is needed to manage transfers efficiently. There is no bank system in the esp32, because it has a 32-bit address space (about 4GB).

I'm wondering how much such DSP modules can accelerate 3D tasks, including rasterisation. PSRAM cannot be used through DMA, so internal RAM is quite important; for example, the PAL video output demo uses 150KB of internal RAM. MicroPython ports are good only for 2D applications. In 3D rendering, projected coordinates come from 3D coordinates, or even 4D if we use a texture map, and constructing the world with no delay is a very heavy task. ESP-IDF or Arduino for the esp32 gives good performance for gaming.

RetroZvoc commented 5 years ago

I think that that's why 2D rasterization needs to be planned out.

Speaking of which, I haven't been able to try my ESP32 with a screen recently, so I'm sorry for not responding in a while. I'm wondering how this I2S buffer functions. How does the DMA work? How does the ESP32 know which pins of the display map to which pins on the ESP32 GPIO matrix? How does it know where the data pins are and where the strobe is? Also, the I2S driver included in the link you sent isn't for the ILI9341 display but for some other display; what do I need to change to make it work with mine? With my last wiring, I had the display on pins P12-P19 for Data0-Data7, rst=p32, cs=p4, rs=p0, wr=p2, rd=p35. I wired the data bus to a consecutive set of pins which aren't critical for other functions, so that it's possible to use the Write1ToSet and Write1ToClear IO registers and bit-shifting for faster writing and strobing. What approach should I take? And could I ask you for the favor of a piece of code that would make this work with my wiring and display?
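For the consecutive-pin wiring described above, the Write1ToSet/Write1ToClear trick boils down to computing two masks per byte. A hedged sketch: the register writes themselves are hardware-specific (on the esp32 they would go to GPIO_OUT_W1TS_REG and GPIO_OUT_W1TC_REG before pulsing WR), so only the mask arithmetic is shown, and the names are illustrative:

```c
#include <stdint.h>

/* 8-bit data bus wired to consecutive pins GPIO12..GPIO19. For each
 * byte, compute the mask of data lines to drive high (write to the
 * Write1ToSet register) and the mask to drive low (Write1ToClear). */
#define DATA_SHIFT 12u
#define DATA_MASK  (0xFFu << DATA_SHIFT)

void byte_to_masks(uint8_t value, uint32_t *set, uint32_t *clr) {
    uint32_t bits = (uint32_t)value << DATA_SHIFT;
    *set = bits;                /* data lines to drive high */
    *clr = (~bits) & DATA_MASK; /* data lines to drive low  */
}
```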

So if the RTOS's caching is automated, I guess I don't have to worry about that. And speaking of DSPs, I was disappointed to find out that the DSP core can't be programmed on the fly, but is preprogrammed at the factory, meaning there's no microcode programmability. What we could do instead is write a JIT compiler that dynamically compiles unrolled loops which take fewer CPU cycles to rasterize a 2D-ified 3D polygon/bezier curve.

The 2D world would be buffered from the PSRAM like how PS1's Driver games have it working in case it's an open world. In the case of a closed world, the Spyro 2 approach could be taken.

elect-gombe commented 5 years ago

@Zvoc47

information about the LCD

esp32 reference manual

LCD with parallel bus

another lcd example ↑, I2S parallel bus mode code is here

ILI9341 datasheet

These are an LCD example and the reference manual of the esp32. I don't have a parallel LCD, so I can't test one with the I2S peripheral. Each LCD has its own initialization sequence; take a look at the datasheet. Some LCDs have their own connection modes such as a 16-bit or 8-bit parallel bus, SPI, etc. I don't know which LCD you are using, or whether it is really in 8-bit parallel mode. You should check your connections, or give me more information about the LCD you are using.

performance

Loop unrolling is not very effective for complex tasks such as rasterization. The esp32 CPU architecture is weak; even at 240MHz, many operations take multiple cycles, especially floating-point calculations.

If you are thinking of using an FPGA, with many LUTs and DSPs, a 3D renderer can be implemented with a 4-, 8-, or 16-line scan-line method. This esp32-mmd project adopts that method in its CPU implementation, for lower memory use and better performance, and the algorithm is easy to implement on an FPGA too. The rendered image (a few lines at a time) is transferred to the esp32 or directly to the LCD. Or you could use a more powerful microcontroller, such as an STM32F730 or a Raspberry Pi Zero, which is powerful enough to compute this kind of 3D CG. What do you think?
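The band-by-band scan-line method can be sketched host-side like this. All names are illustrative, not from the esp32_mmd source; on the real device the `transfer` callback would be a DMA push of the finished band to the LCD while the next band is rasterized:

```cpp
#include <cstdint>
#include <vector>

// Render a w x h screen in horizontal bands of band_h lines so only
// band_h * w pixels of buffer are needed, handing each finished band
// to a transfer step.
struct BandRenderer {
    int width, height, band_h;
    std::vector<uint16_t> band; // band_h * width pixels

    BandRenderer(int w, int h, int bh)
        : width(w), height(h), band_h(bh), band(bh * w) {}

    template <typename Shade, typename Transfer>
    void render(Shade shade, Transfer transfer) {
        for (int y0 = 0; y0 < height; y0 += band_h) {
            for (int dy = 0; dy < band_h; ++dy)     // rasterize one band
                for (int x = 0; x < width; ++x)
                    band[dy * width + x] = shade(x, y0 + dy);
            transfer(y0, band.data());              // e.g. DMA to LCD
        }
    }
};
```

For a 320x240 screen with 8-line bands this needs only 320 * 8 * 2 = 5120 bytes of pixel buffer instead of a 150KB full framebuffer, which is the point of the method.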

elect-gombe commented 5 years ago

@Zvoc47 Feel free to reopen if you have further questions.