Open rickyzhang82 opened 6 years ago
@OtherCrashOverride It seems that you disable double buffering inside nofredno. Why?
#if 0
back_buffer = bmp_create(width, height, 0); /* no overdraw */
if (NULL == back_buffer)
{
bmp_destroy(&primary_buffer);
return -1;
}
bmp_clear(back_buffer, GUI_BLACK);
#endif
I will restore it and use swap instead of memcpy.
It was likely disabled during early development. I am not currently aware of any issue preventing its use.
I'm not saying there is an issue in your modification. I just wonder why you are doing this. I understood that you can NOT dynamic allocate memory in Arduino. Perhaps the same applies to ESP32. Thus, you hack bmp_create. In any case, there is no trace in your change. That sucks.
My goal is to add back a proper double buffering implementation so that I can experiment interlacing and progressive update.
I found the original source nofrendo where double buffering implemented in NES emulator application level. It swap primary buffer and back buffer when flushing
https://github.com/rickyzhang82/nofrendo/blob/master/src/vid_drv.c#L359-L362
/* Swap pointers to the main/back buffers */
temp = back_buffer;
back_buffer = primary_buffer;
primary_buffer = temp;
I will allocate two buffers in advance for primary and back buffer and swap them in your custom_blit function. More advance experimental stuffs will be done there later.
PS: Don't blame the source mixed with tab and space. The original nofrendo code use 3 spaces as indentation. It is nice and clear. We should fix it in your upstream. It is very irritating to me.
I did an experiment by restoring primary buffer and back buffer. Swapping them after custom_blit function.
But I saw more serious screen tearing problem than before.
I still tried to figure it out why. But cloning a copy of frame buffer by memcpy before sending it to video queue seems to solve the problem.
@OtherCrashOverride I got stuck in debugging serious tearing problem due to enabling explicit double buffering (see my previous Youtube video)
I made two changes:
custom_blit
is called. memcpy
in custom_blit
and directly send frame buffer for SPI data transfer.Reading through ili9341_write_frame_nes
, this function won't return until SPI transaction is completed. This is done by polling spi_device_get_trans_result
.
Between xQueueSend from custom_blit function and xQueueReceive from video task, they should synchronize that xQueueSend have to wait until xQueueReceive clear the slot.
This alone enforce that before buffer swapping is done, current frame buffer data (either primary or back) has been sent over SPI. This guarantees that NO overwrite in frame buffer during transferring.
But the test show otherwise. Is it possible that the completion of transferring SPI data doesn't mean that LCD finish display the result?
See snippet of custom_blit and video_task.
static void custom_blit(bitmap_t *bmp, int num_dirties, rect_t *dirty_rects) {
if (bmp->line[0] != NULL)
{
xQueueSend(vidQueue, &(bmp->line[0]), portMAX_DELAY);
ESP_LOGD(TAG, "xQueueSend bmp(%p) to queue.", (void *)bmp->line[0]);
}
}
and
static void videoTask(void *arg) {
uint8_t* bmp = NULL;
while(1)
{
BaseType_t hasReceivedItem = xQueuePeek(vidQueue, &bmp, portMAX_DELAY);
if (bmp == 1)
break;
if (hasReceivedItem)
{
ili9341_write_frame_nes(bmp, myPalette);
ESP_LOGD(TAG, "Sent bmp(%p) over SPI.", (void *)bmp);
odroid_input_battery_level_read(&battery);
xQueueReceive(vidQueue, &bmp, portMAX_DELAY);
}
}
Here is my experimental branch: https://github.com/rickyzhang82/go-play/tree/exp-enable-double-buffering
This guarantees that NO overwrite in frame buffer during transferring.
No such guarantees are made. The buffers are passed by reference (pointer) so it is possible to overwrite data while it is being rasterized.
Is it possible that the completion of transferring SPI data doesn't mean that LCD finish display the result?
The completion is asynchronous. There is no concern for their completion other than that a transaction is available for use. This is in contrast to blocking for completion.
@OtherCrashOverride
No such guarantees are made. The buffers are passed by reference (pointer) so it is possible to overwrite data while it is being rasterized.
Let me put it this way. Two different overwrite happens here:
The guarantee refers to point 1. I think this is true as I explain in
Reading through ili9341_write_frame_nes, this function won't return until SPI transaction is completed. This is done by polling spi_device_get_trans_result.
Between xQueueSend from custom_blit function and xQueueReceive from video task, they should synchronize that xQueueSend have to wait until xQueueReceive clear the slot.
My question is that once frame buffer data stays in internal GRAM of LCD controller, are there any guarantee that LCD show the frame before next SPI transaction starts?
As mentioned before, the LCD display will update over 2 times (70 fps) https://github.com/OtherCrashOverride/go-play/blob/73337cb7803f38fe15eacbe09a2816ed9f62009c/odroid-go-common/components/odroid/odroid_display.c#L102 before a single complete frame can be sent over SPI (40Mhz). When the LCD refresh happens, what ever has been sent is display whether its complete or not..
That makes sense to me. Because frame buffer SPI transaction to internal GRAM of LCD controller is not in sync with LCD refresh rate. SPI transferring speed limit the frame rate to be half of LCD refresh rate. I think I should reduce SPI traffic by using techniques like interlacing and minimal rectangle update.
In any case, I found that if I follow your old way using memcpy to make a copy of primary/back frame buffer into a separate frame buffer in the heap and then do SPI transfer from that separate frame buffer, I don't see any severe tearing like my Youtube video. Note that the separate frame buffer is not defined as DMA_ATTR
as primary/back frame buffer does. So I really don't get the secret of your dark magic. Your way should be under the same constrain that you stated in previous thread as well. But somehow yours work.
An off-topic: I saw that you throttle rendering frame rate in nes.c.
bool renderFrame = ((skipFrame % 2) == 0);
nes_renderframe(renderFrame);
system_video(renderFrame);
if (skipFrame % 7 == 0) ++skipFrame;
++skipFrame;
I saw NES emulator 60FPS in output. Does it imply you throttle rendering frame rate in 30 fps or something? I don't quite get if (skipFrame % 7 == 0) ++skipFrame;
.
DMA is never done from the frame buffer. The data has to be processed by the CPU first.
As stated in the first post, the SPI LCD has a maximum update rate of 30fps. The emulators run at 60fps and discard frames. The "skipFrame % 7" introduces an irregularity to whether odd or even frames are discarded.
DMA is never done from the frame buffer. The data has to be processed by the CPU first.
I see your point. It is ili9341_write_frame_nes
. I will carefully review it.
Thank you @OtherCrashOverride & @rickyzhang82 for keeping this discussion public. I've been reviewing the current state of go
and have been testing a few things as there are a ton of ILI9341
libraries, some of which stem from Adafruit's OSS library.
I'm going to carefully watch this thread as the tearing is probably the biggest limitation to this dev kit that people will complain of. Thank you for your hard work =)
@OtherCrashOverride I have several interesting findings in my experimental branch:
After reviewing your code in ili9341_write_frame_nes
function in odroid_display.c
. You are correct that DMA is done in line array there. So I don't have to define DMA_ATTR
attribute for primary/back frame buffer. I think I will implement interlacing and delta update there. That's the right spot to try that logic.
I read FreeRTOS doc to learn task and queue. I added ESP_LOGD
to confirm that my implementation of double buffering doesn't introduce race condition of read/write in double frame buffer between xQueueSend
and xQueueRecieve
where swapping frame buffer and DMA from line
array to LCD controller are done. In fact, I suspect that the serious flickering in my previous Youtube video is caused by the speed differences between LCD refresh speed and SPI transferring speed. I can see that printing out from ESP_LOGD
slow down frame speed from 60 fps to 47 fps. The serious flickering has gone. See Video
@jaygarcia It is not my ideas at all. I got two ideas interlacing and delta update from juj's repository. His ili9341 work build on RasPi. But the improvement algorithm may apply to ESP32.
hi @rickyzhang82 i was also looking at the same interlacing example and was wondering if the overhead of transactions was too high for interlacing to work well, as each row "paint" would require a whole new transaction, resetting the starting address w/ the LCD and it's required overhead.
In regards to your point #2 above, are you swapping memory pointers at the end of of the SPI transfers? E.g. After send_continue_wait();
call?
Current implementation ili9341_write_frame_nes
sends 5 lines in every SPI transaction using memory continue write command 0x3c
, while the proposed interlacing or delta update (minimal rectangle update) will use at least 3 additional SPI transaction 0x2a
(column address), 0x2b
(page address) and 0x2c
(memory write) for each line/rectangle update. For interlacing, if the overhead exits, I'd estimate that at least 2.5 times more SPI transaction in writing memory than the original approach. I don't know the overhead until I get it implemented.
I swapped the buffer after xQueueSend
. By the time there is one and only one empty slot available in the vidQueue
, it implies that xQueueReceive
has been done so that the slot has been freed. Because ili9341_write_frame_nes
finished before xQueueReceive
. send_continue_waite
must been done within ili9341_write_frame_nes
. So the answer is definitely YES. See code
Thanks @rickyzhang82 =). I've been trying to to follow your fork in addition to learning the codebase and your explanation has helped me =)
Sorry I'm trying to understand, does this mean the tearing issue might be fixed in the future?
@OtherCrashOverride
I encountered heap corruption when using spi_device_get_trans_result
to synchronize dual line
buffer in DMA over SPI bus.
I suspected that loading states from uSD in core 0 over SPI causing the problem Because the heap corruption happened around the time it tries to load saved state from uSD card.
I saw the original send_continue_line
and send_continue_wait
in odroid_display.c
using spi_device_get_trans_result
to sync after the whole screen data transfer is done. But spi_device_get_trans_result
only calls one time only. I don't have a doc to explain if transaction number should match between spi_device_queue_trans
and spi_device_get_trans_result
. But the sample code from ESP-IDF does match.
@JasonB32 It is pure experimental work. I'm working on interlacing now.
Ohh you're a very good man @rickyzhang82 Thank you!!!
@rickyzhang82 The LCD SPI support has been reworked in the smsplusgx branch.
@OtherCrashOverride did you notice a significant improvement in migrating from task queues to semaphores? =)
@OtherCrashOverride
I noticed you convert task queue into semaphore.
Using bitmask 0x80
on t->user
to release semaphore on line buffer looks like a good trick.
The original approach relies on spi_device_get_trans_result
doesn't seem to work at all. Once the number of spi_device_queue_trans
goes up, it is hard to match spi_device_get_trans_result
to synchronize correctly.
PS: I commented out loading uSD card loading code. The heap corruption error is not related to shared SPI bus between LCD and uSD card. I will look into your approach and see if I can borrow your idea to make sure spi_device_queue_trans
and spi_device_get_trans_result
matches.
@jaygarcia I tried on smsplusgx branch. I didn't see obvious improvement in terms of tearing.
I figured out the heap corruption problem. It is because of mismatch between spi_device_queue_trans
and spi_device_get_trans_result
. Every one spi_device_queue_trans
function call should have one and only one correspondingspi_device_get_trans_result
.
In any case, I implemented interlacing on my exp-interlacing branch.
The result doesn't look great as I thought. This reminds me of the old days when my using Sony DV cassette tape, importing video by firewire and converting them by ffmpeg with interlacing enabled.
See demo video. But this is a good start for the next progressive update with minimal delta.
@jaygarcia, I don't see SPI transaction overhead is an issue so far. I can see that it can stay around 60 fps in interlacing.
@OtherCrashOverride I don't migrate my interlacing experiment to your new branch as I haven't seen the real benefit yet. I may revisit it when I finish progressive update next.
@jaygarcia The semaphore implementation is both more reliable and also more efficient (faster).
@OtherCrashOverride
The previous task queue design did have reliability issue but this is due to the fact that mismatch between spi_device_queue_trans
and spi_device_get_trans_result
.
I'm more curious on the efficiency in your new implementation. Have you done any empirical performance test in semaphore vs task queue?
Empirical evidence was collected by running emulation tests with audio disabled and observing the frame rate. Evidence can also be observed in the slope change of the tearing patterns present.
@OtherCrashOverride I skimmed through the FreeRTOS implementation in semaphore and task queue. It looks like they are both calling the same queue API, which it implies two implementation are the same things. Thus, I doubt there is real benefit in terms of performance between semaphore and task queue. For record, my conclusion is based on 'skimming through' source code. It might be wrong.
Could you please tell me how you can do a benchmark test like the slope change and etc?
I did find that it computes FPS. But I want to know how can it only measure CPU 0 can tell CPU1 performance. Also, I wonder when spi_device_get_trans_result
blocking call is done (assuming max delay), does it imply DMA is done.
thanks
The new implementation is a rewrite. The performance difference is not due to semaphore/task. It is due to a new algorithm.
The video frame rate is controlled by the sound in all emulators. Disabling sound allows measurement of video performance. https://github.com/OtherCrashOverride/go-play/blob/b4d616390d9ad944e89c039bbd9aa11f504ea4eb/odroid-go-common/components/odroid/odroid_audio.c#L196-L202
To eliminate other factors, create a test program that just displays a blank frame and measures the performance.
That's a good idea. I will take a close look of your new implementation this weekend. Thanks for pointing it out. If possible, would you share the test program as well?
I found that you tried to use two ways to synchronize SPI DMA completion. Reading through ESP-IDF doc, there seem to be two ways to do this.
spi_device_queue_trans
function to send DMA transaction to the queue, invoke the corresponding number of spi_device_get_trans_result
with max ticks to wait.spi_device_interface_config_t
, set call back method in post_cb
member. The call back method synchronize frame buffer swap. I implemented method 1 in dual buffer. But there are scars in screen. It seems that spi_device_get_trans_result
returns before DMA complete. Thus, it overwrites what it is being DMA. ESP-IDF documentation does states that spi_device_get_trans_result
returns when transaction complete. But it is not clear if transaction completion is the same as DMA transmission completion. In any case, it is purely my guess. I don't have any good ways to verify it. I posted my questions to ESP forum and see if anyone encounter this problem.
PS: Can you tell me the mark down syntax to quote code like you did? I love it. But I can't find the syntax.
I do not have a test program to share. I do modifications in a branch (that is deleted) for testing.
As further evidence of the performance increase, the SMS/GG display functions now display the full 320x240 frame. Previously, they cropped lines due to performance.
There is no markdown to quote code. I just pasted the link to the code, and its automatically done by github.
Thanks for point it out. The current master branch used two methods to synchronize DMA.
spi_device_get_trans_result
I'm not sure which one is the right way to sync with DMA transmission yet. But using one should be good enough. Also, the number of spi_device_queue_trans
should match the number of spi_device_get_trans_result
.
Does the LCD have a VSYNC line that's exposed that the CPU can observe so we can just paint at the proper time? I'm guessing no?
the datasheet for the ILI9341V shows a VSYNC input... so is it just a matter of finding that wire on the LCD panel or flex connector and using some 'conductive glue' (I just ruined a flex connector a few months ago when I tried soldering onto it), then connecting that to a GPIO and rigging up the code to toggle?
Oh, it's an input? Often it's a signal from the driver... how does it work now if we aren't providing it with a VSYNC input at all? I'm not sure how it's an input unless the LCD hardware itself has double-buffered RAM or something.
Of course if the tearing isn't actually VSYNC related that might not help much. VSYNC can't hurt though.
@yyyc514 check the third reference to vsync in the PDF datasheet posted at the top of this issue. It is definitely an input, and specifies Fix to VDDI or VSS level when not in use.
so presumably it's just free-running. With this being the case, aside from needing the emulator to toggle a GPIO at the right time, the issue moves from the ILI9341V
to the actual LCD module being shipped. I checked hardkernel last night and didn't see an item number specified. I will have to open my unit up and try googling whatever text I find. I guess ordering a replacement screen is something I need to do, since testing/upgrading will likely involve cutting a trace and reworking it to emulator-controlled GPIO.
All, there aren't any more free pins on the ESP32. A pin would need to be taken away from something else. The only place I could think of are the pins on the external header. Maybe pin4 (IO15)? That should still be available even when using an external DAC I believe. I would rather see it fixed in software but the NES code is a mess. Perhaps someone can port a different emulator to the ESP32.
Oh and a build that takes a pin from the external header will never be officially supported or included in Go Play. Just something to think about.
@kamotswind the tearing has nothing to do with emulation. How would porting a different emulator fix this?
The VSYNC on the LCD controller is only available when the panel is being driven in parallel RGB mode. The ODROID-GO uses SPI mode and no sync signals are available.
[edit] FYI, there are not enough pins on ESP32 to drive the LCD in RGB mode.
Is the interlacing part still worked on? In my eyes this is the only way to reach 60 Hertz rendering.
This issue is addressed in a different branch. I will need to backport the work to the main branch. Tearing is significantly reduced and an improved scaler was also implemented. Anyone wanting to evaluate it can do so by building the following branch: https://github.com/OtherCrashOverride/go-play/tree/feature-nes-standalone
Tearing is significantly reduced and an improved scaler was also implemented.
Can you point to which commits theses were? I did try to skim the log but couldn't find it.
@yyyc514 I felt the same way.
The new scaler looks nice (odroid-go-common/components/odroid/odroid_display.c). Looks like instead of just mindlessly blending every adjacent pixel 50% that it actually figures out what % of which pixel should be blended... so since we're going from 256 -> 320 that very first native pixel would be 100% of NES pixel 1 but then the second pixel would be 25% NES pixel 1 and 75% NES color from pixel 2. At least that's my understanding from skimming.
Not sure I can understand what change reduced the tearing though.
@kamotswind Thanks for pointing it out.
After reading the diff, I have to say the coding style is far from ideal. Commenting out experimental code and careless spacing should and must be eliminated in Git. That's why I don't feel this project will welcome PR from others. To me, It is a disaster to work on any open source project like this.
Sorry about my rant. Back to the topic. If this diff does do the magic to reduce tearing, I can disable blending completely to verify it. Correct it if I'm wrong. But I doubt it is blending. I remembered I disabled blending before but tearing still showed up.
Regarding to my experiment on interlacing, it did not look that great at all on a small screen where you look at it closely for a long period of time. I gave up trying the 2nd approach which computes the minimum square (or squares) update between two consecutive frame. Perhaps I will try it out later on during Winter holiday.
As someone pointed this out, it is likely hardware limitation (no wiring to vsync) which makes software fix impossible to address tearing problem. If this is the case, we need to move on.
Just to note, I'm also working on this here: https://github.com/Cwiiis/go-play/tree/wip/cwiiis/smooth - I have dual goals of a partial updates/interlaced video update path and full-speed NES emulation. Currently my branch has interlaced, unscaled output on the NES only (the other emulators won't build at the moment).
Just to note, I've switched branches to https://github.com/Cwiiis/go-play/tree/wip/cwiiis/partial_updates and have been making some great progress. I've still got a few ideas to try out to get a bit more from this before tidying it up, implementing scaling and updating all the other emulators. I'm not sure that I'm going to be able to do any better than nearest-neighbour scaling, we'll have to see...
I want to reduce tearing problem in NES. I copied our discussion from odroid forum to here
How it works now
The way the display system currently works:
meanwhile on core 1:
meanwhile on DMA:
There are two frame buffers, two scanline buffers of 3 lines, and multiple DMA buffers.
At 40Mhz SPI, there is 40,000Kbps / 8 bits = 5KBps. A frame is 320 240 2 bytes / 1024 = 150KB. 5KBps / 150KB = 33.3 fps. However, because there is protocol overhead, the achieved frame rate is less (closer to 30fps). The LCD updates at 70fps (61 is the slowest but looked worse). 70fps / 30fps = 2.3 refreshes for every frame displayed.
LCD controller specs
IL9341 specs
Other's improvement on IL9341 over SPI
fbcp-ili9341 repo
My proposal