
vc4: Low RetroArch performance #57

Closed by anholt 8 years ago

anholt commented 8 years ago

From a mail from a user:

Unfortunately, I did run into a fairly major issue when using the VC4 driver with RetroArch. RetroArch's GL/GLES rendering path is pretty well tested by now, so I'm guessing this is a VC4 issue. The problem is a performance issue, where more taxing games/emulators seem to run at approximately half the FPS (30 instead of 60 FPS) and less taxing (like Nestopia) vary between full FPS and half FPS. When exiting the emulator, the console is full of printouts regarding missed vblanks: http://i.imgur.com/0NY4voD.jpg

Need to identify a particular testcase and see what's going on.

Brunnis commented 8 years ago

Thanks for creating this issue. I'll try to describe the issue I've observed as best as I can.

First of all, RetroArch recently implemented a setting called video_max_swapchain_images. This setting controls the swap chain for Vulkan, but it also doubles as a switch between double and triple buffering when using a DRM context. It so happens that setting this to 2 (double buffering) instead of 3 (triple buffering) reduces input lag. I'm by no means an expert on this, but I have a hard time seeing how a properly implemented triple buffering scheme would perform worse in terms of input lag. However, another LibRetro user confirmed on GitHub that a setting of "2" improved input lag by half a frame when using Intel graphics. I can personally confirm (through camera measurements) that a setting of "2" improves input lag by a full frame on the Raspberry Pi with the VC4 graphics stack (whether using the OpenGL or OpenGL ES path).

See this issue for the discussion that led to implementing this setting: https://github.com/libretro/RetroArch/issues/3100 and this commit (see the changes to gfx/drivers_context/drm_ctx.c): https://github.com/libretro/RetroArch/commit/8a8fadc5b2e615fec0961d2e62e21cc731d653d1

Now for the actual issue: I've been testing the VC4 driver with RetroArch using both the OpenGL and OpenGL ES paths. Both of these fail to run correctly with video_max_swapchain_images=2, i.e. double buffering. What happens is that emulation speed is basically cut in half. There may be dropped frames as well, but I believe it's primarily a case of slowdown.

Also, performance varies between full speed and half speed during emulation. It seems as if games/emulators that are hard to emulate are more prone to running at half speed most or all of the time. Games/emulators that are easier to emulate may run mostly at full speed, with sudden drops to half speed that can last for a few seconds before jumping back to full speed again.

When exiting the game, the console is full of printouts from RetroArch, complaining about missed vblanks. Example:

RetroArch [INFO] :: [KMS]: Missed 1 VBlank(s) (Frame: 2205, DRM frame: 250598).

Screenshot: http://i.imgur.com/8RodaMF.jpg

As you can see from the screenshot, every other vblank appears to be missed. This fits well with the observation that emulation runs at half speed.

Finally, changing over to video_max_swapchain_images=3, i.e. triple buffering, appears to mostly fix the issue. I've noticed a few printouts regarding missed vblanks, but games appear to run subjectively fine. Of course, using video_max_swapchain_images=3 adds a full frame of input lag and is completely unnecessary for this use case.

I've experienced this issue both when emulating SNES and NES, as well as when just running RetroArch's settings GUI ("RGUI").

For testing, I recommend building RetroArch yourself, since it allows you to make builds for both OpenGL and OpenGL ES as well as use RetroArch's plain "drm" video driver.

Required libraries:

libudev-dev libdrm-dev libgbm-dev libgl1-mesa-dev libgles2-mesa-dev

  1. Move the old BCM libs out of RetroArch's reach. To do so, move /opt/vc to /opt/vc_backup. This way, RetroArch's build system doesn't find and link to the old libs.
  2. Clone RetroArch git repository: sudo git clone https://github.com/libretro/RetroArch.git
  3. cd into the folder with the RetroArch sources that was just created.
  4. Configure RetroArch build:

    ./configure --enable-kms --enable-opengles --enable-plain_drm --disable-vg --disable-dispmanx --disable-x11 --disable-wayland --disable-sdl2 --disable-al --enable-neon --enable-floathard --disable-cheevos --disable-ffmpeg --disable-networking --disable-libretrodb --disable-materialui --disable-xmb --enable-udev --disable-sdl --disable-pulse --disable-oss --disable-freetype --disable-7zip --disable-libxml2 --disable-imageviewer --disable-rjpeg --disable-rpng --disable-rbmp --disable-rtga

    Note: To make RetroArch use OpenGL instead of OpenGL ES, change --enable-opengles to --enable-opengl in the configure command above.

  5. make
  6. Make a copy of RetroArch's default settings file: sudo cp retroarch.cfg retroarch.cfg.original
  7. Edit RetroArch's configuration: sudo nano retroarch.cfg. Make sure the following lines are uncommented and that the values match:

    
    video_driver = "gl"  
    video_max_swapchain_images = "2"  
    log_verbosity = "true"
  8. Download the desired emulator ("core") from RetroArch's nightlies: http://buildbot.libretro.com/nightly/linux/armhf/latest/. I've used snes9x_next_libretro and nestopia_libretro for my tests. Extract the emulator cores into the RetroArch folder.
  9. Launch RetroArch with the following commands (replace _emulator_corename and _romname with the actual file names):
    Settings GUI: ./retroarch -c retroarch.cfg
    Game: ./retroarch -L _emulator_corename -c retroarch.cfg _romname

Additional info:

Brunnis commented 8 years ago

Here's a small update which might be helpful:

I tried instrumenting the RetroArch code with timers to see if I could find a place where enough time was spent to miss a vblank. The (unmodified) RetroArch code I will be referring to below is located in:

https://github.com/libretro/RetroArch/blob/master/gfx/drivers_context/drm_ctx.c
https://github.com/libretro/RetroArch/blob/master/gfx/common/drm_common.h

After a frame has been rendered, RetroArch calls the function below (from drm_ctx.c). A framebuffer flip is queued up by calling gfx_ctx_drm_queue_flip() (which in turn calls drmModePageFlip()):

static void gfx_ctx_drm_swap_buffers(void *data)
{
   gfx_ctx_drm_data_t *drm = (gfx_ctx_drm_data_t*)data;
   settings_t    *settings = config_get_ptr();

   switch (drm_api)
   {
      case GFX_CTX_OPENGL_API:
      case GFX_CTX_OPENGL_ES_API:
      case GFX_CTX_OPENVG_API:
#ifdef HAVE_EGL
         egl_swap_buffers(&drm->egl);
#endif
         break;
      default:
         break;
   }

   /* I guess we have to wait for flip to have taken 
    * place before another flip can be queued up.
    *
    * If true, we are still waiting for a flip
    * (nonblocking mode, so just drop the frame). */
   if (gfx_ctx_drm_wait_flip(drm->interval))
      return;

   waiting_for_flip = gfx_ctx_drm_queue_flip();

   /* Triple-buffered page flips */
   if (settings->video.max_swapchain_images >= 3 &&
         gbm_surface_has_free_buffers(g_gbm_surface))
      return;

   gfx_ctx_drm_wait_flip(true);  
}

After the call to gfx_ctx_drm_queue_flip(), gfx_ctx_drm_wait_flip(true) is called to wait for the flip to be signalled via the DRM file descriptor. This is the function that polls the file descriptor and subsequently calls drmHandleEvent():

static INLINE bool drm_wait_flip(int timeout)
{
   g_drm_fds.revents = 0;

   if (poll(&g_drm_fds, 1, timeout) < 0)
      return false;

   if (g_drm_fds.revents & (POLLHUP | POLLERR))
      return false;

   if (g_drm_fds.revents & POLLIN)
   {
      drmHandleEvent(g_drm_fd, &g_drm_evctx);
      return true;
   }

   return false;
}

From what I can see when timing the code, the poll(&g_drm_fds, 1, timeout) call in the function above takes more than 16.67 ms. I measured ~22 ms. So it actually takes more than one frame period for the flip to be signalled.
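
For reference, here is roughly how I instrumented it (a minimal sketch of the modified drm_wait_flip(); the time_usec() helper and the fprintf() are my own additions, everything else is RetroArch's existing code and globals from drm_common.h):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Monotonic timestamp in microseconds (my own helper). */
static int64_t time_usec(void)
{
   struct timespec ts;
   clock_gettime(CLOCK_MONOTONIC, &ts);
   return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

static INLINE bool drm_wait_flip(int timeout)
{
   int64_t start;

   g_drm_fds.revents = 0;

   start = time_usec();
   if (poll(&g_drm_fds, 1, timeout) < 0)
      return false;

   /* Print how long we sat in poll() waiting for the flip event. */
   fprintf(stderr, "poll() took %lld us\n",
         (long long)(time_usec() - start));

   if (g_drm_fds.revents & (POLLHUP | POLLERR))
      return false;

   if (g_drm_fds.revents & POLLIN)
   {
      drmHandleEvent(g_drm_fd, &g_drm_evctx);
      return true;
   }

   return false;
}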

EDIT: The behavior is really weird... It's almost as if a frame needs to be queued up a certain time before the flip event occurs for it to be used. And that certain time appears to be relatively long, i.e. several milliseconds. Setting RetroArch's video_max_swapchain_images setting to "3" enables an additional buffer (sort of triple buffering, but it's just a frame queue) that allows RetroArch to run ahead, which in turn enables it to queue up frames to the DRM earlier. This improves performance a lot and removes maybe 90-95% of the missed vblanks, but it's only a reasonably effective work-around and not an actual solution.

Also, as a sanity check, I just tested RetroArch with DRM/KMS on Ubuntu 16.04 with my Core i7-6700K using the integrated graphics. Using the same settings as on the Raspberry Pi, the Core i7 provides perfect performance and no missed vblank printouts on the console.

anholt commented 8 years ago

Thanks for the detailed explanations. It's been a busy week, hopefully I'll get to this soon.

Brunnis commented 8 years ago

Thanks Eric! I look forward to your findings.

In the meantime, I'll add a couple of additional observations:

The terminal printout informing of a missed vblank actually worsens the issue, since printing to the terminal is really slow. Imagine the following case: the system is constantly on the edge of being able to churn out a frame in time for a page flip. A particularly demanding frame causes the system to (legitimately) miss a vblank. This causes a printout in the terminal. This printout is really slow (several milliseconds), and the time taken to print the message eats right into the time available for the next frame. Since we were already on the edge of rendering each frame on time, the time lost in printing the message causes us to miss the next vblank as well. Which in turn causes another slow printout, which... etc, etc.

While removing this RetroArch printout doesn't solve the base issue, it should probably be removed anyway, to prevent occasional missed flips from causing this prolonged state of low performance. For the measurements presented below, this printout was disabled.

I've also made another few tests with timers and counters. I measured the average poll() duration as well as the average frame render time excluding the poll() over each 1000 generated frames.

I ran the test during a game sequence that is not very demanding and subjectively appeared to run almost perfectly. The results:

Average poll() duration: 8.003 ms
Average frame time: 9.263 ms
Missed vblanks (1000 frames): 36

Only 3.6% of the Vblanks were missed and the sum of the average poll() duration and frame time is, expectedly, just north of one frame period.

Now, RetroArch has a nifty feature to delay the running of the emulator (including input polling) by up to 15 ms. So, after a flip event, RetroArch can pause for the specified number of milliseconds and then run the emulator. This feature effectively reduces input lag by the same amount of time, since input polling is pushed closer to when the frame will actually be queued up to the graphics driver. However, it obviously also requires a faster system, since there is less time to generate the frame in time for the flip. The setting is called video_frame_delay in retroarch.cfg.
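
Conceptually, the loop looks something like this (a simplified sketch of the idea, not RetroArch's actual code; the helper names are hypothetical stand-ins):

#include <unistd.h>

/* Hypothetical stand-ins for the real RetroArch functions. */
static void wait_for_flip(void);
static void poll_input(void);
static void run_core_one_frame(void);
static void queue_flip(void);

static void frame_loop(unsigned frame_delay_ms)
{
   for (;;)
   {
      wait_for_flip();                /* previous frame hits the screen */
      usleep(frame_delay_ms * 1000);  /* video_frame_delay: start later... */
      poll_input();                   /* ...so input is sampled closer
                                         to when the frame is shown */
      run_core_one_frame();           /* emulate and render the frame */
      queue_flip();                   /* hand the new frame to DRM */
   }
}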

I tried performing the same measurements as I did above, but with a frame delay of just 1 ms:

Average poll() duration: 18.157 ms
Average frame time: 10.608 ms
Missed vblanks (1000 frames): 725

So, while the frame render time increased by ~1 ms as expected and is still far below 16.67 ms, the average poll() duration increased by 10 ms. The reason for this is obviously that we now miss the first Vblank for more than 70% of the frames.

These measurements don't really solve anything, but are further proof that something about queuing up frames and/or polling for flips needs a lot of time.

Brunnis commented 8 years ago

So, I had another idea today (yeah, when an issue bothers me I have trouble getting it out of my head)... I thought "What could be taking up so much time after a frame is queued up?". The first thing that came to mind is that the driver needs to process or copy the framebuffer in some way. Copying or otherwise processing a 1920x1080 or 1920x1200 framebuffer (those are the resolutions of the monitors I usually have my Pi 3 connected to) on this relatively weak hardware could actually take a considerable chunk out of our 16.67 ms frame period.

The simplest way of testing my theory would be to just lower the screen resolution to reduce the size of the framebuffer. I spent some time trying to change the resolution by modifying cmdline.txt (adding video= and various resolutions and refresh rates) without success (monitor complained of wrong timings). I ended up just connecting the Pi to my old 720p Samsung plasma. This thankfully worked and the Pi booted up in 1280x720.
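
For reference, the syntax I was attempting follows the kernel's modedb format, video=<connector>:<xres>x<yres>[M][@<refresh>], so a 720p attempt in cmdline.txt would look something like this (assuming the Pi's HDMI connector is named HDMI-A-1):

video=HDMI-A-1:1280x720M@60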

My theory proved correct. Most of the slowdown went away and areas that were previously locked to 30 FPS now ran at almost constant 60 FPS. Only a few particularly demanding scenes in the game I use for testing remained slow.

Hope this is of some help. :-)

anholt commented 8 years ago

I started looking into things today. The build I've got, running ./retroarch -c retroarch.cfg against an unmodified retroarch.cfg, crashes after a few frames in gfx_ctx_drm_queue_flip() -> drm_fb_get_from_bo (bo=0x0).

anholt commented 8 years ago

OK, looks like forcing max swapchain to 2 avoids that. Found one little fix, and what looks like a basic frame preparation problem in RetroArch's backend: https://github.com/libretro/RetroArch/pull/3784

Brunnis commented 8 years ago

Thanks for having a look. I tried your fix, but I get strange results. First of all, with the code change, performance is definitely up. Subjectively, though, it's slightly jerky and the framerate seems to "oscillate", for lack of a better word.

When running the code with my modifications that print statistics for every 1000 rendered frames, things again look... weird. Below are two photos of the output. In this test case I was just standing still with the character and nothing else was going on on the screen, i.e. frame render times should not vary much.

Original code: http://i.imgur.com/cpA2Z2Q.jpg
With fix: http://i.imgur.com/cpA2Z2Q.jpg

With the original code, we can see that out of 1000 frames, we missed 1000 vblanks, i.e. the game ran at 30 FPS. We can also see that the actual time to render a frame was on average 12.9 ms. The rest of the time was spent waiting for a flip to be signalled in the poll() call.

With the fix, we now only miss 190 vblanks. The average frame time is virtually identical compared to without the fix. However, the average time spent in poll() went down to just 7063 µs. What's strange about this? First of all, we still miss 1/5 of all vblanks, even though we have a pretty large margin in terms of frame render time.

What's really strange, though, is that the reported average poll() duration doesn't change with different frame render times. I've done several tests with different scenes and poll() duration is extremely stable at 7060 - 7080 µs, even when average frame render times vary by several milliseconds. Here's another screenshot that shows this behavior:

http://i.imgur.com/uQeepjb.jpg

Is the DRM_MODE_PAGE_FLIP_ASYNC flag really what we want to use here? I don't really know much about this, but a quick search suggests that this flag causes the GPU to perform the flip ASAP instead of waiting for vblank, effectively disabling vsync. Could this be why the average poll() duration remains static? I.e. this flag causes the GPU to flip immediately but, for some reason, a flip actually takes ~7 ms to perform and we're stuck in the poll() call until it's done? This 7 ms figure ties in quite well with my previous measurements with the original code, in that ~10 ms is the maximum frame render time allowed before we start missing vblanks (16.67 - 7 ~= 10 ms). Given the strange performance I see when actually playing, combined with the above observations, I'd wager that DRM_MODE_PAGE_FLIP_ASYNC screws up vsync and just somewhat hides the real problem.
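
For reference, the libdrm call in question looks something like this; the difference is just in the flags argument (a sketch, not RetroArch's exact code):

#include <stdbool.h>
#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

static int queue_flip(int fd, uint32_t crtc_id, uint32_t fb_id,
      void *user_data, bool async)
{
   /* DRM_MODE_PAGE_FLIP_EVENT: deliver a page-flip event on the DRM
    * fd when the flip completes (this is what poll() waits for). */
   uint32_t flags = DRM_MODE_PAGE_FLIP_EVENT;

   /* DRM_MODE_PAGE_FLIP_ASYNC: ask the driver to flip as soon as
    * possible instead of waiting for vblank, i.e. no vsync (and
    * possible tearing, where the hardware supports async flips at
    * all). */
   if (async)
      flags |= DRM_MODE_PAGE_FLIP_ASYNC;

   return drmModePageFlip(fd, crtc_id, fb_id, flags, user_data);
}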

I actually believe the RetroArch code is correct as is. I think everything so far points to the issue being that something in the generic kernel DRM code or the VC4 DRM driver requires a flip to be queued ~7 ms (at least on a Raspberry Pi 3) before the actual flip is made by the GPU during vblank.

EDIT: I guess this could be easier to test in a more deterministic way using a simple application like Rob Clark's kmscube. From what I can see, kmscube should show the same issue if the frame render time is increased by adding a delay. The reason it runs fine now is just that it renders each frame far faster than the ~10 ms that seem to be required for things to start going south.
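
Something along these lines in kmscube's main loop ought to reproduce it (a sketch; the exact structure of kmscube's loop may differ slightly):

/* Excerpt of a kmscube-style main loop, with an artificial delay
 * to simulate a heavy frame (~12 ms of CPU-side work per frame): */
while (1)
{
   draw(i++);                               /* render the frame with GL */
   usleep(12 * 1000);                       /* simulated render cost */
   eglSwapBuffers(gl.display, gl.surface);  /* submit to the GPU */
   /* ...then lock the new gbm bo, drmModePageFlip(), and wait for
    * the page-flip event before the next iteration... */
}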

In that case, RetroArch should probably be removed from the issue heading, since it's more of a general issue.

anholt commented 8 years ago

Crap. Yeah, I was thinking of the internal flag that got renamed to "nonblock" recently. The pageflip we're doing already sets that flag.

In other tests, I can render 1920x1080 at about 190 fps, which is to say about 5 ms per frame. So if your CPU-side cost is 12 ms, and you then hand off to the GPU for work that takes 5 ms, you've exceeded your budget. You need the GPU and CPU to be doing work at the same time, which means you need to let the CPU proceed while the last frame is on the GPU. Triple buffering is one way to get that. The other would be to drop the gfx_ctx_drm_wait_flip(true) in gfx_ctx_drm_swap_buffers() and delay it until the start of the next frame being handed to GL.
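
A sketch of that second option (the function names here are illustrative stand-ins, not RetroArch's actual API):

#include <stdbool.h>

/* Hypothetical stand-ins for the real context functions. */
static void egl_swap(void);
static void queue_flip(void);
static void wait_flip(bool block);

static bool waiting_for_flip;

/* Queue the flip, but do NOT block on it here, so the CPU can start
 * emulating the next frame while the GPU/scanout still owns the
 * previous buffer. */
static void swap_buffers(void)
{
   egl_swap();               /* submit this frame's GL rendering */
   queue_flip();             /* schedule the page flip */
   waiting_for_flip = true;  /* remember a flip is in flight */
}

/* Block only at the start of the next frame, right before GL would
 * render into the buffer that may still be scanned out. By then the
 * CPU-side work for the next frame has already overlapped with the
 * GPU/scanout of this one. */
static void begin_frame(void)
{
   if (waiting_for_flip)
   {
      wait_flip(true);
      waiting_for_flip = false;
   }
}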

Themaister commented 8 years ago

Yes, if we have pure, hard double buffering and CPU + GPU together take more than 16 ms, it's indeed not going to work. Otherwise, we need triple buffering to pipeline the CPU and GPU.

Brunnis commented 8 years ago

In other tests, I can render 1920x1080 at about 190fps, which is to say about 5 ms. So if your CPU side cost is 12ms, and then you hand off to the GPU that's going to take 5ms, you've exceeded your budget.

Okay, so if I understand correctly, those ~7 ms that currently appear to be needed between queuing a flip and actually completing it are simply what the GPU needs, and there's not much to be done about it? I was kind of afraid of that.

It would have been cool if an extra buffer could have been avoided, since it shaves off a whole frame of input lag in the common case. I guess the best way of getting max_swapchain_images=2 to perform well is to simply go with an x86 setup. Any Bay Trail or faster hardware should probably be able to do it.

You need the GPU and CPU to be doing work at the same time, which means you need to let the CPU proceed while the last frame is on the GPU. Triple buffering is one way to get that. The other would be to drop the gfx_ctx_wait_flip(true) in gfx_ctx_drm_swap_buffers() and delay it until the start of the next frame being handed to GL.

Yep, triple buffering is already implemented and it is used when you set video_max_swapchain_images to 3. With this setting, RetroArch skips the wait after flip and allows the emulator to run ahead. I have no idea why it didn't work for you when you tried it, but I have successfully tested it with VC4 myself. It indeed masks the long GPU processing time and provides good performance, with one extra frame of input lag as previously mentioned.

anholt commented 8 years ago

The blocking on the last frame in double-buffer mode is only required to make sure that new GL rendering doesn't touch the buffer that is still being scanned out. Given that RetroArch doesn't do any GL rendering to that buffer until very late in its frame preparation, it could delay blocking until then. But this is something that should be tracked in RetroArch's bug tracker, not here.