egnor / pivid

Experimental video code for Linux / Raspberry Pi
MIT License
28 stars 4 forks source link

"Slow update" and "No space left on device" eventually with multiple displays #7

Open egnor opened 2 years ago

egnor commented 2 years ago

Forum thread on "slow update", including multiple-display connection: https://forums.raspberrypi.com/viewtopic.php?t=332868

Forum thread on (what's probably) a different cause of "no space": https://forums.raspberrypi.com/viewtopic.php?t=333320

Observed: When doing nontrivial (maybe even trivial?) output to multiple displays, over time pivid starts complaining about "Slow update" events (KMS commit taking more than a frame). Over time this worsens until it starts reporting "No space left on device" (kernel rejects KMS commit), and visual artifacts become apparent and numerous.

The worsening appears to persist with the Pi in question (restarting pivid doesn't reset things), though it is reset (temporarily) by rebooting.

The easiest repro case is to run a multiple-output test like https://github.com/egnor/pivid/blob/main/test_media/jellyfish_2disp.json (from @aubilenon) and let it run for a few minutes; "Slow update" messages usually crop up after a loop or two.

egnor commented 2 years ago

Here's example from the simple jellyfish_2disp.json script, running at 60hz updates to exacerbate the problem...

At the start of this example, both displays have just had had a vsync page flip, finishing their previous updates:

19:45:49.582   4774u D [display] HDMI-2 u191 flip! 19:45:49.578 (m58237.839)
19:45:49.582     74u D [display] HDMI-1 u192 flip! 19:45:49.582 (m58237.842)

Note that the vsync events are ~3ms apart-- the two outputs generally are not synchronized even with the same video mode. Since HDMI-1 has become available, pivid immediately requests an update for the next vsync:

19:45:49.582     38u T [display] HDMI-1 status: Update done
19:45:49.582     36u D [display] UPDATE HDMI-1 1l mbw=8% cbw=18% lbm=25%
19:45:49.582     56u D [display]   pl84: fb343 1920x1080 NV12:BCOM:66004 f34=3.0M:1.9K/136K,1.9K/2.9M "jellyfish-3-mbps-hd-hevc.mkv @2.369" =>0,0
19:45:49.582     44u T [display]   #84 alpha = 65535
19:45:49.582     27u T [display]   #84 CRTC_ID = 94
19:45:49.582     25u T [display]   #84 CRTC_X = 0
19:45:49.582     25u T [display]   #84 CRTC_Y = 0
19:45:49.582     52u T [display]   #84 CRTC_W = 1920
19:45:49.582     77u T [display]   #84 CRTC_H = 1080
19:45:49.582     74u T [display]   #84 FB_ID = 343
19:45:49.582     74u T [display]   #84 SRC_X = 0
19:45:49.582     72u T [display]   #84 SRC_Y = 0
19:45:49.582     76u T [display]   #84 SRC_W = 125829120
19:45:49.582     58u T [display]   #84 SRC_H = 70778880
19:45:49.582     75u T [display]   HDMI-1 u193 committing...
19:45:49.583    313u D [display]   HDMI-1 u193 committed!

Note that the actual atomic commit request takes ~1ms. The hardware vsync was :49.582, it's now :49.583, we should have 15 more milliseconds before the next vsync. Similarly, on to HDMI-2:

19:45:49.583     87u T [display] HDMI-2 status: Update done
19:45:49.583     48u D [display] UPDATE HDMI-2 1l mbw=8% cbw=18% lbm=25%
19:45:49.583     34u T [display] HDMI-1 status: Update still pending
19:45:49.583    107u D [display]   pl95: fb343 1920x1080 NV12:BCOM:66004 f34=3.0M:1.9K/136K,1.9K/2.9M "jellyfish-3-mbps-hd-hevc.mkv @2.369" =>0,0
19:45:49.583    112u T [display]   #95 alpha = 65535
19:45:49.583     53u T [display]   #95 CRTC_ID = 105
19:45:49.583     28u T [display]   #95 CRTC_X = 0
19:45:49.583     26u T [display]   #95 CRTC_Y = 0
19:45:49.583     29u T [display]   #95 CRTC_W = 1920
19:45:49.583     63u T [display]   #95 CRTC_H = 1080
19:45:49.583     29u T [display]   #95 FB_ID = 343
19:45:49.583     25u T [display]   #95 SRC_X = 0
19:45:49.583     24u T [display]   #95 SRC_Y = 0
19:45:49.583     27u T [display]   #95 SRC_W = 125829120
19:45:49.583     26u T [display]   #95 SRC_H = 70778880
19:45:49.584     66u T [display]   HDMI-2 u194 committing...
19:45:49.584    137u D [display]   HDMI-2 u194 committed!

Again, the update request finished quickly. Now it's :49.584. HDMI-2's last hardware flip was :49.578, so we're 6ms in, but that still leaves 10ms until the next vsync. Now pivid is waiting for both updates to complete.

19:45:49.584     70u T [display] HDMI-2 status: Update still pending
19:45:49.588   4247u T [display] HDMI-1 status: Update still pending
19:45:49.589    854u T [display] HDMI-2 status: Update still pending
19:45:49.593   4231u T [display] HDMI-1 status: Update still pending
19:45:49.594    861u T [display] HDMI-2 status: Update still pending
19:45:49.598   4240u D [display] HDMI-1 u193 flip! 19:45:49.598 (m58237.859)
19:45:49.598     42u T [display] HDMI-1 status: Update done

OK, HDMI-1 flipped as expected-- at :49.582 + 16.6ms = :49.498. Note that HDMI-2 should actually have flipped first, since it's running about 3ms earlier. Anyway, now pivid has another update for HDMI-1 (but we're still waiting for the original update

19:45:49.598     38u D [display] UPDATE HDMI-1 1l mbw=8% cbw=18% lbm=25%
19:45:49.598     56u D [display]   pl84: fb351 1920x1080 NV12:BCOM:66004 f19=3.0M:1.9K/136K,1.9K/2.9M "jellyfish-3-mbps-hd-hevc.mkv @2.403" =>0,0
19:45:49.598     42u T [display]   #84 alpha = 65535
19:45:49.598     30u T [display]   #84 CRTC_ID = 94
19:45:49.598     27u T [display]   #84 CRTC_X = 0
19:45:49.598     25u T [display]   #84 CRTC_Y = 0
19:45:49.599     80u T [display]   #84 CRTC_W = 1920
19:45:49.599     82u T [display]   #84 CRTC_H = 1080
19:45:49.599     81u T [display]   #84 FB_ID = 351
19:45:49.599    121u T [display]   #84 SRC_X = 0
19:45:49.599    261u T [display]   #84 SRC_Y = 0
19:45:49.599     85u T [display]   #84 SRC_W = 125829120
19:45:49.599     81u T [display]   #84 SRC_H = 70778880
19:45:49.599     79u T [display]   HDMI-1 u195 committing...
19:45:49.599      0u T [display] Loading start 1920x1080 NV12:BCOM:66004 f28=3.0M:1.9K/136K,1.9K/2.9M "jellyfish-3-mbps-hd-hevc.mkv @2.703"
19:45:49.600   1040u D [display]   HDMI-1 u195 committed!

The HDMI-1 update took 2ms this time instead of 1ms, but it's still fast. So now we've completed HDMI-1's second update (since the start of this excerpt) but are still waiting for HDMI-2's first update, even though the vsync should have been several milliseconds ago.

19:45:49.600    106u T [display] HDMI-1 status: Update still pending
19:45:49.600     85u T [display] HDMI-1 status: Update still pending
19:45:49.600    135u T [display] HDMI-2 status: Update still pending
19:45:49.600    173u T [display] Loading framebuffer...
19:45:49.600    117u D [display] LOADED fb353 1920x1080 NV12:BCOM:66004 f28=3.0M:1.9K/136K,1.9K/2.9M "jellyfish-3-mbps-hd-hevc.mkv @2.703"
19:45:49.605   4657u T [display] HDMI-1 status: Update still pending
19:45:49.605    123u T [display] HDMI-2 status: Update still pending
19:45:49.605     83u W [player] s41 Slow update: 0.005s overdue

Pivid has started noticing that HDMI-2 (aka s41... different code layer, different identifier) has failed to flip within one frame time.

19:45:49.610   4869u T [display] HDMI-1 status: Update still pending
19:45:49.610    124u T [display] HDMI-2 status: Update still pending
19:45:49.610     82u W [player] s41 Slow update: 0.010s overdue
19:45:49.615   4899u D [display] HDMI-2 u194 flip! 19:45:49.612 (m58237.872)

And there's the HDMI-2 flip. The previous observed vsync was :49.578, this one is :49.612, that's 34ms later-- despite having (relatively speaking) all the time in the world, the Linux kernel layer failed to flip the display at the first vsync after the request, and had to wait until the second vsync. Of course since we're trying to push out 60fps, that means we've fallen a frame behind, and if that happens again (which it does) pivid will drop a frame to keep up.

So the question is why, when two displays are running, the Linux KMS layer takes more than a frame interval to flip a screen. Note that the framebuffer is already fully loaded and everything, it should just be a matter of changing some pointers.

egnor commented 2 years ago

Here's a similar example with kernel debugging on. Focusing on HDMI-2, here's pivid's logs (btw, HDMI-1 has slow updates just as often), showing a very similar frame-miss:

20:03:04.660   3885u D [display] HDMI-2 u127 flip! 20:03:04.658 (m59272.919)
20:03:04.661     28u D [display] UPDATE HDMI-2 1l mbw=8% cbw=18% lbm=25%
20:03:04.664    115u D [display]   pl95: fb344 1920x1080 NV12:BCOM:66004 f23=3.0M:1.9K/136K,1.9K/2.9M "jellyfish-3-mbps-hd-hevc.mkv @1.402" =>0,0
20:03:04.664     33u T [display]   #95 alpha = 65535
20:03:04.664     15u T [display]   #95 CRTC_ID = 105
20:03:04.664     13u T [display]   #95 CRTC_X = 0
20:03:04.664     12u T [display]   #95 CRTC_Y = 0
20:03:04.664     12u T [display]   #95 CRTC_W = 1920
20:03:04.664     12u T [display]   #95 CRTC_H = 1080
20:03:04.664     12u T [display]   #95 FB_ID = 344
20:03:04.664     14u T [display]   #95 SRC_X = 0
20:03:04.664     11u T [display]   #95 SRC_Y = 0
20:03:04.664     12u T [display]   #95 SRC_W = 125829120
20:03:04.664     11u T [display]   #95 SRC_H = 70778880
20:03:04.664     11u T [display]   HDMI-2 u130 committing...
20:03:04.666   2134u D [display]   HDMI-2 u130 committed!
20:03:04.666     33u T [display] HDMI-2 status: Update still pending
20:03:04.671     49u T [display] HDMI-2 status: Update still pending
20:03:04.676     47u T [display] HDMI-2 status: Update still pending
20:03:04.684     29u T [display] HDMI-2 status: Update still pending
20:03:04.684     57u T [display] HDMI-2 status: Update still pending
20:03:04.684     24u W [player] s41 Slow update: 0.001s overdue
20:03:04.689    103u T [display] HDMI-2 status: Update still pending
20:03:04.689     59u W [player] s41 Slow update: 0.007s overdue
20:03:04.694   3399u D [display] HDMI-2 u130 flip! 20:03:04.691 (m59272.952)

Enabling maximum kernel debugging (via pivid_play --debug_kernel), here's timestamp-corrected kernel logs (via pivid_inspect_kmsg) for the same time period:

20:03:04.665 🕸️ [drm:drm_ioctl [drm]] comm="pivid:play:s41" pid=13080, dev=0xe200, auth=1, DRM_IOCTL_MODE_ATOMIC
20:03:04.665 🕸️ [drm:drm_atomic_state_init [drm]] Allocated atomic state 60407942
20:03:04.666 🕸️ [drm:drm_atomic_get_plane_state [drm]] Added [PLANE:95:plane-4] 31bb5a70 state to 60407942
20:03:04.666 🕸️ [drm:drm_atomic_get_crtc_state [drm]] Added [CRTC:105:crtc-4] cb391a47 state to 60407942
20:03:04.666 🕸️ vc4-drm gpu: [drm:drm_atomic_set_fb_for_plane [drm]] Set [FB:344] for [PLANE:95:plane-4] state 31bb5a70
20:03:04.667 🕸️ [drm:drm_atomic_check_only [drm]] checking 60407942
20:03:04.667 🕸️ [drm:drm_atomic_get_private_obj_state [drm]] Added new private object dc46d9ca state 4f787d17 to 60407942
20:03:04.667 🕸️ [drm:drm_atomic_get_private_obj_state [drm]] Added new private object ac0cbdf2 state 992e8714 to 60407942
20:03:04.667 🕸️ [drm:drm_atomic_nonblocking_commit [drm]] committing 60407942 nonblocking

So that's setting up the atomic commit, as expected...

20:03:04.675 🕸️ vc4-drm gpu: [drm:vblank_disable_fn [drm]] disabling vblank on crtc 4
20:03:04.678 🕸️ vc4-drm gpu: [drm:vc4_atomic_commit_tail [vc4]] Running the core clock at 282150000 Hz
20:03:04.678 🕸️ vc4-drm gpu: [drm:drm_calc_timestamping_constants [drm]] crtc 105: hwmode: htotal 2200, vtotal 1125, vdisply 1080
20:03:04.679 🕸️ vc4-drm gpu: [drm:vc4_atomic_commit_tail [vc4]] Core clock actual rate: 500000000 Hz
20:03:04.679 🕸️ vc4-drm gpu: [drm:drm_calc_timestamping_constants [drm]] crtc 105: clock 148500 kHz framedur 16666666 linedr 14814
20:03:04.679 🕸️ vc4-drm gpu: [drm:drm_vblank_enable [drm]] enabling vblank on crtc 4, ret: 0

Note for the above, crtc 4 (by index) is CRTC #105 (by ID) is HDMI-2 (by name). By this point (:04.679) we have already passed the vsync following the commit (would have been at :04.675). What is this extra code (vc4_atomic_commit_tail), and why do several milliseconds elapse between the actual commit operation and the "commit tail" which seems to adjust some clocks and enable the actual vsync interrupt?

Performing the commit in blocking mode (which waits for the vsync before returning from the ioctl) may make things easier to trace...

egnor commented 1 year ago

According to https://github.com/raspberrypi/linux/issues/5094#issuecomment-1193814948, updating both displays in the same commit may sidestep this problem. (I'd need to test it.) That would require a bit of a complicated rework, though!