Igalia / meta-webkit

Yocto / OpenEmbedded layer for WebKit based engines and browsers
MIT License

meta-webkit uses higher CPU usage even after using HW Acceleration on imx8m mini quad #360

Closed Rutvij-dev closed 1 year ago

Rutvij-dev commented 2 years ago

Hi, I am using imx8m mini quad hardware with the latest BSP 5.10.72 (hardknott).

I have built meta-webkit with the following configuration in local.conf:

IMAGE_INSTALL_append = " wpewebkit cog"
PREFERRED_PROVIDER_virtual/wpebackend = "wpebackend-fdo"

After booting the system I am running the below command

cog --enable-media-stream=1 --platform=wl --set-permissions=all --enable-accelerated-2d-canvas=1 --enable-webgl=1 https://browserbench.org/JetStream/ &

My observations are:

On YouTube, 720p video plays smoothly at 30 FPS with no artifacts; however, the CPU usage seems unjustifiably high.

Can anyone suggest?

clopez commented 2 years ago

It looks like accelerated hardware decoding of video is not working for you.

Can you try to play a 720p video with gstreamer directly? Use the command gst-play-1.0 for that.

If gstreamer also uses more CPU than expected to play the video, this indicates that the VPU decoder drivers are not correctly configured on your system and/or that the gstreamer plugins required for HW accelerated video decoding are missing.
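To put a number on "more CPU than expected", one option is to sample a process's CPU time from /proc while the video plays. The sketch below assumes a Linux /proc filesystem and a whole-second sampling interval; `proc_ticks` and `cpu_percent` are hypothetical helper names, not part of gstreamer or cog.

```shell
#!/bin/sh
# Sum of user + system CPU ticks consumed so far by one process (all its threads).
proc_ticks() {
    stat=$(cat "/proc/$1/stat") || return 1
    # Drop "pid (comm) " so a command name containing spaces cannot shift fields.
    rest=${stat##*) }
    # After the comm field, utime and stime are the 12th and 13th fields.
    set -- $rest
    echo $(( ${12} + ${13} ))
}

# Average CPU usage of a PID over an interval, in percent of one core
# (so 200 means two fully busy cores, as in top's per-process column).
cpu_percent() {
    pid=$1
    interval=${2:-5}
    hz=$(getconf CLK_TCK)
    t0=$(proc_ticks "$pid") || return 1
    sleep "$interval"
    t1=$(proc_ticks "$pid") || return 1
    echo $(( (t1 - t0) * 100 / (hz * interval) ))
}
```

From a second shell, something like `cpu_percent "$(pidof gst-play-1.0)" 10` would give a 10-second average. WPE runs its web and network code in separate processes, so when measuring cog each PID has to be sampled on its own.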

AFAIK, on i.MX platforms this requires enabling the Kernel CODA drivers (and firmware) to access the decoder and then installing the right gstreamer plugins (gstreamer1.0-plugins-imx gstreamer1.0-plugins-imx-meta if using the vivante/proprietary stack or gstreamer1.0-plugins-good if using the open-source/etnaviv stack).

Here we have some doc, but it is outdated and is for the previous generation (i.MX6) so likely some things have changed

Rutvij-dev commented 2 years ago

Hi @clopez,

Thank you for responding quickly.

Using GStreamer with the hardware plugins, CPU usage is only 13%. The pipeline uses the hardware decoder (v4l2h264dec) and the GPU (glimagesink), without DMABUF io-mode:

gst-launch-1.0 -v filesrc location=/home/root/output.mp4 ! qtdemux name=dd.video_0 ! queue ! h264parse ! v4l2h264dec ! imxvideoconvert_g2d ! queue ! glimagesink &

From the logs I confirmed that the VPU is being used, and with the GPUTOP utility I confirmed that the GPU is being used.

On the kernel side, the CODA driver supports a different family (imx6); on the imx8m mini we have the hantro_845_h1 driver. With GStreamer, CPU usage is low (< 20%), which is expected.

Now, running cog again:

cog --enable-media-stream=1 --platform=wl --set-permissions=all --enable-accelerated-2d-canvas=1 --enable-webgl=1 https://browserbench.org/JetStream/ &

While the cog command is running, these are the VPU logs I get in dmesg:

[ 370.106602] mxc_hantro_845 38300000.vpu_g1: Reserve DEC Core, format = 11
[ 370.106608] mxc_hantro_845 38300000.vpu_g1: GetDecCoreID=1
[ 370.106626] mxc_hantro_845 38300000.vpu_g1: ioctl cmd 0x40086b09
[ 370.106693] mxc_hantro_845 38300000.vpu_g1: flushed registers on Core 1
[ 370.106705] mxc_hantro_845 38300000.vpu_g1: ioctl cmd 0xc0086b0f
[ 370.106713] mxc_hantro_845 38300000.vpu_g1: wait_event_interruptible DEC[1]
[ 370.108102] mxc_hantro_845 38300000.vpu_g1: decoder IRQ received! Core 1
[ 370.108266] mxc_hantro_845 38300000.vpu_g1: ioctl cmd 0x40086b09
[ 370.108336] mxc_hantro_845 38300000.vpu_g1: flushed registers on Core 1
[ 370.108343] mxc_hantro_845 38300000.vpu_g1: ioctl cmd 0x00006b0c
[ 370.108349] mxc_hantro_845 38300000.vpu_g1: Release DEC, Core = 1
[ 370.146063] mxc_hantro_845 38300000.vpu_g1: ioctl cmd 0x00006b0b
[ 370.146076] mxc_hantro_845 38300000.vpu_g1: Reserve DEC Core, format = 11
[ 370.146081] mxc_hantro_845 38300000.vpu_g1: GetDecCoreID=1
[ 370.146097] mxc_hantro_845 38300000.vpu_g1: ioctl cmd 0x40086b09
[ 370.146159] mxc_hantro_845 38300000.vpu_g1: flushed registers on Core 1
[ 370.146171] mxc_hantro_845 38300000.vpu_g1: ioctl cmd 0xc0086b0f
[ 370.146178] mxc_hantro_845 38300000.vpu_g1: wait_event_interruptible DEC[1]
[ 370.147525] mxc_hantro_845 38300000.vpu_g1: decoder IRQ received! Core 1
[ 370.147639] mxc_hantro_845 38300000.vpu_g1: ioctl cmd 0x40086b09
[ 370.147702] mxc_hantro_845 38300000.vpu_g1: flushed registers on Core 1
[ 370.147707] mxc_hantro_845 38300000.vpu_g1: ioctl cmd 0x00006b0c
[ 370.147712] mxc_hantro_845 38300000.vpu_g1: Release DEC, Core = 1
[ 370.186407] mxc_hantro_845 38300000.vpu_g1: ioctl cmd 0x00006b0b
[ 370.186422] mxc_hantro_845 38300000.vpu_g1: Reserve DEC Core, format = 11
[ 370.186428] mxc_hantro_845 38300000.vpu_g1: GetDecCoreID=1
[ 370.186447] mxc_hantro_845 38300000.vpu_g1: ioctl cmd 0x40086b09
[ 370.186514] mxc_hantro_845 38300000.vpu_g1: flushed registers on Core 1
[ 370.186526] mxc_hantro_845 38300000.vpu_g1: ioctl cmd 0xc0086b0f
[ 370.186533] mxc_hantro_845 38300000.vpu_g1: wait_event_interruptible DEC[1]
[ 370.187883] mxc_hantro_845 38300000.vpu_g1: decoder IRQ received! Core 1
[ 370.188012] mxc_hantro_845 38300000.vpu_g1: ioctl cmd 0x40086b09
[ 370.188079] mxc_hantro_845 38300000.vpu_g1: flushed registers on Core 1
[ 370.188087] mxc_hantro_845 38300000.vpu_g1: ioctl cmd 0x00006b0c
[ 370.188093] mxc_hantro_845 38300000.vpu_g1: Release DEC, Core = 1
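The timestamps in this dmesg output can be turned into a decode cadence, which is a quick sanity check that the VPU is handling every frame. The following is only a sketch: `irq_interval_ms` is a hypothetical helper, and it assumes dmesg prints one entry per line with the timestamp in leading square brackets.

```shell
#!/bin/sh
# Read dmesg lines on stdin and print the mean time between
# "decoder IRQ received" entries, in milliseconds.
irq_interval_ms() {
    awk '/decoder IRQ received/ {
        # The timestamp is the number inside the leading [ ... ].
        ts = $0
        sub(/^\[ */, "", ts)
        sub(/\].*/, "", ts)
        if (n > 0) total += ts - prev
        prev = ts
        n++
    }
    END { if (n > 1) printf "%.1f\n", total / (n - 1) * 1000 }'
}
```

Run as `dmesg | irq_interval_ms`. Fed the three IRQ entries from the log above, it prints 39.9, i.e. roughly 25 decoded frames per second. Each Reserve/Release pair in that log also spans less than 2 ms, which suggests the decoder itself is idle most of the time.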

I can see that the GPU and VPU are being used; however, CPU usage is still very high (~200%), and this is the part that is not clear to me. And of course, even with this setup the JetStream2 benchmark fails.

Are there any debug pointers available to dig into this issue? Is there a porting document or some other reference we can use to investigate?

Which processing stage could be demanding this much CPU?

Appreciate your answer.

clopez commented 2 years ago

The issue may be related to the negotiation of the plugins in the gstreamer pipeline.

Can you try playing the video with this command:

gst-play-1.0 --videosink glimagesink /path/to/video.mp4

(without forcing any specific plugins on the gstreamer pipeline)

Does it work as expected?

If that is the case then please attach the gstreamer logs and dot files when using cog, see here how to generate those logs

If the previous command doesn't work as expected then this may indicate that the default path gstreamer takes to decode the video is a non-accelerated one. Generating some logs and looking at which plugins are negotiated by default and why may help
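As a rough sketch of how such logs can be captured: GStreamer reads the GST_DEBUG, GST_DEBUG_FILE and GST_DEBUG_DUMP_DOT_DIR environment variables, so a small wrapper can set them for a single run. `run_with_gst_logs` is a hypothetical helper name, the debug level 3 default is an arbitrary choice, and whether every cog child process honours these variables is assumed here rather than verified.

```shell
#!/bin/sh
# Run a command with GStreamer logging and pipeline graph dumps enabled.
# Usage: run_with_gst_logs OUTPUT_DIR COMMAND [ARGS...]
run_with_gst_logs() {
    out_dir=$1
    shift
    mkdir -p "$out_dir/dot" || return 1
    # The variables apply only to this command and are inherited by its children.
    GST_DEBUG_DUMP_DOT_DIR="$out_dir/dot" \
    GST_DEBUG="${GST_DEBUG:-3}" \
    GST_DEBUG_NO_COLOR=1 \
    GST_DEBUG_FILE="$out_dir/gst.log" \
    "$@"
}
```

For example, `run_with_gst_logs /tmp/coglogs cog --platform=wl https://example.org/video.html` (a placeholder URL) should leave gst.log and a dot/ directory under /tmp/coglogs.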

Rutvij-dev commented 2 years ago

Hi @clopez,

Thank you for the direction.

I used this command as mentioned, and below are my observations:

gst-play-1.0 --videosink glimagesink /path/to/video.mp4

The gstreamer logs and dot files captured while using cog, generated as per the guidance link, are attached in log.zip.

The VPU usage logs captured while running the above gst command are attached in vpu_log.txt.

Hope this suffices to get more in-depth details,

Appreciate your further guidance.

Rutvij-dev commented 2 years ago

Hi @clopez, any debug pointers from the logs?

clopez commented 2 years ago

If you check the dot files in the cog log you can see how the pipelines are built:

sudo apt-get install graphviz
# unpack zip with logs and convert the .dot to .pngs with
for x in *.dot; do dot -Tpng "$x" > "$x.png"; done
# open all pngs
firefox *.png

You can see that your pipeline is negotiating webm video instead of mp4; it seems to be using mp4 only for the audio, not for the video.

In gst.log you can also see this by grepping for vp9. It ends up decoding the video with the non-hardware-accelerated v4l2vp9dec0 gstreamer plugin.
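A quick way to see which decoder elements show up in such a log is to count their names. This is a sketch only: `list_decoders` is a hypothetical helper, and the name patterns (v4l2...dec, avdec_..., vpudec, vp8dec/vp9dec) are assumptions about what may be installed, not a complete list.

```shell
#!/bin/sh
# Count mentions of common decoder element names in a GStreamer log file,
# most frequent first.
list_decoders() {
    grep -oE '(v4l2[a-z0-9]+dec|avdec_[a-z0-9_]+|vpudec|vp[89]dec)' "$1" \
        | sort | uniq -c | sort -rn
}
```

Running `list_decoders gst.log` on the cog log would show at a glance whether a vp9 decoder or an h264 decoder was picked.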

So the server that sends that video to your browser is for some reason sending webm/vp9 video instead of h.264. Perhaps it is related to the user-agent that cog sends to the server? (just a wild guess)

Questions:

clopez commented 2 years ago

Another thing worth trying:

Rutvij-dev commented 2 years ago

Hi @clopez,

Yes, you were right that it was using vp9 decoding, so I disabled that and enabled h264.

But even after doing this, I am seeing the same high CPU usage. I looked again at the .dot files for the new setup (attached here); this time they show h264, but I am not sure which factor causes the high CPU usage.

h264-dot-file.zip

philn commented 2 years ago

Can you generate pipeline graph dumps for cog please?

Rutvij-dev commented 2 years ago

Hi @clopez,

Apologies for the delayed response. I continued debugging and tried the pipelines below:

gst-launch-1.0 -v filesrc location=/home/root/BBB-1080p-30fps.mp4 ! qtdemux ! h264parse ! vpudec ! imxvideoconvert_g2d ! waylandsink --> CPU 7% (expected)

Then I replaced the sink with glimagesink:

gst-launch-1.0 -v filesrc location=/home/root/BBB-720p-30-fps.mp4 ! qtdemux ! h264parse ! vpudec ! imxvideoconvert_g2d ! glimagesink --> CPU 30%

Next I set up a sample server with an HTML file and played the video from that server in cog, without YouTube, to avoid vp8/9:

cog 192.168.200.40 --> CPU 50% (same BBB-720p-30fps from local server) - yes H/W VPU with H264

This was strange, so I dug further and found the webkitglvideosink plugin along with videoscale and videoconvert operations.

So I believe webkitglvideosink doesn't have the same optimizations as waylandsink, or maybe those scaling and conversion operations are heavy.

Can you suggest a way to use waylandsink instead of webkitglvideosink, or a way to remove the unwanted scaling and conversion operations?

Appreciate your guidance.

-- Thanks

Rutvij-dev commented 2 years ago

Hi @clopez ,

WebKit uses playbin internally, which picks the highest-ranked gstreamer plugin for the media type detected by GstTypeFindElement.
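Because the choice depends on element ranks, it can help to look at them and to override them for a single run. The sketch below assumes GStreamer 1.18 or newer, which reads the GST_PLUGIN_FEATURE_RANK environment variable; `show_rank` and `with_ranks` are hypothetical helper names.

```shell
#!/bin/sh
# Print the rank line that gst-inspect-1.0 reports for an element.
show_rank() {
    gst-inspect-1.0 "$1" | grep -i 'rank'
}

# Run a command with temporary rank overrides.
# Usage: with_ranks "element:RANK[,element:RANK...]" COMMAND [ARGS...]
with_ranks() {
    ranks=$1
    shift
    GST_PLUGIN_FEATURE_RANK="$ranks" "$@"
}
```

For instance, `with_ranks "vpudec:MAX" cog No-Audio-BBB-720p-30-fps.mp4` would ask playbin to prefer vpudec for that run only; which elements are worth promoting or demoting depends on what the BSP ships.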

Now I did 2 experiments:

  1. Played a video from the local filesystem in cog (without audio): No-Audio-BBB-720p-30-fps.mp4

    $ cog No-Audio-BBB-720p-30-fps.mp4
    Output: CPU 23-26%
    Attachment: Case-1-cog-local-fileplay.zip

  2. Built a similar pipeline with playbin, forcing the same plugins that cog uses but with waylandsink (replacing it with glimagesink gives ~21% CPU, strange!)

    $ gst-launch-1.0 playbin video-sink="qtdemux ! vpudec ! videoconvert ! videoscale ! imxvideoconvert_g2d ! waylandsink" uri=file:///home/root/No-Audio-BBB-720p-30-fps.mp4
    Output: CPU 8-10%
    Attachment: Case-2 - gstreamer-playbin-waylandsink-localplay.zip

The only difference I can see is webkitglvideosink and the video conversion operation, which happens in both cases (correct me if I missed anything in my analysis of the .dot files).

For this experiment I removed vp8/9, audio processing, and network processing elements to narrow the debugging scope.

Attaching case-1 and case-2 study .dot files.

But I am still not sure which plugin is causing this much CPU usage.

Appreciate your guidance here.

Thanks Rutvij

clopez commented 2 years ago

WPE will always use glimagesink as gstreamer sink. It decodes the video to an OpenGL surface that then is used internally to composite the video as part of the web page.

So if you are seeing similar CPU usage using cog/wpe vs using gst-launch-1.0 with glimagesink as sink then there is no issue on cog itself, it is working as designed (AFAIK).

What I don't know is why your system is using more CPU to render the video to an OpenGL surface (glimagesink) than to a wayland surface. Seems something is not optimized there, but that is outside of the cog/wpe scope.

I suggest investigating what is causing glimagesink to use more CPU; it really shouldn't be using more.

An idea that comes to mind is that you can maybe try using cog to render directly to the framebuffer (without using wayland at all). For doing that try starting cog with the parameter -P drm from a text console with wayland stopped. Example:

cog -P drm https://people.igalia.com/clopez/wkbug/video/simplevideo.html

Note: you have to enable the PACKAGECONFIG option drm for cog at build-time

philn commented 2 years ago

WPE will always use glimagesink as gstreamer sink.

No, WPE doesn't use glimagesink. It has a custom GL sink, but its behavior is similar to glimagesink. That's why we ask folks to first make sure glimagesink works in a standalone gst-play-1.0 pipeline.

Rutvij-dev commented 2 years ago

Hi @clopez,

My cog build flags include

-DCOG_PLATFORM_DRM=ON

cmake -DCMAKE_NO_SYSTEM_FROM_IMPORTED=1 -DCOG_DBUS_SYSTEM_BUS=OFF -DCOG_PLATFORM_DRM=ON -DCOG_PLATFORM_HEADLESS=OFF -DCOG_WESTON_DIRECT_DISPLAY=OFF -DCOG_PLATFORM_WL=ON -Wno-dev

However, running with -P drm shows the error below:

(cog:16557): Cog-WARNING **: 15:24:34.773: Platform setup failed: Failed to initialize DRM
(cog:16557): CRITICAL **: 15:24:34.775: WebKitWebViewBackend webkit_web_view_backend_new(wpe_view_backend, GDestroyNotify, gpointer): assertion 'backend' failed
(cog:16557): Cog-ERROR **: 15:24:34.775: Could not instantiate any WPE backend.
Trace/breakpoint trap

I guess I am missing something but I am not sure what. Can you guide me?

@philn, thanks for the input.

gst-play-1.0 --videosink glimagesink No-Audio-BBB-720p-30-fps.mp4 - CPU 17%

cog No-Audio-BBB-720p-30-fps.mp4 - CPU 24%

Any comments?

-- Thanks Rutvij

philn commented 2 years ago

Would need to be profiled. Not sure why the imxvideoconvert_g2d element is doing a conversion to RGBx, that doesn't look right, likely a perf bottleneck.
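One way to check where a conversion such as RGBx enters the pipeline is to list the raw video formats mentioned in each pipeline graph dump. This sketch assumes the .dot files print caps fields as "format: NAME", which is how GStreamer's graph dumps usually render them; `dot_formats` is a hypothetical helper.

```shell
#!/bin/sh
# For each .dot file given, count the "format:" caps values it mentions.
# The leading space in the pattern skips fields such as "stream-format".
dot_formats() {
    for f in "$@"; do
        echo "== $f"
        grep -oE ' format: [A-Za-z0-9_]+' "$f" | sort | uniq -c
    done
}
```

Comparing `dot_formats *.dot` between the cog dumps and the gst-launch-1.0 dumps would show whether RGBx appears only in one of the two setups.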

Rutvij-dev commented 2 years ago

Hi @philn,

Thanks for the reply.

Yes, certainly; I suspected the same here: https://github.com/Igalia/meta-webkit/issues/360#issuecomment-1198317024 So basically I see two pain points:

  1. glimagesink vs waylandsink: by default glimagesink uses noticeably more CPU than waylandsink.
  2. Why is the video conversion in imxvideoconvert_g2d happening on the CPU?

I will investigate them from my end, and I would also appreciate it if you could suggest some debug areas to speed this up.

@clopez, your feedback is also welcome.

Thanks

clopez commented 2 years ago

However running with the -P drm shows below error,

(cog:16557): Cog-WARNING **: 15:24:34.773: Platform setup failed: Failed to initialize DRM
(cog:16557): CRITICAL **: 15:24:34.775: WebKitWebViewBackend webkit_web_view_backend_new(wpe_view_backend, GDestroyNotify, gpointer): assertion 'backend' failed
(cog:16557): Cog-ERROR **: 15:24:34.775: Could not instantiate any WPE backend.
Trace/breakpoint trap

I guess I am missing something but not sure what, can you guide ?

For using the drm backend your system needs a GPU driver that supports DRI/KMS. You should see a device named /dev/dri/card0 and another one named /dev/dri/renderD128.
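A minimal check for those two device nodes could look like the sketch below. `check_drm_nodes` is a hypothetical helper; the optional directory argument exists only so that the check can be pointed somewhere other than /dev/dri.

```shell
#!/bin/sh
# Report whether the DRM nodes the drm platform relies on are present.
# Returns 0 if both exist, 1 otherwise.
check_drm_nodes() {
    dri_dir=${1:-/dev/dri}
    missing=0
    for node in card0 renderD128; do
        if [ -e "$dri_dir/$node" ]; then
            echo "found $dri_dir/$node"
        else
            echo "missing $dri_dir/$node"
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_drm_nodes` on the target before trying `cog -P drm` tells you quickly whether the GPU driver exposes DRI/KMS at all.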

I guess you are using the proprietary Vivante driver, and this driver doesn't support that (AFAIK).

You would have to switch your build to use the open source drivers (Etnaviv).

There is a wpe backend for the proprietary Vivante driver in wpebackend-rdk (see packageconfig option imx6), but I'm not sure if it still works; it has been quite a long time since I used it.

My advice would be that you try a build with the open source stack (Etnaviv GPU driver + gstreamer-v4l2 as hardware-enabled decoder and CODA kernel driver) to see if that works better.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been stale for 7 days with no activity.