vc4-drm : Are we losing hardware-accelerated blitting?

vanfanel commented 8 years ago

Hi,

I have been adapting other people's popular programs (RetroArch, ScummVM, and many others) to use the dispmanx API for years now. However, dispmanx should, sooner or later, be superseded by KMS/DRM, so I am adapting RetroArch to it.

With dispmanx, blitting was easy. Let's say I want to copy a 256x256 rect from a 298x256 buffer to a dispmanx "resource" (buffer). Well, I have this GREAT dispmanx function for that:

vc_dispmanx_resource_write_data( DISPMANX_RESOURCE_HANDLE_T res, VC_IMAGE_TYPE_T src_type, int src_pitch, void * src_address, const VC_RECT_T * rect );

You see, I can pass a pitch, let's say 256*4 if I am blitting a pixel array with 4 bytes per pixel, and then it will be uploaded to the GPU without using the CPU for the transfer. Very fast and good solution!

But in KMS, using a dumb buffer and without a specific function for this, I would have to copy one line's worth of pixels at a time, so for 256 lines I would be calling memcpy() 256 times to achieve the same result. And that's for a small rect...

So, I have seen that /usr/include/drm contains some hardware-specific implementations of blitting functions. Is there something similar on the Pi? I don't mind using IOCTLs or whatever is needed...

anholt commented 8 years ago

If you're using KMS directly, why not just map your dumb buffer and write the data directly into it in the first place, going from 1 or 2 copies with your current implementation, down to 0 copies?

vanfanel commented 8 years ago

@anholt : I fear that is not possible due to how RetroArch works. This is the prototype of the function in which RetroArch does a screen refresh:

static bool kms_gfx_frame(void *data, const void *frame, unsigned width,
      unsigned height, uint64_t frame_count, unsigned pitch, const char *msg)

The function always has the same parameters. I mean, it's how RetroArch works internally: it passes me (I am on the KMS side of things) a pointer to the pixel array. Not exactly optimal for my interests, because it forces me to make at least ONE copy: from the internal RetroArch pixel array to my dumb buffers. These buffers are mapped already, so I simply memcpy() to them for now.

But that's not the problem now. The problem is blitting. As I said, in many cases RetroArch will send me a pixel array in which there are extra, not-meant-to-be-rendered pixels between scanlines. So I have to blit a rect extracted from another rect with a different pitch. That's what vc_dispmanx_resource_write_data() allowed me to do, since it accepted a pitch and copied only the visible part of each line. Now, without vc_dispmanx_resource_write_data(), I would have to iterate over each line in the source pixel array and copy only part of the line. That's a 256-iteration FOR loop per frame with the corresponding memcpy() in each iteration, and that is when RetroArch is rendering a very low resolution system or game. In a 640x480 game/system (like ScummVM running on libretro) I would have to do 480 memcpy() calls per frame, and so on. I am sure the hardware can do blitting, as it does with the dispmanx API. I really need a way to do that blitting (with a custom pitch) through the KMS/DRM system, to make up for the loss of vc_dispmanx_resource_write_data().
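
To make the per-line copy concrete, here is a minimal sketch of the loop described above. The function name blit_rect and its parameters are hypothetical (they are not part of RetroArch, dispmanx or libdrm); it only assumes that both buffers are plain CPU-accessible memory, as a mapped dumb buffer would be.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy a width x height rectangle of pixels between
 * two buffers whose pitches (bytes per row) differ. Only the visible
 * bytes of each row are copied; padding bytes are never touched. */
static void blit_rect(uint8_t *dst, size_t dst_pitch,
                      const uint8_t *src, size_t src_pitch,
                      size_t width, size_t height, size_t bytes_per_pixel)
{
    const size_t row_bytes = width * bytes_per_pixel;

    /* One memcpy() per scanline: 256 calls for a 256-line frame. */
    for (size_t y = 0; y < height; y++) {
        memcpy(dst + y * dst_pitch, src + y * src_pitch, row_bytes);
    }
}
```

In a frame callback like kms_gfx_frame(), src would be the frame pointer and src_pitch the pitch argument, while dst and dst_pitch would come from the mapped dumb buffer.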

anholt commented 8 years ago

Have you actually measured and found that the memcpy call overhead is a problem compared to a single memcpy? Because I bet you'll have a difficult time measuring a difference.

This should also be faster for the overall system than the dispmanx version. For shipping your pixels to dispmanx with that call, the kernel would need to pin the pages covering your area, look up their addresses and hand them across to the firmware, and then the firmware would do the loop of memcpys on the VPU. That's a lot of setup and communication overhead to get the same memcpy loop done on the same memory bus.

vanfanel commented 8 years ago

@anholt : Ok, made measurements and it's not that much really, at least on a Pi3 where memcpy() calls are faster. On a Pi1, there is a considerable performance penalty using this. However, I have seen these in my lsmod:

syscopyarea 2945 1 drm_kms_helper
sysfillrect 3443 1 drm_kms_helper
sysimgblt 2069 1 drm_kms_helper

syscopyarea? sysimgblt? Are you sure we don't have methods to do hardware blitting here? Maybe these are totally unrelated, but still I'd like to ask.

anholt commented 8 years ago

Those sys* are just for fbdev. We could probably accelerate fbdev using the dma engine, but nobody's built those helpers yet.

anholt commented 8 years ago

Actually, if you care about the number of memcpy loops, you could just make your drm_framebuffer have the stride that you want, and copy your whole buffer. vc4 doesn't have any restrictions on pitch alignment that I know of.
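
A rough sketch of that idea, in plain C so it does not depend on any particular DRM setup: when the destination stride equals the source pitch, the whole frame can go across in one memcpy(), and the row loop is only a fallback. The name upload_frame and its return value (the number of memcpy() calls made) are hypothetical and exist only for illustration. It assumes the framebuffer was created with a pitch equal to the source pitch and that row_bytes does not exceed either pitch; how the buffer is allocated with that pitch is not shown here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: upload a frame into a mapped buffer.
 * Returns the number of memcpy() calls it made. */
static size_t upload_frame(uint8_t *dst, size_t dst_pitch,
                           const uint8_t *src, size_t src_pitch,
                           size_t row_bytes, size_t height)
{
    if (height == 0) {
        return 0;
    }

    if (dst_pitch == src_pitch) {
        /* Strides match: copy everything at once, padding included.
         * The last row is copied only up to row_bytes so we never read
         * past the end of a source whose final row is not padded. */
        memcpy(dst, src, src_pitch * (height - 1) + row_bytes);
        return 1;
    }

    /* Strides differ: fall back to one memcpy() per scanline. */
    for (size_t y = 0; y < height; y++) {
        memcpy(dst + y * dst_pitch, src + y * src_pitch, row_bytes);
    }
    return height;
}
```

The padding pixels end up in the buffer too, but they sit outside the framebuffer's visible width, so they should not be scanned out.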
