Closed: cnuke closed this issue 1 year ago.
Thanks @cnuke for the great analysis and writeup. That's a huge win! We should definitely go for it if there are no security risks attached to it.
Fortunately, the cache maintenance operations are behind our ABI - so we could transparently change the implementation. @skalk, what do you think?
Commits 1cdd08e, 702dc98 and e919275 illustrate the basic idea by putting it into the base library. I am not particularly satisfied with the concrete implementation, though (especially how to acquire proper knowledge of the cache-line size, and the duplicated code).
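One way to avoid hard-coding the cache-line size would be to decode it at run time from `CTR_EL0`. A hypothetical sketch (the helper names are mine, not from the commits; the field encoding follows the ARM ARM: `IminLine` in bits [3:0] and `DminLine` in bits [19:16], each the log2 of the smallest line size in 4-byte words):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers decoding the ARMv8 cache-type register CTR_EL0.
// IminLine (bits [3:0]) and DminLine (bits [19:16]) each hold the log2 of
// the smallest cache-line size in 4-byte words, so bytes = 4 << field.
constexpr std::size_t icache_line_bytes(std::uint64_t ctr) {
    return std::size_t(4) << (ctr & 0xf);
}

constexpr std::size_t dcache_line_bytes(std::uint64_t ctr) {
    return std::size_t(4) << ((ctr >> 16) & 0xf);
}

// On hardware the value would come from "mrs %0, ctr_el0". For instance, a
// Cortex-A53 reports CTR_EL0 = 0x84448004, i.e. 64-byte I- and D-cache lines.
```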
@cnuke @nfeske I've finally reviewed this issue again and reworked the commit, which splits the cache maintenance between the different architectures once again. Now, it does not enter the kernel at all on x86 and RISC-V when using base-hw; it only enters the kernel on ARMv6/7, and on ARMv8 only for cache-line invalidation, which cannot be done in unprivileged mode. In contrast to the original commit, it executes the `cache_coherent` maintenance in user-land too. This is especially relevant in the JavaScript context, and any other JIT context. However, I have not executed a real-world scenario for this JIT use case yet. Maybe you can test it in the Morph browser context?!
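For reference, the user-land side of such maintenance boils down to walking a memory range in cache-line steps and issuing the maintenance instruction once per line (e.g. `dc cvau`/`ic ivau` for coherency on ARMv8). A host-testable sketch of just that loop structure, with the instruction abstracted into a callback (names are mine, not from the commits):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// Sketch of the per-line traversal behind user-level cache maintenance:
// align the start address down to a line boundary, then apply 'op' once per
// cache line covering [base, base + size). On ARMv8 hardware 'op' would be
// an inline-asm "dc cvau" / "ic ivau" on the address; using a callback
// keeps the structure testable off-target.
inline void for_each_cache_line(std::uintptr_t base, std::size_t size,
                                std::size_t line_bytes,
                                std::function<void(std::uintptr_t)> const &op)
{
    std::uintptr_t const end = base + size;
    for (std::uintptr_t addr = base & ~std::uintptr_t(line_bytes - 1);
         addr < end; addr += line_bytes)
        op(addr);
}
```

With 64-byte lines, a single 4 KiB page already takes 64 iterations, which hints at why round-tripping through the kernel one page at a time for multi-MiB buffers gets expensive.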
@skalk thanks for the cleanup, I'll give it a try.
Thanks @skalk. I very much appreciate this change!
I just tested the commits with Morph on the PinePhone. It works well. Subjectively, the change seems to improve the responsiveness of the browser.
Merged to staging.
TL;DR: executing `dc civac` in the driver rather than the kernel increases the FPS in the `eglgears` test from 185 to 290 and decreases the CPU time spent by the driver from 51% to 29%. Allowing it from user space seems worthwhile for such use-cases - is this something we should entertain?

--

Running the `gpu_imx8q_evk.run` scenario with `eglgears` shows that a large amount of CPU time (around 51%) is spent by the driver and the FPS are capped at around 185. I used the `cpu_sampler` with a sample rate of 100 ms to get a rough estimate of where the driver spends its time. As it turned out, it spends its time mostly in `Kernel::call(arg0, arg1, arg2)`, so I added some instrumentation to `Thread::_call` to collect the frequency of the performed system-calls. Call 14, `cache_clean_inv_region`, was the most frequent one by a wide margin. As the GPU buffer-objects are allocated from cached memory, with explicit cache management, this is not entirely unexpected. However, buffer-objects can be large, a few MiBs in case of the frame-buffer, and the system-call operates on one 4 KiB memory page at a time.

To test whether this could be a problem, I allowed for EL0 access by setting the `UCI` bit in `Sctlr_el1` to perform the operation in the driver. Indeed, its effect was quite noticeable, as the FPS increased from 185 to 290 and the CPU time dropped from 51% to 29% (`glmark2` exhibits similar improvements).
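For context, the EL0 enablement mentioned above amounts to a single bit: `SCTLR_EL1.UCI` (bit 26 per the ARM ARM) permits EL0 execution of cache maintenance to the point of unification/coherency, including `dc civac`. A minimal sketch of the bit arithmetic only (the actual register write, `msr sctlr_el1`, must of course happen in the kernel; the names below are mine):

```cpp
#include <cstdint>

// SCTLR_EL1.UCI (bit 26) enables EL0 execution of cache maintenance
// instructions such as "dc civac". Only the bit arithmetic is modeled
// here; the real read-modify-write uses mrs/msr on sctlr_el1 at EL1.
constexpr std::uint64_t Sctlr_uci = std::uint64_t(1) << 26;

constexpr std::uint64_t allow_el0_cache_maintenance(std::uint64_t sctlr) {
    return sctlr | Sctlr_uci;
}
```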