Closed: cnuke closed this issue 1 year ago.
Thanks @cnuke for the great analysis and writeup. That's a huge win! We should definitely go for it if there are no security risks attached to it.
Fortunately, the cache maintenance operations are behind our ABI - so we could transparently change the implementation. @skalk, what do you think?
Commits 1cdd08e, 702dc98 and e919275 illustrate the basic idea by putting it into the base library. I am not particularly satisfied with the concrete implementation, though (especially how to acquire proper knowledge of the cache-line size, and the duplicated code).
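One way to avoid hard-coding the cache-line size would be to decode it at run time from `CTR_EL0`. A hypothetical sketch (the helper names are mine, not from the commits; the field encoding follows the ARM ARM: `IminLine` in bits [3:0] and `DminLine` in bits [19:16], each the log2 of the smallest line size in 4-byte words):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers decoding the ARMv8 cache-type register CTR_EL0.
// IminLine (bits [3:0]) and DminLine (bits [19:16]) each hold the log2 of
// the smallest cache-line size in 4-byte words, so bytes = 4 << field.
constexpr std::size_t icache_line_bytes(std::uint64_t ctr) {
    return std::size_t(4) << (ctr & 0xf);
}

constexpr std::size_t dcache_line_bytes(std::uint64_t ctr) {
    return std::size_t(4) << ((ctr >> 16) & 0xf);
}

// On hardware the value would come from "mrs %0, ctr_el0". For instance, a
// Cortex-A53 reports CTR_EL0 = 0x84448004, i.e. 64-byte I- and D-cache lines.
```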
@cnuke @nfeske I've finally reviewed this issue again and reworked the commit, which splits the cache maintenance between the different architectures once again. Now, it does not enter the kernel at all on x86 and RISC-V when using base-hw; it only enters the kernel on ARMv6/7, and on ARMv8 only for cache-line invalidation, which cannot be done in unprivileged mode. In contrast to the original commit, it executes the `cache_coherent` maintenance in user-land too. This is especially relevant in the JavaScript context, and any other JIT context. However, I have not executed a real-world scenario for this JIT use case yet. Maybe you can test it in the Morph browser context?!
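For reference, the user-land side of such maintenance boils down to walking a memory range in cache-line steps and issuing the maintenance instruction once per line (e.g. `dc cvau`/`ic ivau` for coherency on ARMv8). A host-testable sketch of just that loop structure, with the instruction abstracted into a callback (names are mine, not from the commits):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// Sketch of the per-line traversal behind user-level cache maintenance:
// align the start address down to a line boundary, then apply 'op' once per
// cache line covering [base, base + size). On ARMv8 hardware 'op' would be
// an inline-asm "dc cvau" / "ic ivau" on the address; using a callback
// keeps the structure testable off-target.
inline void for_each_cache_line(std::uintptr_t base, std::size_t size,
                                std::size_t line_bytes,
                                std::function<void(std::uintptr_t)> const &op)
{
    std::uintptr_t const end = base + size;
    for (std::uintptr_t addr = base & ~std::uintptr_t(line_bytes - 1);
         addr < end; addr += line_bytes)
        op(addr);
}
```

With 64-byte lines, a single 4 KiB page already takes 64 iterations, which hints at why round-tripping through the kernel one page at a time for multi-MiB buffers gets expensive.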
@skalk thanks for the cleanup, I'll give it a try.
Thanks @skalk. I very much appreciate this change!
I just tested the commits with Morph on the PinePhone. It works well. Subjectively, the change seems to improve the responsiveness of the browser.
Merged to staging.
TL;DR: executing `dc civac` in the driver rather than the kernel increases the FPS in the `eglgears` test from 185 to 290 and decreases the CPU time spent by the driver from 51% to 29%. Allowing it from user space seems worthwhile for such use-cases - is this something we should entertain?

--

Running the `gpu_imx8q_evk.run` scenario with `eglgears` shows that a large amount of CPU time (around 51%) is spent by the driver and the FPS are capped at around 185. I used the `cpu_sampler` with a sample rate of 100 ms to get a rough estimate of where the driver spends its time. As it turned out, it spends its time mostly in `Kernel::call(arg0, arg1, arg2)`, so I added some instrumentation to `Thread::_call` to collect the frequency of the performed system-calls. Call 14, `cache_clean_inv_region`, was the most frequent one by a wide margin. As the GPU buffer-objects are allocated from cached memory, with explicit cache management, this is not entirely unexpected. However, buffer-objects can be large, a few MiBs in case of the frame-buffer, and the system-call operates on one 4 KiB memory page at a time.

To test whether this could be a problem, I allowed for EL0 access by setting the `UCI` bit in `Sctlr_el1` to perform the operation in the driver. Indeed, its effect was quite noticeable, as the FPS increased from 185 to 290 and the CPU time dropped from 51% to 29% (`glmark2` exhibits similar improvements).
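For context, the EL0 enablement mentioned above amounts to a single bit: `SCTLR_EL1.UCI` (bit 26 per the ARM ARM) permits EL0 execution of cache maintenance to the point of unification/coherency, including `dc civac`. A minimal sketch of the bit arithmetic only (the actual register write, `msr sctlr_el1`, must of course happen in the kernel; the names below are mine):

```cpp
#include <cstdint>

// SCTLR_EL1.UCI (bit 26) enables EL0 execution of cache maintenance
// instructions such as "dc civac". Only the bit arithmetic is modeled
// here; the real read-modify-write uses mrs/msr on sctlr_el1 at EL1.
constexpr std::uint64_t Sctlr_uci = std::uint64_t(1) << 26;

constexpr std::uint64_t allow_el0_cache_maintenance(std::uint64_t sctlr) {
    return sctlr | Sctlr_uci;
}
```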