QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

Cache "system info" structure for qrexec policy evaluation #9362

Open marmarek opened 2 months ago

marmarek commented 2 months ago


The problem you're addressing (if any)

Currently, qrexec-policy-daemon fetches info about all the qubes in the system every time a qrexec call is made. This structure intentionally does not include the runtime state of the system, only its configuration (the list of qubes and selected metadata for each), and as such it changes rarely. Constructing this structure takes a significant amount of time relative to the qrexec call setup time. In some tests, it took as much as 20ms out of 50ms total (on a system with many qubes).

The solution you'd like

Cache the "system info" structure, and invalidate this cache when any of the information changes. Theoretically, cache invalidation could be done selectively (for example, invalidating only the entries relevant to the changed information), but such an approach adds extra complexity and doesn't seem to be necessary. So it is better to invalidate the whole cached structure when something changes.

The cache could be implemented on the qrexec-policy-daemon side or on the qubesd side. The former might be more efficient, but the latter makes caching more reliable (especially the invalidation part). Tests show no meaningful difference between these two approaches (on average, a difference below 1ms), so use the safer approach.
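For illustration only, the whole-structure invalidation described above could look roughly like the sketch below. It is not qubesd code: `SystemInfoCache`, `build_system_info`, the `qubes` dict, and the `"domain-add"` event name are all hypothetical stand-ins, and the real structure and event hooks may differ.

```python
import threading
from typing import Any, Callable, Dict, Optional


class SystemInfoCache:
    """Caches one whole "system info" snapshot; any change drops all of it."""

    def __init__(self, build: Callable[[], Dict[str, Any]]) -> None:
        self._build = build  # hypothetical: the expensive constructor
        self._lock = threading.Lock()
        self._cached: Optional[Dict[str, Any]] = None
        self.rebuilds = 0  # counts how often the snapshot was rebuilt

    def get(self) -> Dict[str, Any]:
        """Return the cached snapshot, rebuilding it only if it was dropped."""
        with self._lock:
            if self._cached is None:
                self._cached = self._build()
                self.rebuilds += 1
            return self._cached

    def invalidate(self, *_event: Any) -> None:
        """Drop the whole snapshot; no per-entry bookkeeping is attempted."""
        with self._lock:
            self._cached = None


# Hypothetical stand-in for the configuration that the structure is built from.
qubes = {"work": {"type": "AppVM"}, "sys-net": {"type": "AppVM"}}


def build_system_info() -> Dict[str, Any]:
    # Copies the configuration only; no runtime state is included.
    return {"domains": {name: dict(meta) for name, meta in qubes.items()}}


cache = SystemInfoCache(build_system_info)
first = cache.get()   # builds the snapshot
second = cache.get()  # served from the cache, no rebuild

# A configuration change invalidates everything, whatever the change was.
qubes["vault"] = {"type": "AppVM"}
cache.invalidate("domain-add")
third = cache.get()   # rebuilt, now includes the new qube
```

In this sketch, every configuration-changing event would call `invalidate()`, so correctness depends only on hooking all such events, not on working out which entries a given change affects.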

The value to a user, and who that user might be

Quicker qrexec connection time, relevant for users using qrexec for a large number of connections.

Completion criteria checklist

(This section is for developer use only. Please do not modify it.)

DemiMarie commented 2 months ago

:+1: on invalidating the whole cached structure when anything changes and on storing the cache in qubesd. It is possible that someday these choices (especially the first one) will become a bottleneck, but that day is not today, unless someone posts a real-world workload (not a synthetic benchmark) for which this is a problem.