pcmoore opened this issue 7 years ago (status: Open)
I'm not sure this calls for something like the kernel's lib/rhashtable.c implementation. Since the AVC is a cache and I expect size adjustments to be rare, we can probably get away with throwing out the old table and replacing it with a new, empty one.
Do you really need to change the number of buckets, or just the threshold/max number of cache entries? The latter can already be tuned via /sys/fs/selinux/avc/cache_threshold. Do we have some data, e.g. cat /sys/fs/selinux/avc/hash_stats, from these systems?
I'm hearing of systems that have bumped the threshold up to ~65k and are hitting that limit; the resulting lengthy per-bucket chains are causing spikes in CPU usage in avc_has_perm().
Why would we end up with that many unique AVC entries? Most container accesses would be within the same category set (i.e. intra-container) and to a handful of types (mostly container or svirt types). So they shouldn't yield that many unique (source context, target context, target class) triples.
Imagine thousands of containers on a single system.
Even with thousands of containers, most accesses should be intra-container, so I wouldn't expect that many unique AVC entries; AVC entries are only ever created for actual permission checks, not potential ones. That said, given the number of unique security classes, I could see a definite multiplying factor to just represent a container's access to all file classes, many socket classes, etc. That's another area for possible improvement, i.e. allowing a single AVC entry and security server computation to represent multiple classes so that if the same permissions are allowed to e.g. all file classes, we can store that once in the AVC.
Most people will never run more than 100 containers. Eventually we might scale beyond 100, but I think on OpenShift right now we are only handling ~50 containers, so this would be 50 process types and 50 object types (maybe a few more).
Agreed with the "most people" comment. OpenShift supports up to 250 containers per node right now, and we are going to try to double that by fall of this year. We currently have close to 100 per node in a variety of environments, though; 100 is pretty common.
Then I don't see why we'd be increasing the AVC cache threshold to 64k; that's just making the cache slow for no benefit.
At present the number of AVC hash buckets is hard-coded to 512; we should look into making this tunable at runtime. While 512 buckets tends to work well for most workloads, it is proving to be too small for systems with a large number of unique labels, such as container hosts using MCS/sVirt.