tjerman opened 3 months ago
Trying out improving memory alignment
```
➜ server git:(2024.3.x-rbac-opt-v2) ✗ go test -benchmem -memprofile=upd_alignment_10mio.pprof -run=^$ -bench ^BenchmarkIndexBuild_10000000$ github.com/cortezaproject/corteza/server/pkg/rbac
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/rbac
BenchmarkIndexBuild_10000000-12    1    10459754625 ns/op    5192614672 B/op    71883998 allocs/op
PASS
ok      github.com/cortezaproject/corteza/server/pkg/rbac    13.720s
```
This is... kind of worse. The difference is negligible, so it's whatever:
10984989333/10459754625 = 1.05 (time), 5192429296/5192614672 = 0.99 (memory)
Removing some stuff from the structs we get...
```
➜ server git:(2024.3.x-rbac-opt-v2) ✗ go test -benchmem -memprofile=simpler_structs_10mio.pprof -run=^$ -bench ^BenchmarkIndexBuild_10000000$ github.com/cortezaproject/corteza/server/pkg/rbac
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/rbac
BenchmarkIndexBuild_10000000-12    1    10744250416 ns/op    5028433528 B/op    71887240 allocs/op
PASS
ok      github.com/cortezaproject/corteza/server/pkg/rbac    14.100s
```
10984989333/10744250416 = 1.02 (time), 5192429296/5028433528 = 1.03 (memory)
Marginally better, but the bottom line is the same. Makes sense, since it was mostly just some pointers. Mayhaps values over pointers would work?
Using values we get...
```
➜ server git:(2024.3.x-rbac-opt-v2) ✗ go test -benchmem -memprofile=no_pointers_10mio.pprof -run=^$ -bench ^BenchmarkIndexBuild_10000000$ github.com/cortezaproject/corteza/server/pkg/rbac
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/rbac
BenchmarkIndexBuild_10000000-12    1    10557726250 ns/op    6178684144 B/op    71876321 allocs/op
PASS
ok      github.com/cortezaproject/corteza/server/pkg/rbac    14.007s
```
10984989333/10557726250 = 1.04 (time), 5192429296/6178684144 = 0.84 (memory)
yeesh... yeah, no.
The TL;DR of the current change:
Theoretically, the most used bits will be in memory with fast access, and all the rest will be in the database, accessed on demand (since DB access is generally slower, we can't just go 100% database).
For now, things are indexed/grouped by role and not by resource. The most memory-efficient version would hold only the most used rules, but then we can't ensure that all the rules needed for an evaluation are ready (too many DB requests).
Indexing by most used resource vs most used role: I might implement both, as the underlying code shouldn't care too much about what thing is used. In my head, some users/roles probably do the majority of operations, so this is the most logical solution.
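The indexed-by-role idea in miniature; the types and the `fetch` callback are assumptions for illustration, not the actual corteza code:

```go
package main

import "fmt"

// Rule is a minimal stand-in for an RBAC rule; the fields are assumptions,
// not the actual corteza rbac.Rule type.
type Rule struct {
	RoleID    uint64
	Resource  string
	Operation string
	Allow     bool
}

// roleIndex keeps rules for the "hot" roles in memory, grouped by role ID;
// anything not indexed is fetched from the store on demand.
type roleIndex struct {
	byRole map[uint64][]Rule
	fetch  func(roleID uint64) []Rule // stands in for the database lookup path
}

// rulesFor returns the rules for a role and whether the index was hit.
func (ix *roleIndex) rulesFor(roleID uint64) ([]Rule, bool) {
	if rr, ok := ix.byRole[roleID]; ok {
		return rr, true
	}
	return ix.fetch(roleID), false
}

func main() {
	ix := &roleIndex{
		byRole: map[uint64][]Rule{
			1: {{RoleID: 1, Resource: "corteza::compose:record/", Operation: "read", Allow: true}},
		},
		fetch: func(roleID uint64) []Rule { return nil },
	}
	_, hit := ix.rulesFor(1)
	fmt.Println("indexed role hit:", hit) // prints indexed role hit: true
	_, hit = ix.rulesFor(2)
	fmt.Println("unindexed role hit:", hit) // prints unindexed role hit: false
}
```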
Now... to do some benchmarking and do it in a way which won't be a pain to do...
Right now, we're looking for these things:
- memory pressure when updating the index: this should be better, since we don't re-index everything anymore
- memory pressure when re-evaluating the entire index: this should be better, since we're holding a smaller chunk; currently there isn't some super special way of making it easier on memory
- performance for evaluating unindexed roles: this should be fairly slower because of the overhead
- performance for evaluating partially unindexed roles: worst case, it'll be the same as unindexed, so I might not benchmark this at all
Now, I'm not entirely sure how to do this so this is a lil test... I think I'll do it like this
- I'll set the repeat count to 1 since I want to see per-op memory consumption
- when looking at memory, I'll just look at the `BenchmarkRbacBase` function, since I'll make things relative
roles: 3, rules: 10k per role
Mem: 13989 kB
Mem: 19429 kB
Mem change: 5310 kB (19429/13989 ~= 1.38). The expected change would be 1.5, but there is some overhead from the relevant services.
But these benchmarks provide aggregate values... will that be okay if I want to see how well updating the index/re-indexing works?
A problem we're solving is that if we have a bunch of rules, re-indexing causes a big spike in memory (purely due to the sheer size). Is there even a real need to do memory profiling on this level? Since the design, as is, limits the number of rules, the memory consumption is predictable.
Trying out some code to report the memory state programmatically...
Copilot code, plus or minus some tweaks:
```go
func printMemStats(b *testing.B, label string) {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	b.Logf("%s: Alloc = %v MiB, TotalAlloc = %v MiB, Sys = %v MiB, NumGC = %v\n",
		label, bToMb(m.Alloc), bToMb(m.TotalAlloc), bToMb(m.Sys), m.NumGC)
}
```
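The `bToMb` (and, later, `bTokB`) helpers aren't shown in this thread; presumably they're simple conversions along these lines (assumed definitions, not the actual code):

```go
package main

import (
	"fmt"
	"runtime"
)

// Assumed byte-conversion helpers used by the snippets in this thread;
// the actual definitions aren't shown in the issue.
func bToMb(b uint64) uint64 { return b / 1024 / 1024 }
func bTokB(b uint64) uint64 { return b / 1024 }

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("Alloc = %v kB (%v MiB)\n", bTokB(m.Alloc), bToMb(m.Alloc))
}
```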
We get this:
```
rbac_rules_test.go:79: Before Benchmark: Alloc = 4415 kB, TotalAlloc = 8717 kB, Sys = 19361 kB, NumGC = 6
rbac_rules_test.go:79: Index built: Alloc = 7070 kB, TotalAlloc = 14054 kB, Sys = 24225 kB, NumGC = 8
rbac_rules_test.go:79: Index updated: Alloc = 9552 kB, TotalAlloc = 19112 kB, Sys = 24225 kB, NumGC = 9
rbac_rules_test.go:79: After Benchmark: Alloc = 9552 kB, TotalAlloc = 19115 kB, Sys = 24225 kB, NumGC = 10
```
So these numbers seem about the same as they did using pprof, but this is a bit nicer to work with (at least for me). These are post-GC numbers, though, so I'll probably tweak it like so:
```go
func printMemStats(b *testing.B, label string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	b.Logf("%s pre gc: Alloc = %v kB, TotalAlloc = %v kB, Sys = %v kB, NumGC = %v\n",
		label, bTokB(m.Alloc), bTokB(m.TotalAlloc), bTokB(m.Sys), m.NumGC)
	runtime.GC()
	b.Logf("%s post gc: Alloc = %v kB, TotalAlloc = %v kB, Sys = %v kB, NumGC = %v\n",
		label, bTokB(m.Alloc), bTokB(m.TotalAlloc), bTokB(m.Sys), m.NumGC)
}
```
With this I'll see the actual spike before anything gets cleaned up; so we get this:
```
rbac_rules_test.go:78: Before Benchmark pre gc: Alloc = 4413 kB, TotalAlloc = 8669 kB, Sys = 19873 kB, NumGC = 5
rbac_rules_test.go:82: Before Benchmark post gc: Alloc = 4413 kB, TotalAlloc = 8669 kB, Sys = 19873 kB, NumGC = 5
rbac_rules_test.go:78: Index built pre gc: Alloc = 7861 kB, TotalAlloc = 14010 kB, Sys = 24481 kB, NumGC = 7
rbac_rules_test.go:82: Index built post gc: Alloc = 7861 kB, TotalAlloc = 14010 kB, Sys = 24481 kB, NumGC = 7
rbac_rules_test.go:78: Index updated pre gc: Alloc = 12124 kB, TotalAlloc = 19071 kB, Sys = 24481 kB, NumGC = 8
rbac_rules_test.go:82: Index updated post gc: Alloc = 12124 kB, TotalAlloc = 19071 kB, Sys = 24481 kB, NumGC = 8
rbac_rules_test.go:78: After Benchmark pre gc: Alloc = 9551 kB, TotalAlloc = 19078 kB, Sys = 28833 kB, NumGC = 9
rbac_rules_test.go:82: After Benchmark post gc: Alloc = 9551 kB, TotalAlloc = 19078 kB, Sys = 28833 kB, NumGC = 9
```
We don't get much more info, but I suppose this scenario doesn't really have much to GC.
I added a `.Clear()` function to the service and I get this:
```
rbac_rules_test.go:80: Before Benchmark pre gc: Alloc = 4427 kB, TotalAlloc = 8737 kB, Sys = 19617 kB, NumGC = 5
rbac_rules_test.go:84: Before Benchmark post gc: Alloc = 4427 kB, TotalAlloc = 8737 kB, Sys = 19617 kB, NumGC = 5
rbac_rules_test.go:80: Index built pre gc: Alloc = 8064 kB, TotalAlloc = 14077 kB, Sys = 23969 kB, NumGC = 7
rbac_rules_test.go:84: Index built post gc: Alloc = 8064 kB, TotalAlloc = 14077 kB, Sys = 23969 kB, NumGC = 7
rbac_rules_test.go:80: Index updated pre gc: Alloc = 12137 kB, TotalAlloc = 19138 kB, Sys = 24225 kB, NumGC = 8
rbac_rules_test.go:84: Index updated post gc: Alloc = 12137 kB, TotalAlloc = 19138 kB, Sys = 24225 kB, NumGC = 8
rbac_rules_test.go:80: Index cleared pre gc: Alloc = 9568 kB, TotalAlloc = 19147 kB, Sys = 28577 kB, NumGC = 9
rbac_rules_test.go:84: Index cleared post gc: Alloc = 9568 kB, TotalAlloc = 19147 kB, Sys = 28577 kB, NumGC = 9
rbac_rules_test.go:80: After Benchmark pre gc: Alloc = 4449 kB, TotalAlloc = 19152 kB, Sys = 28577 kB, NumGC = 10
rbac_rules_test.go:84: After Benchmark post gc: Alloc = 4449 kB, TotalAlloc = 19152 kB, Sys = 28577 kB, NumGC = 10
```
Looking at this:
```
rbac_rules_test.go:84: Index updated post gc: Alloc = 12137 kB, TotalAlloc = 19138 kB, Sys = 24225 kB, NumGC = 8
rbac_rules_test.go:80: Index cleared pre gc: Alloc = 9568 kB, TotalAlloc = 19147 kB, Sys = 28577 kB, NumGC = 9
rbac_rules_test.go:84: Index cleared post gc: Alloc = 9568 kB, TotalAlloc = 19147 kB, Sys = 28577 kB, NumGC = 9
```
`runtime.GC()` does nothing... why?
... because I forgot to call `runtime.ReadMemStats(&m)` again after the GC.
...
So...
```go
func printMemStats(b *testing.B, label string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	b.Logf("%s pre gc: Alloc = %v kB, TotalAlloc = %v kB, Sys = %v kB, NumGC = %v\n",
		label, bTokB(m.Alloc), bTokB(m.TotalAlloc), bTokB(m.Sys), m.NumGC)
	runtime.GC()
	runtime.ReadMemStats(&m)
	b.Logf("%s post gc: Alloc = %v kB, TotalAlloc = %v kB, Sys = %v kB, NumGC = %v\n",
		label, bTokB(m.Alloc), bTokB(m.TotalAlloc), bTokB(m.Sys), m.NumGC)
}
```
gives:
```
rbac_rules_test.go:80: Before Benchmark pre gc: Alloc = 4430 kB, TotalAlloc = 8734 kB, Sys = 19873 kB, NumGC = 5
rbac_rules_test.go:85: Before Benchmark post gc: Alloc = 4429 kB, TotalAlloc = 8737 kB, Sys = 19873 kB, NumGC = 6
rbac_rules_test.go:80: Index built pre gc: Alloc = 7872 kB, TotalAlloc = 14072 kB, Sys = 23969 kB, NumGC = 7
rbac_rules_test.go:85: Index built post gc: Alloc = 7079 kB, TotalAlloc = 14078 kB, Sys = 23969 kB, NumGC = 8
rbac_rules_test.go:80: Index updated pre gc: Alloc = 12139 kB, TotalAlloc = 19138 kB, Sys = 24225 kB, NumGC = 8
rbac_rules_test.go:85: Index updated post gc: Alloc = 9565 kB, TotalAlloc = 19146 kB, Sys = 28833 kB, NumGC = 9
rbac_rules_test.go:80: Index cleared pre gc: Alloc = 9568 kB, TotalAlloc = 19148 kB, Sys = 28833 kB, NumGC = 9
rbac_rules_test.go:85: Index cleared post gc: Alloc = 4445 kB, TotalAlloc = 19149 kB, Sys = 28833 kB, NumGC = 10
rbac_rules_test.go:80: After Benchmark pre gc: Alloc = 4448 kB, TotalAlloc = 19152 kB, Sys = 28833 kB, NumGC = 10
rbac_rules_test.go:85: After Benchmark post gc: Alloc = 4445 kB, TotalAlloc = 19153 kB, Sys = 28833 kB, NumGC = 11
```
🎉
It might be worth looking into this change:
Instead of relying solely on roles, it might be worth indexing by resource and role. When I was discarding indexing by rules, the rationale was that many cases would still need database access (controlling memory consumption would be trivial then).
If we do roles and resources we could do the following:
Doing it like this would reduce the amount of rules while assuring fast lookups for common resources & roles.
Now... how do we change the change to do this...
Should we use two counters? One for roles, one for resources? Should we use one big index? I think it'll be a bit of a pain to figure out what rules to yank out of the index... would it be more sane to use multiple little indexes? One per resource might be a bit insane so let's consider one per role...
So we'd have the wrapper service with two counters; one for used roles and one for used resources. When some resource is determined to be in high demand, the lowest in demand resource would be yanked out and the new one would be indexed. The index would be updated with N most in-use roles.
Will it be a pain to know what to yank out? If we do multiple indexes, one per role, that'll be more of a pain. If we index per resource, we'll get some duplication which will also be a pain (I think)...
So I think that one big index will be enough; since we're dealing with resources, those aren't really hard to take out.
Would it make more sense to combine the role & resource and use the resulting string to track usage? It'll probably use a bit more space but I think it'll be easier to do; so let's go with this one for now.
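A sketch of that combined-key idea; the helper name and the separator are made up, not taken from the corteza codebase:

```go
package main

import (
	"fmt"
	"strconv"
)

// usageKey joins a role ID and a resource identifier into one string so a
// single counter can track how hot each (role, resource) pair is. The NUL
// separator is an assumption: corteza resource identifiers already contain
// "::" and "/", so a control character avoids accidental key collisions.
func usageKey(roleID uint64, resource string) string {
	return strconv.FormatUint(roleID, 10) + "\x00" + resource
}

func main() {
	counts := map[string]uint64{}
	counts[usageKey(42, "corteza::compose:record/1/2/3")]++
	counts[usageKey(42, "corteza::compose:record/1/2/3")]++
	counts[usageKey(7, "corteza::compose:record/1/2/3")]++

	// prints 2 distinct keys; hot pair seen 2 times
	fmt.Println(len(counts), "distinct keys; hot pair seen",
		counts[usageKey(42, "corteza::compose:record/1/2/3")], "times")
}
```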
But now then... how do we figure out what resources we have available? We can't really rely on analysing rules, since most of them won't be resource-specific (except in some scenarios). A cold start?
A combination of roles and resources? Going through some records, modules, ...? If we go down this route we'd definitely need to store this somewhere -- it's kind of expensive to do on each cold start...
So... this is what we'll do:
The design will be as I've originally envisioned it, but instead of tracking usage based on roles (to see what needs to get indexed), we'll use roles & resources. Doing it like so should provide more control over how many RBAC rules and which specifically we keep in memory.
The only scenario (that I could think of) where using roles might be better is when each user has their own role (or a small set) and each role has a modest amount of RBAC rules attached (there isn't much composition going on). For this scenario, using roles, we'd probably get more index hits and less database usage.
With that said, mixing in resources into the equation we address a lot of the issues we'd encounter while keeping the original benefits (minus some memory reduction).
The service will use the `store.Storer` interface to access RBAC rules; it should be a wrapper around it so arbitrary implementations (external providers) could be plugged in.
Some bits need to check if a resource/role combo was added to the index... keeping another list would probably be insane, so I'll probably offload that to the counter. I don't think that's that big of a deal, since it wasn't intended to be generic for anything (at least for now).
Do they though? Since access to the index is much faster compared to the DB, does it matter if I fetch missing bits from the DB while collecting bits from the index? Theoretically, the gained time should be too small to notice. Will do some benchmarks and see if we want to change.
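The "fetch missing bits from the DB while collecting from the index" idea, sketched with illustrative names (not the actual corteza API): indexed roles come straight from memory, and all the misses are batched into a single store fetch.

```go
package main

import "fmt"

// Rule is an illustrative stand-in for an RBAC rule.
type Rule struct{ RoleID uint64 }

// collectRules gathers rules for a set of roles: indexed roles are served
// from memory, and every miss is collected and fetched in one batch, so the
// store is hit at most once per evaluation.
func collectRules(indexed map[uint64][]Rule, fetch func(missing []uint64) []Rule, roles []uint64) []Rule {
	var out []Rule
	var missing []uint64
	for _, r := range roles {
		if rr, ok := indexed[r]; ok {
			out = append(out, rr...)
		} else {
			missing = append(missing, r)
		}
	}
	if len(missing) > 0 {
		out = append(out, fetch(missing)...) // one round-trip for all misses
	}
	return out
}

func main() {
	indexed := map[uint64][]Rule{1: {{RoleID: 1}}}
	fetches := 0
	fetch := func(missing []uint64) []Rule {
		fetches++
		rr := make([]Rule, 0, len(missing))
		for _, id := range missing {
			rr = append(rr, Rule{RoleID: id})
		}
		return rr
	}
	out := collectRules(indexed, fetch, []uint64{1, 2, 3})
	fmt.Println(len(out), "rules collected,", fetches, "store fetch")
}
```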
Resource usage might drop off after a while, so we'll need to add some TTL (or similar). IIRC there were some mechanisms to combat that (thanks, uni) so I'll just use one of those bad boys.
I'll use an LRFU counter to combat all this. I might opt in to logging/tracking some extra stats for reports. I'm using LRFU instead of just LRU or LFU since both have valid cases in Corteza where they would fall flat. LRFU provides an option to generically cover all cases, or can be configured to cover either one or the other with a more/less aggressive λ.
If I'm feeling ambitious I might implement, along with some stats, a model to dynamically adjust the λ based on usage.
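A minimal sketch of the LRFU bookkeeping (the standard combined recency-frequency formulation with logical timestamps; not corteza code): each reference adds weight 1, and old weight decays by 2^(-λ·Δt). With λ = 0 this collapses to LFU (pure frequency); as λ grows it behaves like LRU (recency dominates).

```go
package main

import (
	"fmt"
	"math"
)

// lrfu tracks a combined recency-frequency value (CRF) per key:
// CRF(t) = sum over past references i of (1/2)^(lambda*(t - t_i)).
type lrfu struct {
	lambda float64
	crf    map[string]float64
	last   map[string]float64 // logical time of the last reference per key
}

func newLRFU(lambda float64) *lrfu {
	return &lrfu{lambda: lambda, crf: map[string]float64{}, last: map[string]float64{}}
}

// touch records a reference at logical time t using the incremental form:
// CRF_new = 1 + 2^(-lambda*(t - t_last)) * CRF_old.
func (l *lrfu) touch(key string, t float64) {
	decay := math.Pow(2, -l.lambda*(t-l.last[key]))
	l.crf[key] = 1 + decay*l.crf[key]
	l.last[key] = t
}

// score returns the CRF of key decayed to time t; the lowest-scoring key is
// the eviction candidate.
func (l *lrfu) score(key string, t float64) float64 {
	return math.Pow(2, -l.lambda*(t-l.last[key])) * l.crf[key]
}

func main() {
	c := newLRFU(0.5)
	c.touch("hot", 1)
	c.touch("hot", 2)
	c.touch("stale", 1)
	// "hot" was referenced twice and more recently, so it outscores "stale"
	fmt.Printf("hot=%.3f stale=%.3f\n", c.score("hot", 3), c.score("stale", 3))
}
```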
These are based on 10mio rules
Base is: