cortezaproject / corteza

Low-code platform
https://cortezaproject.org
Apache License 2.0

Optimise RBAC index memory pressure #1867

Open tjerman opened 1 month ago

tjerman commented 1 month ago

These are based on 10mio rules

Base is:

➜  server git:(2024.3.x-rbac-opt-v2) ✗ go test -benchmem -memprofile=base_now_10mio.pprof -run=^$ -bench ^BenchmarkIndexBuild_10000000$ github.com/cortezaproject/corteza/server/pkg/rbac
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/rbac
BenchmarkIndexBuild_10000000-12                1        10984989333 ns/op       5192429296 B/op 71882410 allocs/op
PASS
ok      github.com/cortezaproject/corteza/server/pkg/rbac       14.242s
tjerman commented 1 month ago

Trying out improving memory alignment

➜  server git:(2024.3.x-rbac-opt-v2) ✗ go test -benchmem -memprofile=upd_alignment_10mio.pprof -run=^$ -bench ^BenchmarkIndexBuild_10000000$ github.com/cortezaproject/corteza/server/pkg/rbac
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/rbac
BenchmarkIndexBuild_10000000-12                1        10459754625 ns/op       5192614672 B/op 71883998 allocs/op
PASS
ok      github.com/cortezaproject/corteza/server/pkg/rbac       13.720s

this is... kind of worse... the difference is negligible, so it's whatever

10984989333 / 10459754625 = 1.05 (time)
5192429296 / 5192614672 ≈ 1.00 (memory)

tjerman commented 1 month ago

Removing some stuff from the structs we get...

➜  server git:(2024.3.x-rbac-opt-v2) ✗ go test -benchmem -memprofile=simpler_structs_10mio.pprof -run=^$ -bench ^BenchmarkIndexBuild_10000000$ github.com/cortezaproject/corteza/server/pkg/rbac
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/rbac
BenchmarkIndexBuild_10000000-12                1        10744250416 ns/op       5028433528 B/op 71887240 allocs/op
PASS
ok      github.com/cortezaproject/corteza/server/pkg/rbac       14.100s

10984989333 / 10744250416 = 1.02 (time)
5192429296 / 5028433528 = 1.03 (memory)

marginally better but the bottom line... same difference. Makes sense since it was mostly just some pointers. Mayhaps values over pointers would work?

tjerman commented 1 month ago

using values we get...

➜  server git:(2024.3.x-rbac-opt-v2) ✗ go test -benchmem -memprofile=no_pointers_10mio.pprof -run=^$ -bench ^BenchmarkIndexBuild_10000000$ github.com/cortezaproject/corteza/server/pkg/rbac
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/rbac
BenchmarkIndexBuild_10000000-12                1        10557726250 ns/op       6178684144 B/op 71876321 allocs/op
PASS
ok      github.com/cortezaproject/corteza/server/pkg/rbac       14.007s

10984989333 / 10557726250 = 1.04 (time)
5192429296 / 6178684144 = 0.84 (memory)

yeesh... yeah no

tjerman commented 4 weeks ago

The TL;DR: the current change does the following:

  1. a wrapper around the index to hold the hot rules in memory
  2. a counter to keep track of misses

Theoretically, the most used bits will be in memory with fast access and all the rest will be in the database, accessed on demand (since DB access is generally slower, we can't just go 100% database).

For now, things are indexed/grouped by role and not by resource. The most memory-efficient version would hold only the most-used rules, but then we couldn't ensure that all the rules needed for an evaluation are ready (too many DB requests).

Indexing by most-used resource vs. most-used role -- I might implement both, since the underlying code shouldn't care much about which dimension is tracked. In my head, some users/roles probably perform the majority of operations, so this is the most logical solution.
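To make the wrapper + miss-counter combo concrete, here's a minimal sketch (all names here are hypothetical, not the actual Corteza types): hot roles are served from memory, anything else falls through to the store and bumps a per-role miss counter.

```go
package main

import (
	"fmt"
	"sync"
)

// indexWrapper sketches the "wrapper + miss counter" idea: hot roles stay
// indexed in memory, everything else falls through to the database, and a
// per-role miss counter tells us which roles deserve a slot in memory.
type indexWrapper struct {
	mu     sync.Mutex
	hot    map[uint64][]string          // roleID -> indexed rules
	misses map[uint64]uint64            // roleID -> miss count
	fetch  func(roleID uint64) []string // DB fallback on a miss
}

func newIndexWrapper(fetch func(uint64) []string) *indexWrapper {
	return &indexWrapper{
		hot:    map[uint64][]string{},
		misses: map[uint64]uint64{},
		fetch:  fetch,
	}
}

// RulesFor serves from memory when possible and records a miss otherwise.
func (w *indexWrapper) RulesFor(roleID uint64) []string {
	w.mu.Lock()
	defer w.mu.Unlock()

	if rr, ok := w.hot[roleID]; ok {
		return rr
	}
	w.misses[roleID]++
	return w.fetch(roleID)
}

func main() {
	w := newIndexWrapper(func(uint64) []string { return []string{"from-db"} })
	w.hot[1] = []string{"from-memory"}

	fmt.Println(w.RulesFor(1)[0], w.RulesFor(2)[0], w.misses[2])
	// → from-memory from-db 1
}
```

A real implementation would then periodically promote the roles with the highest miss counts into the hot set and drop the least-used ones.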

tjerman commented 4 weeks ago

Now... to do some benchmarking, in a way that won't be a pain to do...

Right now, we're looking for these things:

tjerman commented 4 weeks ago

Test benchmark

Now, I'm not entirely sure how to do this, so this is a little test... I think I'll do it like this:

  1. get a profile of the index initializing
  2. use the original code + whatever extra is needed
  3. diff the two profiles
  4. profit
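The steps above map onto the standard pprof workflow roughly like this (file names reused from the runs above; `-benchtime=1x` pins the iteration count to 1 and `-diff_base` is pprof's built-in profile diffing):

```shell
# 1. capture a baseline memory profile, one iteration for per-op numbers
go test -benchmem -benchtime=1x -memprofile=base_now_10mio.pprof \
  -run='^$' -bench '^BenchmarkIndexBuild_10000000$' \
  github.com/cortezaproject/corteza/server/pkg/rbac

# 2. re-run with the changed code into a second profile, then
# 3. diff the two profiles directly in pprof
go tool pprof -diff_base=base_now_10mio.pprof simpler_structs_10mio.pprof
```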

I'll set the repeat count to 1 since I want to see a per-op memory consumption

When looking at memory, I'll just look at the BenchmarkRbacBase function, since I'll make things relative.


Roles: 3; rules: 10k per role

Base benchmark; create an index with one role:


Mem: 13989 kB

Benchmark; add stuff to the index:

Mem: 19429 kB


Diff:

Mem change: 5310 kB (19429/13989 ≈ 1.38). The expected ratio would be 1.5, but there is some overhead from the relevant services that dilutes it.


tjerman commented 4 weeks ago

But these benchmarks provide aggregate values... will it be okay if I want to see how well updating the index/reindexing works?

A problem we're solving is that when we have a bunch of rules, re-indexing causes a big spike in memory (purely due to the sheer size). Is there even a real need to do memory profiling on this level? As is, the design limits the number of indexed rules, so memory consumption is predictable.

tjerman commented 4 weeks ago

Trying out some code to report memory state programmatically...

Copilot code, plus or minus some tweaks:

func printMemStats(b *testing.B, label string) {
    runtime.GC()

    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    b.Logf("%s: Alloc = %v MiB, TotalAlloc = %v MiB, Sys = %v MiB, NumGC = %v\n",
        label, bToMb(m.Alloc), bToMb(m.TotalAlloc), bToMb(m.Sys), m.NumGC)
}
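The `bToMb` helper isn't shown; presumably something along these lines (with `bTokB`, used by the later variants, being the kB analogue):

```go
// assumed conversion helpers -- not part of the original snippet
func bToMb(b uint64) uint64 { return b / 1024 / 1024 }
func bTokB(b uint64) uint64 { return b / 1024 }
```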
tjerman commented 4 weeks ago

We get this:

    rbac_rules_test.go:79: Before Benchmark: Alloc = 4415 kB, TotalAlloc = 8717 kB, Sys = 19361 kB, NumGC = 6
    rbac_rules_test.go:79: Index built: Alloc = 7070 kB, TotalAlloc = 14054 kB, Sys = 24225 kB, NumGC = 8
    rbac_rules_test.go:79: Index updated: Alloc = 9552 kB, TotalAlloc = 19112 kB, Sys = 24225 kB, NumGC = 9
    rbac_rules_test.go:79: After Benchmark: Alloc = 9552 kB, TotalAlloc = 19115 kB, Sys = 24225 kB, NumGC = 10

So these numbers seem about the same as they did with pprof, but this is a bit nicer to work with (at least for me). But this is after GC, so I'll probably tweak it like so:

func printMemStats(b *testing.B, label string) {
    var m runtime.MemStats

    runtime.ReadMemStats(&m)
    b.Logf("%s pre gc: Alloc = %v kB, TotalAlloc = %v kB, Sys = %v kB, NumGC = %v\n",
        label, bTokB(m.Alloc), bTokB(m.TotalAlloc), bTokB(m.Sys), m.NumGC)

    runtime.GC()
    b.Logf("%s post gc: Alloc = %v kB, TotalAlloc = %v kB, Sys = %v kB, NumGC = %v\n",
        label, bTokB(m.Alloc), bTokB(m.TotalAlloc), bTokB(m.Sys), m.NumGC)
}

With this I'll see the actual spike before anything gets cleaned up; so we get this:

    rbac_rules_test.go:78: Before Benchmark pre gc: Alloc = 4413 kB, TotalAlloc = 8669 kB, Sys = 19873 kB, NumGC = 5
    rbac_rules_test.go:82: Before Benchmark post gc: Alloc = 4413 kB, TotalAlloc = 8669 kB, Sys = 19873 kB, NumGC = 5
    rbac_rules_test.go:78: Index built pre gc: Alloc = 7861 kB, TotalAlloc = 14010 kB, Sys = 24481 kB, NumGC = 7
    rbac_rules_test.go:82: Index built post gc: Alloc = 7861 kB, TotalAlloc = 14010 kB, Sys = 24481 kB, NumGC = 7
    rbac_rules_test.go:78: Index updated pre gc: Alloc = 12124 kB, TotalAlloc = 19071 kB, Sys = 24481 kB, NumGC = 8
    rbac_rules_test.go:82: Index updated post gc: Alloc = 12124 kB, TotalAlloc = 19071 kB, Sys = 24481 kB, NumGC = 8
    rbac_rules_test.go:78: After Benchmark pre gc: Alloc = 9551 kB, TotalAlloc = 19078 kB, Sys = 28833 kB, NumGC = 9
    rbac_rules_test.go:82: After Benchmark post gc: Alloc = 9551 kB, TotalAlloc = 19078 kB, Sys = 28833 kB, NumGC = 9

We don't get much more info, but I suppose this scenario doesn't really have much to GC.

tjerman commented 4 weeks ago

I added a .Clear() function to the service and I get this

    rbac_rules_test.go:80: Before Benchmark pre gc: Alloc = 4427 kB, TotalAlloc = 8737 kB, Sys = 19617 kB, NumGC = 5
    rbac_rules_test.go:84: Before Benchmark post gc: Alloc = 4427 kB, TotalAlloc = 8737 kB, Sys = 19617 kB, NumGC = 5
    rbac_rules_test.go:80: Index built pre gc: Alloc = 8064 kB, TotalAlloc = 14077 kB, Sys = 23969 kB, NumGC = 7
    rbac_rules_test.go:84: Index built post gc: Alloc = 8064 kB, TotalAlloc = 14077 kB, Sys = 23969 kB, NumGC = 7
    rbac_rules_test.go:80: Index updated pre gc: Alloc = 12137 kB, TotalAlloc = 19138 kB, Sys = 24225 kB, NumGC = 8
    rbac_rules_test.go:84: Index updated post gc: Alloc = 12137 kB, TotalAlloc = 19138 kB, Sys = 24225 kB, NumGC = 8
    rbac_rules_test.go:80: Index cleared pre gc: Alloc = 9568 kB, TotalAlloc = 19147 kB, Sys = 28577 kB, NumGC = 9
    rbac_rules_test.go:84: Index cleared post gc: Alloc = 9568 kB, TotalAlloc = 19147 kB, Sys = 28577 kB, NumGC = 9
    rbac_rules_test.go:80: After Benchmark pre gc: Alloc = 4449 kB, TotalAlloc = 19152 kB, Sys = 28577 kB, NumGC = 10
    rbac_rules_test.go:84: After Benchmark post gc: Alloc = 4449 kB, TotalAlloc = 19152 kB, Sys = 28577 kB, NumGC = 10

Looking at this:

    rbac_rules_test.go:84: Index updated post gc: Alloc = 12137 kB, TotalAlloc = 19138 kB, Sys = 24225 kB, NumGC = 8
    rbac_rules_test.go:80: Index cleared pre gc: Alloc = 9568 kB, TotalAlloc = 19147 kB, Sys = 28577 kB, NumGC = 9
    rbac_rules_test.go:84: Index cleared post gc: Alloc = 9568 kB, TotalAlloc = 19147 kB, Sys = 28577 kB, NumGC = 9

runtime.GC() does nothing... why?

tjerman commented 4 weeks ago

... because I forgot to call runtime.ReadMemStats(&m)...

So..

func printMemStats(b *testing.B, label string) {
    var m runtime.MemStats

    runtime.ReadMemStats(&m)
    b.Logf("%s pre gc: Alloc = %v kB, TotalAlloc = %v kB, Sys = %v kB, NumGC = %v\n",
        label, bTokB(m.Alloc), bTokB(m.TotalAlloc), bTokB(m.Sys), m.NumGC)

    runtime.GC()
    runtime.ReadMemStats(&m)
    b.Logf("%s post gc: Alloc = %v kB, TotalAlloc = %v kB, Sys = %v kB, NumGC = %v\n",
        label, bTokB(m.Alloc), bTokB(m.TotalAlloc), bTokB(m.Sys), m.NumGC)
}

gives

    rbac_rules_test.go:80: Before Benchmark pre gc: Alloc = 4430 kB, TotalAlloc = 8734 kB, Sys = 19873 kB, NumGC = 5
    rbac_rules_test.go:85: Before Benchmark post gc: Alloc = 4429 kB, TotalAlloc = 8737 kB, Sys = 19873 kB, NumGC = 6
    rbac_rules_test.go:80: Index built pre gc: Alloc = 7872 kB, TotalAlloc = 14072 kB, Sys = 23969 kB, NumGC = 7
    rbac_rules_test.go:85: Index built post gc: Alloc = 7079 kB, TotalAlloc = 14078 kB, Sys = 23969 kB, NumGC = 8
    rbac_rules_test.go:80: Index updated pre gc: Alloc = 12139 kB, TotalAlloc = 19138 kB, Sys = 24225 kB, NumGC = 8
    rbac_rules_test.go:85: Index updated post gc: Alloc = 9565 kB, TotalAlloc = 19146 kB, Sys = 28833 kB, NumGC = 9
    rbac_rules_test.go:80: Index cleared pre gc: Alloc = 9568 kB, TotalAlloc = 19148 kB, Sys = 28833 kB, NumGC = 9
    rbac_rules_test.go:85: Index cleared post gc: Alloc = 4445 kB, TotalAlloc = 19149 kB, Sys = 28833 kB, NumGC = 10
    rbac_rules_test.go:80: After Benchmark pre gc: Alloc = 4448 kB, TotalAlloc = 19152 kB, Sys = 28833 kB, NumGC = 10
    rbac_rules_test.go:85: After Benchmark post gc: Alloc = 4445 kB, TotalAlloc = 19153 kB, Sys = 28833 kB, NumGC = 11

🎉

tjerman commented 3 weeks ago

It might be worth looking into this change:

Instead of relying solely on roles, it might be worth indexing by resource and role. When I discarded indexing by individual rules, the rationale was that many cases would still need database access (even though controlling memory consumption would then be trivial).

If we do roles and resources we could do the following:

Doing it like this would reduce the number of indexed rules while ensuring fast lookups for common resources & roles.

Now... how do we change the change to do this...

Should we use two counters, one for roles and one for resources? Should we use one big index? I think it'll be a bit of a pain to figure out which rules to yank out of the index... would it be saner to use multiple little indexes? One per resource might be a bit insane, so let's consider one per role...

So we'd have the wrapper service with two counters: one for used roles and one for used resources. When some resource is determined to be in high demand, the lowest-demand resource would be yanked out of the index and the new one indexed. The index would be updated with the N most-used roles.

Will it be a pain to know what to yank out? If we do multiple indexes, one per role, that'll be more of a pain. If we index per resource, we'll get some duplication which will also be a pain (I think)...

So I think that one big index will be enough; since we're dealing with resources, those aren't really hard to take out.

tjerman commented 3 weeks ago

Would it make more sense to combine the role & resource and use the resulting string to track usage? It'll probably use a bit more space but I think it'll be easier to do; so let's go with this one for now.

But now then... how do we figure out what resources we have available? We can't really rely on analysing rules since most of them won't be resource-specific (except for some scenarios). A cold start?

A combination of roles and resources? Going through some records, modules, ...? If we go down this route we'd definitely need to store this somewhere -- it's kind of expensive to do on each cold start...
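A minimal sketch of what tracking usage by a combined role+resource key could look like (everything here is hypothetical, including the eviction policy and resource strings): count hits per "roleID:resource" string and yank out the least-used key once the tracked set outgrows its limit.

```go
package main

import (
	"fmt"
	"math"
)

// usageTracker counts lookups per combined "roleID:resource" key; when the
// tracked set outgrows the limit, the least-used key is the eviction candidate.
type usageTracker struct {
	counts map[string]uint64
	limit  int
}

func newUsageTracker(limit int) *usageTracker {
	return &usageTracker{counts: map[string]uint64{}, limit: limit}
}

func key(roleID uint64, resource string) string {
	return fmt.Sprintf("%d:%s", roleID, resource)
}

// Hit records a lookup and returns the key that should be yanked out of the
// index to make room ("" when nothing needs to go).
func (t *usageTracker) Hit(roleID uint64, resource string) (evict string) {
	k := key(roleID, resource)
	t.counts[k]++

	if len(t.counts) <= t.limit {
		return ""
	}

	// find the least-used entry, never the one we just bumped
	min := uint64(math.MaxUint64)
	for kk, c := range t.counts {
		if kk != k && c < min {
			min, evict = c, kk
		}
	}
	delete(t.counts, evict)
	return evict
}

func main() {
	t := newUsageTracker(2)
	t.Hit(1, "corteza::compose:record/1") // hot pair, bumped twice
	t.Hit(1, "corteza::compose:record/1")
	t.Hit(2, "corteza::compose:module/2") // used once

	fmt.Println(t.Hit(3, "corteza::system:user/3")) // over the limit now
	// → 2:corteza::compose:module/2
}
```

Using one flat string key sidesteps the "multiple little indexes" bookkeeping, at the cost of a bit of extra space per entry, which matches the trade-off described above.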

tjerman commented 3 weeks ago

So... this is what we'll do:

The design will be as I originally envisioned it, but instead of tracking usage based on roles alone (to see what needs to get indexed), we'll use roles & resources. This should provide more control over how many RBAC rules, and which ones specifically, we keep in memory.

The only scenario (that I could think of) where using roles might be better is when each user has their own role (or a small set) and each role has a modest amount of RBAC rules attached (there isn't much composition going on). For this scenario, using roles, we'd probably get more index hits and less database usage.

With that said, mixing resources into the equation addresses a lot of the issues we'd encounter while keeping the original benefits (minus some of the memory reduction).

Upsides, downsides

Upsides

Downsides

Added points for improvement

Problems to address