cortezaproject / corteza

Low-code platform
https://cortezaproject.org
Apache License 2.0

Optimise RBAC index memory pressure #1867

Open tjerman opened 1 month ago

tjerman commented 1 month ago

These are based on 10mio rules

Base is:

➜  server git:(2024.3.x-rbac-opt-v2) ✗ go test -benchmem -memprofile=base_now_10mio.pprof -run=^$ -bench ^BenchmarkIndexBuild_10000000$ github.com/cortezaproject/corteza/server/pkg/rbac
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/rbac
BenchmarkIndexBuild_10000000-12                1        10984989333 ns/op       5192429296 B/op 71882410 allocs/op
PASS
ok      github.com/cortezaproject/corteza/server/pkg/rbac       14.242s
tjerman commented 1 month ago

Trying out improving memory alignment

➜  server git:(2024.3.x-rbac-opt-v2) ✗ go test -benchmem -memprofile=upd_alignment_10mio.pprof -run=^$ -bench ^BenchmarkIndexBuild_10000000$ github.com/cortezaproject/corteza/server/pkg/rbac
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/rbac
BenchmarkIndexBuild_10000000-12                1        10459754625 ns/op       5192614672 B/op 71883998 allocs/op
PASS
ok      github.com/cortezaproject/corteza/server/pkg/rbac       13.720s

this is... kind of worse... the difference is negligible, so it's whatever

10984989333 / 10459754625 = 1.05 (time)
5192429296 / 5192614672 ≈ 1.00 (memory)

tjerman commented 1 month ago

Removing some stuff from the structs we get...

➜  server git:(2024.3.x-rbac-opt-v2) ✗ go test -benchmem -memprofile=simpler_structs_10mio.pprof -run=^$ -bench ^BenchmarkIndexBuild_10000000$ github.com/cortezaproject/corteza/server/pkg/rbac
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/rbac
BenchmarkIndexBuild_10000000-12                1        10744250416 ns/op       5028433528 B/op 71887240 allocs/op
PASS
ok      github.com/cortezaproject/corteza/server/pkg/rbac       14.100s

10984989333 / 10744250416 = 1.02 (time)
5192429296 / 5028433528 = 1.03 (memory)

marginally better but the bottom line... same difference. Makes sense since it was mostly just some pointers. Mayhaps values over pointers would work?

tjerman commented 1 month ago

using values we get...

➜  server git:(2024.3.x-rbac-opt-v2) ✗ go test -benchmem -memprofile=no_pointers_10mio.pprof -run=^$ -bench ^BenchmarkIndexBuild_10000000$ github.com/cortezaproject/corteza/server/pkg/rbac
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/rbac
BenchmarkIndexBuild_10000000-12                1        10557726250 ns/op       6178684144 B/op 71876321 allocs/op
PASS
ok      github.com/cortezaproject/corteza/server/pkg/rbac       14.007s

10984989333 / 10557726250 = 1.04 (time)
5192429296 / 6178684144 = 0.84 (memory)

yeesh... yeah no

tjerman commented 4 weeks ago

The TL;DR: the current change does the following:

  1. a wrapper around the index to hold the hot rules in memory
  2. a counter to keep track of misses

Theoretically, the most used bits will be in memory with fast access and all the rest will be in the database, accessed on demand (since DB access is generally slower, we can't just go 100% database).

For now, things are indexed/grouped by role and not by resource. The most memory-efficient version would hold only the most-used rules, but then we couldn't ensure that all the rules needed for an evaluation are ready (too many DB requests).

Indexing by most-used resource vs. most-used role -- I might implement both, since the underlying code shouldn't care much about which dimension is tracked. In my head, some users/roles probably perform the majority of operations, so this is the most logical solution.
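To make the wrapper + miss-counter combo concrete, here's a minimal sketch (all names here are hypothetical, not the actual Corteza types): hot roles are served from memory, anything else falls through to the store and bumps a per-role miss counter.

```go
package main

import (
	"fmt"
	"sync"
)

// indexWrapper sketches the "wrapper + miss counter" idea: hot roles stay
// indexed in memory, everything else falls through to the database, and a
// per-role miss counter tells us which roles deserve a slot in memory.
type indexWrapper struct {
	mu     sync.Mutex
	hot    map[uint64][]string          // roleID -> indexed rules
	misses map[uint64]uint64            // roleID -> miss count
	fetch  func(roleID uint64) []string // DB fallback on a miss
}

func newIndexWrapper(fetch func(uint64) []string) *indexWrapper {
	return &indexWrapper{
		hot:    map[uint64][]string{},
		misses: map[uint64]uint64{},
		fetch:  fetch,
	}
}

// RulesFor serves from memory when possible and records a miss otherwise.
func (w *indexWrapper) RulesFor(roleID uint64) []string {
	w.mu.Lock()
	defer w.mu.Unlock()

	if rr, ok := w.hot[roleID]; ok {
		return rr
	}
	w.misses[roleID]++
	return w.fetch(roleID)
}

func main() {
	w := newIndexWrapper(func(uint64) []string { return []string{"from-db"} })
	w.hot[1] = []string{"from-memory"}

	fmt.Println(w.RulesFor(1)[0], w.RulesFor(2)[0], w.misses[2])
	// → from-memory from-db 1
}
```

A real implementation would then periodically promote the roles with the highest miss counts into the hot set and drop the least-used ones.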

tjerman commented 4 weeks ago

Now... to do some benchmarking, in a way that won't be a pain to do...

Right now, we're looking for these things:

tjerman commented 4 weeks ago

Test benchmark

Now, I'm not entirely sure how to do this, so this is a little test... I think I'll do it like this:

  1. get a profile of the index initializing
  2. use the original code + whatever extra is needed
  3. diff the two profiles
  4. profit
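The steps above map onto the standard pprof workflow roughly like this (file names reused from the runs above; `-benchtime=1x` pins the iteration count to 1 and `-diff_base` is pprof's built-in profile diffing):

```shell
# 1. capture a baseline memory profile, one iteration for per-op numbers
go test -benchmem -benchtime=1x -memprofile=base_now_10mio.pprof \
  -run='^$' -bench '^BenchmarkIndexBuild_10000000$' \
  github.com/cortezaproject/corteza/server/pkg/rbac

# 2. re-run with the changed code into a second profile, then
# 3. diff the two profiles directly in pprof
go tool pprof -diff_base=base_now_10mio.pprof simpler_structs_10mio.pprof
```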

I'll set the repeat count to 1 since I want to see a per-op memory consumption

When looking at memory, I'll just look at the BenchmarkRbacBase function, since I'll make things relative.


Roles: 3; rules: 10k per role

Base benchmark; create an index with one role:


Mem: 13989 kB

Benchmark; add stuff to the index:

Mem: 19429 kB


Diff:

Mem change: 5310 kB (19429/13989 ≈ 1.38). The expected ratio would be 1.5, but there is some overhead from the relevant services that dilutes it.


tjerman commented 4 weeks ago

But these benchmarks provide aggregate values... will it be okay if I want to see how well updating the index/reindexing works?

A problem we're solving is that when we have a bunch of rules, re-indexing causes a big spike in memory (purely due to the sheer size). Is there even a real need to do memory profiling on this level? As is, the design limits the number of indexed rules, so memory consumption is predictable.

tjerman commented 4 weeks ago

Trying out some code to report memory state programmatically...

Copilot code, plus or minus some tweaks:

func printMemStats(b *testing.B, label string) {
    runtime.GC()

    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    b.Logf("%s: Alloc = %v MiB, TotalAlloc = %v MiB, Sys = %v MiB, NumGC = %v\n",
        label, bToMb(m.Alloc), bToMb(m.TotalAlloc), bToMb(m.Sys), m.NumGC)
}
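The `bToMb` helper isn't shown; presumably something along these lines (with `bTokB`, used by the later variants, being the kB analogue):

```go
// assumed conversion helpers -- not part of the original snippet
func bToMb(b uint64) uint64 { return b / 1024 / 1024 }
func bTokB(b uint64) uint64 { return b / 1024 }
```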
tjerman commented 4 weeks ago

We get this:

    rbac_rules_test.go:79: Before Benchmark: Alloc = 4415 kB, TotalAlloc = 8717 kB, Sys = 19361 kB, NumGC = 6
    rbac_rules_test.go:79: Index built: Alloc = 7070 kB, TotalAlloc = 14054 kB, Sys = 24225 kB, NumGC = 8
    rbac_rules_test.go:79: Index updated: Alloc = 9552 kB, TotalAlloc = 19112 kB, Sys = 24225 kB, NumGC = 9
    rbac_rules_test.go:79: After Benchmark: Alloc = 9552 kB, TotalAlloc = 19115 kB, Sys = 24225 kB, NumGC = 10

So these numbers seem about the same as they did with pprof, but this is a bit nicer to work with (at least for me). But this is after GC, so I'll probably tweak it like so:

func printMemStats(b *testing.B, label string) {
    var m runtime.MemStats

    runtime.ReadMemStats(&m)
    b.Logf("%s pre gc: Alloc = %v kB, TotalAlloc = %v kB, Sys = %v kB, NumGC = %v\n",
        label, bTokB(m.Alloc), bTokB(m.TotalAlloc), bTokB(m.Sys), m.NumGC)

    runtime.GC()
    b.Logf("%s post gc: Alloc = %v kB, TotalAlloc = %v kB, Sys = %v kB, NumGC = %v\n",
        label, bTokB(m.Alloc), bTokB(m.TotalAlloc), bTokB(m.Sys), m.NumGC)
}

With this I'll see the actual spike before anything gets cleaned up; so we get this:

    rbac_rules_test.go:78: Before Benchmark pre gc: Alloc = 4413 kB, TotalAlloc = 8669 kB, Sys = 19873 kB, NumGC = 5
    rbac_rules_test.go:82: Before Benchmark post gc: Alloc = 4413 kB, TotalAlloc = 8669 kB, Sys = 19873 kB, NumGC = 5
    rbac_rules_test.go:78: Index built pre gc: Alloc = 7861 kB, TotalAlloc = 14010 kB, Sys = 24481 kB, NumGC = 7
    rbac_rules_test.go:82: Index built post gc: Alloc = 7861 kB, TotalAlloc = 14010 kB, Sys = 24481 kB, NumGC = 7
    rbac_rules_test.go:78: Index updated pre gc: Alloc = 12124 kB, TotalAlloc = 19071 kB, Sys = 24481 kB, NumGC = 8
    rbac_rules_test.go:82: Index updated post gc: Alloc = 12124 kB, TotalAlloc = 19071 kB, Sys = 24481 kB, NumGC = 8
    rbac_rules_test.go:78: After Benchmark pre gc: Alloc = 9551 kB, TotalAlloc = 19078 kB, Sys = 28833 kB, NumGC = 9
    rbac_rules_test.go:82: After Benchmark post gc: Alloc = 9551 kB, TotalAlloc = 19078 kB, Sys = 28833 kB, NumGC = 9

We don't get much more info, but I suppose this scenario doesn't really have much to GC.

tjerman commented 4 weeks ago

I added a .Clear() function to the service and I get this

    rbac_rules_test.go:80: Before Benchmark pre gc: Alloc = 4427 kB, TotalAlloc = 8737 kB, Sys = 19617 kB, NumGC = 5
    rbac_rules_test.go:84: Before Benchmark post gc: Alloc = 4427 kB, TotalAlloc = 8737 kB, Sys = 19617 kB, NumGC = 5
    rbac_rules_test.go:80: Index built pre gc: Alloc = 8064 kB, TotalAlloc = 14077 kB, Sys = 23969 kB, NumGC = 7
    rbac_rules_test.go:84: Index built post gc: Alloc = 8064 kB, TotalAlloc = 14077 kB, Sys = 23969 kB, NumGC = 7
    rbac_rules_test.go:80: Index updated pre gc: Alloc = 12137 kB, TotalAlloc = 19138 kB, Sys = 24225 kB, NumGC = 8
    rbac_rules_test.go:84: Index updated post gc: Alloc = 12137 kB, TotalAlloc = 19138 kB, Sys = 24225 kB, NumGC = 8
    rbac_rules_test.go:80: Index cleared pre gc: Alloc = 9568 kB, TotalAlloc = 19147 kB, Sys = 28577 kB, NumGC = 9
    rbac_rules_test.go:84: Index cleared post gc: Alloc = 9568 kB, TotalAlloc = 19147 kB, Sys = 28577 kB, NumGC = 9
    rbac_rules_test.go:80: After Benchmark pre gc: Alloc = 4449 kB, TotalAlloc = 19152 kB, Sys = 28577 kB, NumGC = 10
    rbac_rules_test.go:84: After Benchmark post gc: Alloc = 4449 kB, TotalAlloc = 19152 kB, Sys = 28577 kB, NumGC = 10

Looking at this:

    rbac_rules_test.go:84: Index updated post gc: Alloc = 12137 kB, TotalAlloc = 19138 kB, Sys = 24225 kB, NumGC = 8
    rbac_rules_test.go:80: Index cleared pre gc: Alloc = 9568 kB, TotalAlloc = 19147 kB, Sys = 28577 kB, NumGC = 9
    rbac_rules_test.go:84: Index cleared post gc: Alloc = 9568 kB, TotalAlloc = 19147 kB, Sys = 28577 kB, NumGC = 9

runtime.GC() does nothing... why?

tjerman commented 4 weeks ago

... because I forgot to call runtime.ReadMemStats(&m)...

So..

func printMemStats(b *testing.B, label string) {
    var m runtime.MemStats

    runtime.ReadMemStats(&m)
    b.Logf("%s pre gc: Alloc = %v kB, TotalAlloc = %v kB, Sys = %v kB, NumGC = %v\n",
        label, bTokB(m.Alloc), bTokB(m.TotalAlloc), bTokB(m.Sys), m.NumGC)

    runtime.GC()
    runtime.ReadMemStats(&m)
    b.Logf("%s post gc: Alloc = %v kB, TotalAlloc = %v kB, Sys = %v kB, NumGC = %v\n",
        label, bTokB(m.Alloc), bTokB(m.TotalAlloc), bTokB(m.Sys), m.NumGC)
}

gives

    rbac_rules_test.go:80: Before Benchmark pre gc: Alloc = 4430 kB, TotalAlloc = 8734 kB, Sys = 19873 kB, NumGC = 5
    rbac_rules_test.go:85: Before Benchmark post gc: Alloc = 4429 kB, TotalAlloc = 8737 kB, Sys = 19873 kB, NumGC = 6
    rbac_rules_test.go:80: Index built pre gc: Alloc = 7872 kB, TotalAlloc = 14072 kB, Sys = 23969 kB, NumGC = 7
    rbac_rules_test.go:85: Index built post gc: Alloc = 7079 kB, TotalAlloc = 14078 kB, Sys = 23969 kB, NumGC = 8
    rbac_rules_test.go:80: Index updated pre gc: Alloc = 12139 kB, TotalAlloc = 19138 kB, Sys = 24225 kB, NumGC = 8
    rbac_rules_test.go:85: Index updated post gc: Alloc = 9565 kB, TotalAlloc = 19146 kB, Sys = 28833 kB, NumGC = 9
    rbac_rules_test.go:80: Index cleared pre gc: Alloc = 9568 kB, TotalAlloc = 19148 kB, Sys = 28833 kB, NumGC = 9
    rbac_rules_test.go:85: Index cleared post gc: Alloc = 4445 kB, TotalAlloc = 19149 kB, Sys = 28833 kB, NumGC = 10
    rbac_rules_test.go:80: After Benchmark pre gc: Alloc = 4448 kB, TotalAlloc = 19152 kB, Sys = 28833 kB, NumGC = 10
    rbac_rules_test.go:85: After Benchmark post gc: Alloc = 4445 kB, TotalAlloc = 19153 kB, Sys = 28833 kB, NumGC = 11

🎉

tjerman commented 3 weeks ago

It might be worth looking into this change:

Instead of relying solely on roles, it might be worth indexing by resource and role. When I discarded indexing by individual rules, the rationale was that many cases would still need database access (even though controlling memory consumption would then be trivial).

If we do roles and resources we could do the following:

Doing it like this would reduce the number of indexed rules while ensuring fast lookups for common resources & roles.

Now... how do we change the change to do this...

Should we use two counters, one for roles and one for resources? Should we use one big index? I think it'll be a bit of a pain to figure out which rules to yank out of the index... would it be saner to use multiple little indexes? One per resource might be a bit insane, so let's consider one per role...

So we'd have the wrapper service with two counters: one for used roles and one for used resources. When some resource is determined to be in high demand, the lowest-demand resource would be yanked out of the index and the new one indexed. The index would be updated with the N most-used roles.

Will it be a pain to know what to yank out? If we do multiple indexes, one per role, that'll be more of a pain. If we index per resource, we'll get some duplication which will also be a pain (I think)...

So I think that one big index will be enough; since we're dealing with resources, those aren't really hard to take out.

tjerman commented 3 weeks ago

Would it make more sense to combine the role & resource and use the resulting string to track usage? It'll probably use a bit more space but I think it'll be easier to do; so let's go with this one for now.

But now then... how do we figure out what resources we have available? We can't really rely on analysing rules since most of them won't be resource-specific (except for some scenarios). A cold start?

A combination of roles and resources? Going through some records, modules, ...? If we go down this route we'd definitely need to store this somewhere -- it's kind of expensive to do on each cold start...
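A minimal sketch of what tracking usage by a combined role+resource key could look like (everything here is hypothetical, including the eviction policy and resource strings): count hits per "roleID:resource" string and yank out the least-used key once the tracked set outgrows its limit.

```go
package main

import (
	"fmt"
	"math"
)

// usageTracker counts lookups per combined "roleID:resource" key; when the
// tracked set outgrows the limit, the least-used key is the eviction candidate.
type usageTracker struct {
	counts map[string]uint64
	limit  int
}

func newUsageTracker(limit int) *usageTracker {
	return &usageTracker{counts: map[string]uint64{}, limit: limit}
}

func key(roleID uint64, resource string) string {
	return fmt.Sprintf("%d:%s", roleID, resource)
}

// Hit records a lookup and returns the key that should be yanked out of the
// index to make room ("" when nothing needs to go).
func (t *usageTracker) Hit(roleID uint64, resource string) (evict string) {
	k := key(roleID, resource)
	t.counts[k]++

	if len(t.counts) <= t.limit {
		return ""
	}

	// find the least-used entry, never the one we just bumped
	min := uint64(math.MaxUint64)
	for kk, c := range t.counts {
		if kk != k && c < min {
			min, evict = c, kk
		}
	}
	delete(t.counts, evict)
	return evict
}

func main() {
	t := newUsageTracker(2)
	t.Hit(1, "corteza::compose:record/1") // hot pair, bumped twice
	t.Hit(1, "corteza::compose:record/1")
	t.Hit(2, "corteza::compose:module/2") // used once

	fmt.Println(t.Hit(3, "corteza::system:user/3")) // over the limit now
	// → 2:corteza::compose:module/2
}
```

Using one flat string key sidesteps the "multiple little indexes" bookkeeping, at the cost of a bit of extra space per entry, which matches the trade-off described above.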

tjerman commented 3 weeks ago

So... this is what we'll do:

The design will be as I originally envisioned it, but instead of tracking usage based on roles alone (to see what needs to get indexed), we'll use roles & resources. This should provide more control over how many RBAC rules, and which ones specifically, we keep in memory.

The only scenario (that I could think of) where using roles might be better is when each user has their own role (or a small set) and each role has a modest amount of RBAC rules attached (there isn't much composition going on). For this scenario, using roles, we'd probably get more index hits and less database usage.

With that said, mixing resources into the equation addresses a lot of the issues we'd encounter while keeping the original benefits (minus some of the memory reduction).

Upsides, downsides

Upsides

Downsides

Added points for improvement

Problems to address