The workflow is a simple iterator with 10k iterations:
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/tests/workflows
Benchmark_bench_exec_base-12 1 4501291542 ns/op 3366690024 B/op 39866614 allocs/op
The no. 1 bottleneck is the state change handler; primarily JSON marshalling.
JSON marshalling happens in two places -- making stack traces (a primitive approach to deep cloning expr variables) and flushing to the database.
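For context, a JSON round-trip clone of the workflow variables looks roughly like this (a minimal sketch, not the actual Corteza code; `cloneViaJSON` is an illustrative name). Every call pays for the intermediate byte slice plus freshly boxed values, which is what dominates the numbers above:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cloneViaJSON deep-copies a value by marshalling it to JSON and unmarshalling
// it back -- simple, but it allocates the intermediate bytes and re-boxes
// every nested value on each call.
func cloneViaJSON(src map[string]any) (map[string]any, error) {
	raw, err := json.Marshal(src)
	if err != nil {
		return nil, err
	}

	dst := map[string]any{}
	if err := json.Unmarshal(raw, &dst); err != nil {
		return nil, err
	}
	return dst, nil
}

func main() {
	vars := map[string]any{"counter": 1, "items": []any{"a", "b"}}

	clone, err := cloneViaJSON(vars)
	if err != nil {
		panic(err)
	}

	clone["counter"] = 2
	fmt.Println(vars["counter"], clone["counter"]) // 1 2 -- the copy is independent
}
```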
BM pre_iter_10000
Removing the state change handler to see what the best case scenario would be:
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/tests/workflows
Benchmark_bench_exec_base-12 1 1001210917 ns/op 179357032 B/op 1040858 allocs/op
Comparing with the original we get this (speculation):
time 4501291542/1001210917 = 4.49 times better
space 3366690024/179357032 = 18.7 times better
allocations 39866614/1040858 = 38 times better
The next bottleneck seems to be in expression evaluation.
Investigating, we see pkg/expr does a bunch of ops over strings and slices (which causes high memory pressure).
We can (probably) reduce the memory pressure by introducing some intermediate structure.
Benchmarking pkg/expr#PathSplit we see:
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/expr
BenchmarkSplit-12 1948309 613.0 ns/op 4370 B/op 7 allocs/op
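For reference, a plain string/slice based splitter along these lines (a sketch only, not the actual PathSplit, which handles more complex syntax) allocates a fresh slice on every call -- exactly the kind of pressure a tight evaluation loop amplifies:

```go
package main

import (
	"fmt"
	"strings"
)

// naiveSplit stands in for a string/slice based path splitter: every call
// allocates a new []string header, so hot expression-evaluation paths pay for
// it over and over.
func naiveSplit(path string) []string {
	return strings.Split(path, ".")
}

func main() {
	for _, p := range naiveSplit("event.record.values.total") {
		fmt.Println(p)
	}
}
```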
Introducing some aux structure to better traverse expressions, we go to this:
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/expr
BenchmarkPath-12 14496174 75.33 ns/op 0 B/op 0 allocs/op
Comparing the numbers:
time 613.0/75.33 = 8.13 times better
space 4370/0 = optimal space
allocations 7/0 = optimal allocations
We introduce this bad boy:
type exprPath struct {
	path       string // the full expression path being traversed
	i          int    // cursor into the path
	isLast     bool   // true once the current segment is the last one
	start, end int    // bounds of the current segment within path
}
You pass in an expression, which gets cached along with some indexes and meta bits. When you request the next bit of the expression, the struct slices the original string. Thanks to Go's memory model this doesn't allocate any new memory since the underlying string stays the same -- only the indexes change (sub-slices keep the whole original string alive, so this can also lead to memory leaks; be careful).
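A minimal sketch of the idea (the constructor and the Next() helper are illustrative, not the exact Corteza implementation):

```go
package main

import (
	"fmt"
	"strings"
)

// exprPath walks a dotted expression path without allocating: Next() returns
// sub-slices of the original string, so only the indexes change.
type exprPath struct {
	path       string
	start, end int
	isLast     bool
}

func newExprPath(path string) *exprPath {
	return &exprPath{path: path, end: -1}
}

// Next returns the next path segment, or false once the path is exhausted.
func (p *exprPath) Next() (string, bool) {
	if p.isLast {
		return "", false
	}

	p.start = p.end + 1
	if i := strings.IndexByte(p.path[p.start:], '.'); i >= 0 {
		p.end = p.start + i
	} else {
		p.end = len(p.path)
		p.isLast = true
	}

	// The returned segment shares the original string's backing array; holding
	// onto it keeps the whole expression alive (the leak mentioned above).
	return p.path[p.start:p.end], true
}

func main() {
	p := newExprPath("event.record.values.total")
	for s, ok := p.Next(); ok; s, ok = p.Next() {
		fmt.Println(s)
	}
}
```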
Updating expr.Assign with the reworked structure, we go from
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/expr
BenchmarkAssign-12 1811497 662.7 ns/op 4368 B/op 6 allocs/op
to
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/expr
BenchmarkAssign-12 9167824 133.1 ns/op 0 B/op 0 allocs/op
Comparing the numbers:
time 662.7/133.1 = 4.97 times better
space 4368/0 = optimal space
allocations 6/0 = optimal allocations
Same iterator but the state change handler is turned off and expressions tweaked
Original numbers (without state change handler):
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/tests/workflows
Benchmark_bench_exec_base-12 1 1001210917 ns/op 179357032 B/op 1040858 allocs/op
We get to this:
time 1001210917/1001463750 = 0.99, marginally slower (negligible difference)
space 179357032/54809432 = 3.27 times better
allocations 1040858/930611 = 1.1 times better
Looking at the numbers: the time is practically the same, while overall memory usage is lower.
The seeming bottleneck now would be with var/state management. Per run we're using up 50MB which isn't that bad (I think).
We can't do much more as far as this is concerned. At some point we could rework expressions to rely less on strings and maps; but for now we'll leave it as is.
Taking a peek at the CPU profile, we see this:
This hints that some locking/blocking is going on. My guess would be the locks we have on typed values (hindsight is 20/20; it was not 💔)
Grabbing an execution trace we see something interesting:
(I forgot to grab a screenshot of gaps between benchmark iterations)
Tested on some other benchmark and that one looks fine... I can't figure out exactly why this happens, but everything points to how we utilize goroutines and channels under the hood.
Mutex profile is empty so that shouldn't be the problem.
Figuring it all out...
The random gaps were caused by sonyflake (since it can only generate 255 IDs per 10ms; https://github.com/sony/sonyflake). The gap between benchmark iterations is caused by how we wait for results -- we have a ticker that runs every x ms.
(Sonyflake explained later)
(The issue with the waiter was temporarily mocked out by increasing the frequency)
With all this in place it now looks like this:
Original benchmark (excluding the state change handler)
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/tests/workflows
Benchmark_bench_exec_base-12 1 1001210917 ns/op 179357032 B/op 1040858 allocs/op
We get to this
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/tests/workflows
Benchmark_bench_exec_base-12 100 10258258 ns/op 5532122 B/op 93417 allocs/op
Comparing with the original we get this:
time 1001210917/10258258 = 97.6 times better
space 179357032/5532122 = 32.42 times better
allocations 1040858/93417 = 11.14 times better
Temporarily moving to product backlog to do some profiling on DAL
Looking into the state change handler and how the state values are cloned. Originally we used a JSON marshal followed by a JSON unmarshal, which is quite resource heavy.
Quickly drafted a solution to clone values and not use JSON stuff.
before:
136 11005555 ns/op 2573102 B/op 45278 allocs/op
After:
279 4009939 ns/op 907311 B/op 14388 allocs/op
Comparing the numbers we see (speculative):
time 11005555/4009939 = 2.74 times better
space 2573102/907311 = 2.8 times better
allocations 45278/14388 = 3.1 times better
Each typed value now provides a .Clone() function which returns a deep copy of it.
This avoids all the encoding, reducing the memory footprint and improving performance.
Some typed values now also utilize parallelism (slices and hash maps) to do it all faster.
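Roughly along these lines (a simplified sketch; the real typed values and their interfaces differ, and the parallel fan-out for slices/maps is omitted here):

```go
package main

import "fmt"

// Vars is a stand-in for a typed value holding a map of nested values.
type Vars struct {
	value map[string]any
}

// Clone returns a deep copy of the value without an encode/decode round trip.
func (v *Vars) Clone() *Vars {
	out := &Vars{value: make(map[string]any, len(v.value))}
	for k, val := range v.value {
		// Nested cloneable values are cloned recursively; plain values are
		// copied as-is.
		if c, ok := val.(interface{ Clone() *Vars }); ok {
			out.value[k] = c.Clone()
			continue
		}
		out.value[k] = val
	}
	return out
}

func main() {
	orig := &Vars{value: map[string]any{
		"a":      1,
		"nested": &Vars{value: map[string]any{"b": 2}},
	}}

	cp := orig.Clone()
	cp.value["a"] = 42
	fmt.Println(orig.value["a"], cp.value["a"]) // 1 42 -- the original is untouched
}
```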
After adding it ⭐ properly ⭐...
Before
136 11005555 ns/op 2573102 B/op 45278 allocs/op
After
270 4736002 ns/op 1561992 B/op 19903 allocs/op
Comparing the numbers:
time 11005555/4736002 = 2.32 times better
space 2573102/1561992 = 1.64 times better
allocations 45278/19903 = 2.27 times better
Still much better. There is a bit of impact due to parallelism, which doesn't do much for this benchmark because of its smaller size. In the real world, with larger states, this should be better.
Might opt to remove it...
The issue with how stack traces currently work is that every step makes a frame and pushes it to the stack. The problem occurs with large/tight loops, as it leaves a gigantic memory footprint which can crash the system.
Before:
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/automation/service
BenchmarkSessionStackTraces_1000-12 42758 28594 ns/op 87288 B/op 15 allocs/op
BenchmarkSessionStackTraces_10000-12 3430 348308 ns/op 1160442 B/op 23 allocs/op
BenchmarkSessionStackTraces_100000-12 229 5299188 ns/op 13317368 B/op 33 allocs/op
BenchmarkSessionStackTraces_1000000-12 19 52896750 ns/op 128431352 B/op 43 allocs/op
BenchmarkSessionStackTraces_10000000-12 3 436311292 ns/op 1202337018 B/op 53 allocs/op
After:
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/automation/service
BenchmarkSessionStackTraces_1000-12 40142 29860 ns/op 120 B/op 4 allocs/op
BenchmarkSessionStackTraces_10000-12 4387 264930 ns/op 120 B/op 4 allocs/op
BenchmarkSessionStackTraces_100000-12 447 2735284 ns/op 120 B/op 4 allocs/op
BenchmarkSessionStackTraces_1000000-12 42 26596636 ns/op 120 B/op 4 allocs/op
BenchmarkSessionStackTraces_10000000-12 4 265553667 ns/op 120 B/op 4 allocs/op
Comparing the numbers we get this:
time (average of the per-size ratios) = 1.562 times better
space = constant 120 B/op instead of growing from ~87 KB/op to ~1.2 GB/op
allocations (average of the per-size ratios) = 8.35 times better
IMPORTANT
This is a very specific scenario, so the numbers might be even more impressive in real life, but the gains will only show up for larger workflows (such as processing thousands of records and such).
Stack traces now only preserve the last set of steps from the iteration. For example, when something fails or completes, only the last iteration will be preserved. This reduces memory but still provides enough insight into where we were and what we've done. This is closer to what other languages do.
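Conceptually, the retention works like this (a hedged sketch; the real session/stack-trace types differ):

```go
package main

import "fmt"

type frame struct {
	stepID uint64
}

// trace keeps only the frames of the most recent loop iteration instead of
// appending every executed step forever.
type trace struct {
	frames []frame
}

// push records a step; when the loop wraps back to its first step, the frames
// from the previous iteration are dropped, so memory stays bounded.
func (t *trace) push(f frame, iterationStart bool) {
	if iterationStart {
		t.frames = t.frames[:0] // reuse the backing array, discard old frames
	}
	t.frames = append(t.frames, f)
}

func main() {
	var t trace
	for iter := 0; iter < 10000; iter++ {
		for step := uint64(1); step <= 3; step++ {
			t.push(frame{stepID: step}, step == 1)
		}
	}
	fmt.Println(len(t.frames)) // 3 -- only the last iteration is kept
}
```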
Updating the workflow to utilize reworked stack traces (used a longer iterator for this case)
Before:
6 277497340 ns/op 73437206 B/op 552388 allocs/op
After
49 22680575 ns/op 9430545 B/op 151815 allocs/op
Comparing with the original we get this:
time 277497340/22680575 = 12.23 times better
space 73437206/9430545 = 7.78 times better
allocations 552388/151815 = 3.63 times better
Running the same (1k iterations) benchmark before everything and now:
Before:
4 305242125 ns/op 118081396 B/op 1278072 allocs/op
After:
52 23118595 ns/op 9559500 B/op 159828 allocs/op
Comparing with the original we get this:
time 305242125/23118595 = 13.2 times better
space 118081396/9559500 = 12.35 times better
allocations 1278072/159828 = 8 times better
Using a more sophisticated approach instead of an infinite loop ticking every x ns: the solution uses a channel which blocks the waiter until the workflow resolves.
When testing, no difference was found, so we reverted to the original solution with a tighter loop.
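For the record, the two waiting strategies side by side (a minimal sketch; the real code waits on the workflow session rather than a bare channel):

```go
package main

import (
	"fmt"
	"time"
)

// waitWithTicker polls for completion every interval -- the original approach;
// a coarse interval adds up to the gaps seen between benchmark iterations.
func waitWithTicker(done func() bool, interval time.Duration) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for !done() {
		<-t.C
	}
}

// waitOnChannel blocks until the workflow signals completion on the channel,
// so there is no polling interval to tune.
func waitOnChannel(done <-chan struct{}) {
	<-done
}

func main() {
	done := make(chan struct{})
	go func() {
		time.Sleep(10 * time.Millisecond) // pretend the workflow is running
		close(done)
	}()

	waitOnChannel(done)
	fmt.Println("workflow resolved")

	finished := true
	waitWithTicker(func() bool { return finished }, time.Millisecond)
	fmt.Println("polled until done")
}
```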
The current state:
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/id
BenchmarkGenerator-12 30810 38893 ns/op 0 B/op 0 allocs/op
Running the most optimal version, with a few concurrent routines caching their generated IDs into the queue, we'd get this (speculation):
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/id
BenchmarkGenerator-12 1000000 3260 ns/op 0 B/op 0 allocs/op
Comparing the numbers (speculation).
per sec 1000000/30810 = 32.45 times faster
time 38893/3260 = 11.9 times faster
space 0/0 = optimal space
allocations 0/0 = optimal allocations
Sadly, we can't do this since it breaks an important constraint -- the ordering between IDs must be preserved. This would break our paging and everything else relying on newer entries having a higher ID.
Utilizing the cache with a single routine, we get:
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/id
BenchmarkGenerator-12 163213 39027 ns/op 0 B/op 0 allocs/op
Comparing the numbers, we get:
per sec 163213/30810 = 5.29 times more
time 38893/39027 = 0.99 times slower (marginal difference)
space 0/0 = optimal space
allocations 0/0 = optimal allocations
Interestingly, the time is about the same but the number of IDs per second is increased. This is probably due to routines, but can't say for sure why this happens.
The package now defines a global channel which is filled by goroutines.
The Next() function pulls from the channel.
When initializing, we sleep for a bit to let the channel warm up.
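A minimal sketch of the mechanism (illustrative only; a single filler goroutine is shown here to keep the IDs ordered, per the constraint above, and the generator is a stand-in for sonyflake):

```go
package main

import (
	"fmt"
	"time"
)

// ids is the buffered channel acting as a small cache of pre-generated IDs.
var ids = make(chan uint64, 256)

// Init starts a single filler goroutine (one producer keeps the IDs ordered)
// and waits briefly so the channel has some IDs ready before first use.
func Init(generate func() uint64) {
	go func() {
		for {
			ids <- generate() // blocks once the buffer is full
		}
	}()
	time.Sleep(10 * time.Millisecond) // let the cache warm up
}

// Next returns the next pre-generated ID from the cache.
func Next() uint64 {
	return <-ids
}

func main() {
	var counter uint64
	Init(func() uint64 { counter++; return counter }) // stand-in for sonyflake

	for i := 0; i < 5; i++ {
		fmt.Println(Next())
	}
}
```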
Benchmark for assigning values to compose record values. Didn't grab a "before" benchmark; added out of curiosity. The numbers look impressive enough so we don't need to bother for a while.
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/compose/automation
BenchmarkAssignToComposeRecordValues-12 22131764 53.22 ns/op 96 B/op 1 allocs/op
PASS
Benchmark for accessing compose record values. Didn't grab a "before" benchmark; added out of curiosity. The numbers look impressive enough so we don't need to bother for a while.
goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/compose/automation
BenchmarkRecordFieldValuesAccess-12 1947148 593.6 ns/op 544 B/op 19 allocs/op
PASS
The current implementation has some issues/mis-implementations that degrade system performance for larger/more complex logic.
Known issues
Potential issues
- [ ] ~~heavy mutex locks: typed values have a bunch of locks since they're pointers living in a parallel world. Not sure how much of an impact they have since these locks barely ever block so it might be whatever.~~ (not an issue)
Additionally found issues