cortezaproject / corteza

Low-code platform
https://cortezaproject.org
Apache License 2.0

Optimize workflow execution engine #1682

Open tjerman opened 4 months ago

tjerman commented 4 months ago

The current implementation has several issues and misimplementations that degrade system performance for larger or more complex logic.

Known issues

Potential issues

~- [ ] heavy mutex locks: typed values carry a bunch of locks since they're pointers living in a parallel world. These locks barely ever block, so the impact is likely negligible.~ (not an issue)

Additionally found issues

tjerman commented 4 months ago

Benchmarks & Analysis

Simple workflow with sequence iterator

The workflow is a simple iterator with 10k iterations:

goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/tests/workflows
Benchmark_bench_exec_base-12                   1        4501291542 ns/op        3366690024 B/op 39866614 allocs/op

Image

The no. 1 bottleneck is the state change handler, primarily JSON marshalling.

Image

JSON marshalling happens in two places -- making stack traces (a primitive approach to deep cloning expr variables) and flushing to the database.
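For reference, the JSON-based deep-clone pattern being discussed works roughly like this (a minimal sketch; `jsonClone` is a hypothetical helper, not the actual corteza code):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jsonClone sketches the primitive deep-clone approach: marshal the value
// to JSON and unmarshal it into a fresh copy. Every call allocates the
// intermediate byte slice plus all decoded values, which is what shows up
// as the marshalling bottleneck in the profile.
func jsonClone(src, dst any) error {
	raw, err := json.Marshal(src)
	if err != nil {
		return err
	}
	return json.Unmarshal(raw, dst)
}

func main() {
	orig := map[string]any{"vars": []any{"a", "b"}}
	var copied map[string]any
	if err := jsonClone(orig, &copied); err != nil {
		panic(err)
	}
	// Mutating the copy leaves the original untouched -- a deep copy.
	copied["vars"].([]any)[0] = "changed"
	fmt.Println(orig["vars"], copied["vars"])
}
```

The round trip is correct but pays for encoding, decoding, and reflection on every state change, which is why it dominates the profile above.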

BM pre_iter_10000

tjerman commented 4 months ago

Removing the state change handler

Removing the state change handler to see what the best case scenario would be:

goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/tests/workflows
Benchmark_bench_exec_base-12                   1        1001210917 ns/op        179357032 B/op   1040858 allocs/op

Comparing with the original we get this (speculation):

time 4501291542/1001210917 = 4.49 times better
space 3366690024/179357032 = 18.7 times better
allocations 39866614/1040858 = 38 times better

Image

The next bottleneck seems to be expression evaluation. Investigating, we see pkg/expr does a bunch of operations over strings and slices (causing high memory pressure). We can probably reduce that by introducing some intermediate structure.

tjerman commented 4 months ago

Processing expressions

Benchmarking pkg/expr#PathSplit we see:

goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/expr
BenchmarkSplit-12        1948309           613.0 ns/op      4370 B/op          7 allocs/op

Introducing some aux structure to better traverse expressions, we go to this:

goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/expr
BenchmarkPath-12        14496174                75.33 ns/op            0 B/op          0 allocs/op

Comparing the numbers:

time 613.0/75.33 = 8.13 times better
space 4370/0 = optimal space
allocations 7/0 = optimal allocations

Rework insight

We introduce this bad boy:

type exprPath struct {
    path   string
    i      int
    isLast bool

    start, end int
}

You pass in an expression, which gets cached along with some indexes and meta bits. When you request the next bit of the expression, the struct slices the original string. Thanks to Go's memory model, this doesn't allocate any new memory since the underlying string remains the same -- only the indexes change (this can also lead to memory leaks, so be careful).
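A minimal sketch of the slicing idea, assuming a hypothetical `next` method (the real struct carries more meta bits than shown here):

```go
package main

import "fmt"

// exprPath walks a dotted expression path without allocating: each call
// to next returns a sub-slice of the original string, so only the
// start/end indexes change. The returned string shares the backing array
// of the full path, which is also how the memory-leak caveat arises if a
// small segment pins a large string in memory.
type exprPath struct {
	path       string
	start, end int
}

func newExprPath(p string) *exprPath {
	return &exprPath{path: p, end: -1}
}

// next returns the next path segment and false once the path is exhausted.
func (p *exprPath) next() (string, bool) {
	if p.end >= len(p.path) {
		return "", false
	}
	p.start = p.end + 1
	p.end = p.start
	for p.end < len(p.path) && p.path[p.end] != '.' {
		p.end++
	}
	return p.path[p.start:p.end], true
}

func main() {
	p := newExprPath("record.values.fieldName")
	for s, ok := p.next(); ok; s, ok = p.next() {
		fmt.Println(s)
	}
}
```

Compared to splitting into a `[]string` up front, nothing escapes to the heap, which matches the 0 B/op result above.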

tjerman commented 4 months ago

Updating expr.Assign with the reworked structure, we go from

goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/expr
BenchmarkAssign-12       1811497           662.7 ns/op      4368 B/op          6 allocs/op

to

goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/expr
BenchmarkAssign-12       9167824           133.1 ns/op         0 B/op          0 allocs/op

Comparing the numbers:

time 662.7/133.1 = 4.97 times better
space 4368/0 = optimal space
allocations 6/0 = optimal allocations
tjerman commented 4 months ago

Simple workflow without state change handler and reworked expressions

Same iterator but the state change handler is turned off and expressions tweaked

Original numbers (without state change handler):

goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/tests/workflows
Benchmark_bench_exec_base-12                   1        1001210917 ns/op        179357032 B/op   1040858 allocs/op

We get to this:

time 1001210917/1001463750 = 1.00 (marginal difference)
space 179357032/54809432 = 3.27 times better
allocations 1040858/930611 = 1.1 times better

Image

Looking at the numbers, the time is practically the same while overall memory use is noticeably lower.

The remaining bottleneck appears to be var/state management. Per run we're using up ~50MB, which isn't that bad (I think).

We can't do much more as far as this is concerned. At some point we could rework expressions to rely less on strings and maps, but for now we'll leave it as is.

tjerman commented 4 months ago

Trying to figure out why it's taking so long

Taking a peek at the CPU profile, we see this:

Image

This hints that some locking/blocking is going on. My guess would be the locks we have on typed values (hindsight is 20/20; it was not 💔)

tjerman commented 4 months ago

Grabbing an execution trace we see something interesting:

(I forgot to grab a screenshot of gaps between benchmark iterations)

Image

tjerman commented 4 months ago

Tested on some other benchmark and that one looks fine... I can't figure out exactly why this happens but everything points to how we utilize routines and channels under the hood.

Mutex profile is empty so that shouldn't be the problem.

tjerman commented 4 months ago

Figuring it all out...

The random gaps were caused by sonyflake (since it can only generate 255 IDs per 10ms; https://github.com/sony/sonyflake). The gap between benchmark iterations is caused by how we wait for results -- we have a ticker that runs every x ms.

(Sonyflake explained later)

(The issue with the waiter was temporarily mocked out by increasing the frequency)

With all this in place it now looks like this:

Image

Original benchmark (excluding the state change handler)

goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/tests/workflows
Benchmark_bench_exec_base-12                   1        1001210917 ns/op        179357032 B/op   1040858 allocs/op

We get to this

goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/tests/workflows
Benchmark_bench_exec_base-12                 100          10258258 ns/op         5532122 B/op      93417 allocs/op

Comparing with the original we get this:

time 1001210917/10258258 = 97.6 times better
space 179357032/5532122 = 32.42 times better
allocations 1040858/93417 = 11.14 times better
tjerman commented 4 months ago

Temporarily moving to product backlog to do some profiling on DAL

tjerman commented 4 months ago

Workflow state management

Looking into the state change handler and how the state values are cloned. Originally we used a JSON marshal into JSON unmarshal which is quite resource heavy.

Quickly drafted a solution to clone values and not use JSON stuff.

before:

     136          11005555 ns/op         2573102 B/op      45278 allocs/op

Image

After:

     279           4009939 ns/op          907311 B/op      14388 allocs/op

Image

Comparing the numbers we see (speculative):

time 11005555/4009939 = 2.74 times better
space 2573102/907311 = 2.8 times better
allocations 45278/14388 = 3.1 times better

Implementation notes

Each typed value now provides a .Clone() function which returns a deep copy. This avoids all the encoding, reducing the memory footprint and improving performance. Some typed values (slices and hash maps) now also utilize parallelism to do it all faster.
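A minimal sketch of the Clone idea, using a hypothetical `Vars` type (the actual corteza typed values differ; this only shows the encoding-free copy):

```go
package main

import (
	"fmt"
	"sync"
)

// Vars is a hypothetical stand-in for a typed hash-map value, with the
// kind of lock the real typed values carry.
type Vars struct {
	mu sync.RWMutex
	v  map[string]string
}

// Clone returns a deep copy without any JSON encoding. Larger typed
// values can fan the per-entry copies out to goroutines; for a flat
// string map a plain loop keeps the sketch small.
func (t *Vars) Clone() *Vars {
	t.mu.RLock()
	defer t.mu.RUnlock()
	out := &Vars{v: make(map[string]string, len(t.v))}
	for k, val := range t.v {
		out.v[k] = val
	}
	return out
}

func main() {
	a := &Vars{v: map[string]string{"x": "1"}}
	b := a.Clone()
	b.v["x"] = "2"
	// Mutating the clone leaves the original untouched.
	fmt.Println(a.v["x"], b.v["x"])
}
```

The copy touches each entry exactly once, so it skips the marshal/unmarshal round trip entirely, which is where the 2-3x gains above come from.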

tjerman commented 4 months ago

After adding it ⭐ properly ⭐...

tjerman commented 4 months ago

Before

136          11005555 ns/op         2573102 B/op      45278 allocs/op

After

270           4736002 ns/op         1561992 B/op      19903 allocs/op

Comparing the numbers:

time 11005555/4736002 = 2.32 times better
space 2573102/1561992 = 1.64 times better
allocations 45278/19903 = 2.27 times better

Still much better. There is a bit of impact due to parallelism, which doesn't do much for this benchmark because of the smaller state size. In the real world, with larger states, this should be better.

We might opt to remove it...

tjerman commented 4 months ago

Execution stack traces

The issue with how stack traces currently work is that every step creates a frame and pushes it onto the stack. The problem occurs with large/tight loops, as this leaves a gigantic memory footprint which can crash the system.

Before:

goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/automation/service
BenchmarkSessionStackTraces_1000-12                42758             28594 ns/op           87288 B/op         15 allocs/op
BenchmarkSessionStackTraces_10000-12                3430            348308 ns/op         1160442 B/op         23 allocs/op
BenchmarkSessionStackTraces_100000-12                229           5299188 ns/op        13317368 B/op         33 allocs/op
BenchmarkSessionStackTraces_1000000-12                19          52896750 ns/op        128431352 B/op        43 allocs/op
BenchmarkSessionStackTraces_10000000-12                3         436311292 ns/op        1202337018 B/op       53 allocs/op

After:

goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/automation/service
BenchmarkSessionStackTraces_1000-12                40142             29860 ns/op             120 B/op          4 allocs/op
BenchmarkSessionStackTraces_10000-12                4387            264930 ns/op             120 B/op          4 allocs/op
BenchmarkSessionStackTraces_100000-12                447           2735284 ns/op             120 B/op          4 allocs/op
BenchmarkSessionStackTraces_1000000-12                42          26596636 ns/op             120 B/op          4 allocs/op
BenchmarkSessionStackTraces_10000000-12                4         265553667 ns/op             120 B/op          4 allocs/op

Comparing the numbers we get this:

time (average of all times) = 1.562 times better
space = constant 120 B/op instead of growth proportional to iteration count
allocations (average of all allocs) = 8.35 times better

IMPORTANT

This is a very specific scenario, so real-world numbers may differ; the gains will show up for larger workflows (such as ones processing thousands of records).

Implementation notes

Stack traces now only preserve the last set of steps from the iteration. For example, when something fails or completes, only the last iteration is preserved. This reduces memory use while still providing enough insight into where we were and what we've done. This is closer to what other languages do.
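The idea can be sketched like this (hypothetical `Frame`/`trace` types, not the actual implementation):

```go
package main

import "fmt"

// Frame is a hypothetical trace frame identified by the workflow step
// that produced it.
type Frame struct{ StepID int }

type trace struct {
	frames []Frame
}

// push resets the buffer whenever the iterator's entry step comes around
// again, so memory stays bounded by the length of one iteration instead
// of growing with the total iteration count.
func (t *trace) push(f Frame, iterEntry int) {
	if f.StepID == iterEntry {
		t.frames = t.frames[:0] // reuse the backing array, no new alloc
	}
	t.frames = append(t.frames, f)
}

func main() {
	var t trace
	// Simulate 3 iterations of a loop with entry step 1 and body step 2.
	for i := 0; i < 3; i++ {
		t.push(Frame{StepID: 1}, 1)
		t.push(Frame{StepID: 2}, 1)
	}
	// Only the last iteration's frames remain.
	fmt.Println(len(t.frames))
}
```

Because the slice is truncated rather than reallocated, steady-state allocations stay constant, matching the flat 120 B/op in the benchmark above.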

tjerman commented 4 months ago

Updating the workflow to utilize reworked stack traces (used a longer iterator for this case)

Before:

       6         277497340 ns/op        73437206 B/op     552388 allocs/op

After

      49          22680575 ns/op         9430545 B/op     151815 allocs/op

Comparing with the original we get this:

time 277497340/22680575 = 12.23 times better
space 73437206/9430545 = 7.78 times better
allocations 552388/151815 = 3.63 times better
tjerman commented 4 months ago

Benching everything together

Running the same (1k iterations) benchmark before everything and now:

Before:

       4         305242125 ns/op        118081396 B/op   1278072 allocs/op

After:

      52          23118595 ns/op         9559500 B/op     159828 allocs/op

Comparing with the original we get this:

time 305242125/23118595 = 13.2 times better
space 118081396/9559500 = 12.35 times better
allocations 1278072/159828 = 8.0 times better
tjerman commented 4 months ago

Changing the workflow exec waiter to use channels

Using a more sophisticated approach over an infinite loop ticking every x ns. The solution uses a channel which blocks the waiter until the workflow resolves.

When testing, no difference was found. Reverted to original solution with a tighter loop.

tjerman commented 4 months ago

Improving ID generation

The current state:

goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/id
BenchmarkGenerator-12          30810         38893 ns/op           0 B/op          0 allocs/op

Running the most optimal version with a few concurrent routines caching their values to the queue, we'd get to (speculation)

goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/id
BenchmarkGenerator-12        1000000          3260 ns/op           0 B/op          0 allocs/op

Comparing the numbers (speculation).

per sec 1000000/30810 = 32.45 times faster
time 38893/3260 = 11.9 times faster
space 0/0 = optimal space
allocations 0/0 = optimal allocations

Sadly, we can't do this since it breaks an important constraint: the ordering between IDs must be preserved. That would break our paging and everything relying on newer entries having higher IDs.

Utilizing the cache with a single routine, we get:

goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/pkg/id
BenchmarkGenerator-12         163213         39027 ns/op           0 B/op          0 allocs/op

Comparing the numbers, we get:

per sec 163213/30810 = 5.29 times more
time 38893/39027 = 0.99, marginally slower (marginal difference)
space 0/0 = optimal space
allocations 0/0 = optimal allocations

Interestingly, the per-op time is about the same but the number of IDs per second increased. This is probably due to the goroutines, but I can't say for sure why this happens.

Implementation notes

The package now defines a global channel which is filled by a goroutine. The Next() function pulls from the channel.

When initializing, we sleep for a bit to let the channel warm up.

tjerman commented 4 months ago

Interesting benchmarks

Benchmark for assigning values to compose record values. Didn't grab a pre-rework benchmark; added out of curiosity. The numbers look good enough that we don't need to bother for a while.

goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/compose/automation
BenchmarkAssignToComposeRecordValues-12     22131764            53.22 ns/op       96 B/op          1 allocs/op

Benchmark for accessing compose record values (same caveat: no pre-rework numbers, added out of curiosity).

goos: darwin
goarch: arm64
pkg: github.com/cortezaproject/corteza/server/compose/automation
BenchmarkRecordFieldValuesAccess-12      1947148           593.6 ns/op       544 B/op         19 allocs/op
PASS
tjerman commented 4 months ago

EOF
