TimLariviere / Fabulous-new

Fabulous v2 - Work in progress
https://timothelariviere.com/Fabulous-new/

Optimize diffing redo #37

Closed · twop closed this 2 years ago

twop commented 2 years ago

Note: depends on #36. Fixes https://github.com/TimLariviere/Fabulous-new/issues/17

Motivation

One of the goals of Fabulous is to be fast, in other words to produce a minimal amount of overhead on top of the underlying UI framework (like Xamarin.Forms or MAUI).

The way I see it, we need to optimize 3 things:

  1. An API that allows us to stop diffing subtrees; the attempt to cover that is #36.
  2. A very efficient way to create Widget UI trees.
  3. A very efficient way to calculate and apply the diff of the previous vs the next Widget UI tree.

This PR is mostly concerned with 2 and 3.


Approach

  1. Allocate less.
  2. Take advantage of the fact that the computation expressions needed for building children, as well as the diffing itself, are internal details of the framework.
  3. Use mutable data structures for things that are not observable by a user (such as diffing and the CE internals), and use optimized immutable data structures for anything a user can poke at (see the sketch after this list).
  4. Utilize stack-allocated collections and data structures wherever it makes sense.
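To make point 3 concrete, here is a minimal sketch, assuming a builder-style API; `Widget` and `ChildrenBuilder` are illustrative names, not the actual Fabulous types. The framework is free to mutate a `ResizeArray` while the builder runs, because user code only ever sees the finished immutable record.

```fsharp
// Illustrative only: not the actual Fabulous CE or Widget representation.
type Widget = { Name: string; Children: Widget[] }

/// Internal mutable accumulator; it never escapes the framework.
type internal ChildrenBuilder() =
    let items = ResizeArray<Widget>()
    member _.Add(w: Widget) = items.Add w
    /// The only thing handed back out is an immutable snapshot.
    member _.Build(name: string) : Widget =
        { Name = name; Children = items.ToArray() }

let stack =
    let b = ChildrenBuilder()
    b.Add({ Name = "Label"; Children = [||] })
    b.Add({ Name = "Button"; Children = [||] })
    b.Build("VStack")
```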

What is done

Stack allocated collections

MutStackArray1

ArraySlice

DiffBuilder

StackArray3

StackList
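The bodies of these types aren't reproduced here, but the shared idea is roughly the following. This is a hypothetical sketch; the real MutStackArray1 / StackArray3 / StackList differ in layout and API. A struct keeps the first few items in fields, so small collections never touch the heap and only spill into a heap-allocated buffer once they outgrow the inline capacity.

```fsharp
// Hypothetical sketch of a small, mostly stack-allocated list; not the actual implementation.
[<Struct>]
type SmallList3<'T> =
    val mutable Count: int
    val mutable Item0: 'T
    val mutable Item1: 'T
    val mutable Item2: 'T
    val mutable Overflow: ResizeArray<'T>   // created lazily, only for the 4th item onwards

    member this.Add(item: 'T) =
        match this.Count with
        | 0 -> this.Item0 <- item
        | 1 -> this.Item1 <- item
        | 2 -> this.Item2 <- item
        | _ ->
            if isNull this.Overflow then this.Overflow <- ResizeArray<'T>()
            this.Overflow.Add item
        this.Count <- this.Count + 1

// Must be 'let mutable' so the member calls mutate the struct in place.
// As a struct local, the list lives on the stack until (and unless) it overflows.
let mutable attrs = SmallList3<string>()
attrs.Add("TextColor")
attrs.Add("FontSize")
```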

Usage summary

Other changes

Benchmark

Finally the numbers!

Not only did the diffing get significantly faster, it now also allocates almost 50% less memory.

Note that the memory numbers below were taken with a different growth function. The growth rate of MutStackArray1 is now 1.5 (note that this is less than ResizeArray's factor of 2); it is possible that we should be even more conservative and use 1.3, because most of our collections are small.
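For reference, the growth policy described above amounts to something like this (a sketch; the exact formula in MutStackArray1 may differ):

```fsharp
// ResizeArray doubles its capacity (4, 8, 16, 32, ...).
// A 1.5x factor grows more conservatively (4, 6, 9, 13, ...), wasting less memory
// for the small collections that dominate widget trees; 1.3 would be more conservative still.
let growCapacity (current: int) =
    max 4 (current + current / 2)
```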

Summary

ProcessMessages benchmark

Before

depth: 10
time: 183.9 ms
memory: 222 MB

---

depth: 15
time: 3,332.6 ms 
memory: 2,472 MB

After

depth: 10
time: 146.4 ms (-37.5 ms)
memory: 127 MB (-96 MB)

---

depth: 15
time: 2,941.6 ms (-391 ms)
memory: 1,412 MB (-1060 MB)

On main branch

| Method          | depth | Mean       | Error    | StdDev   | Gen 0       | Gen 1       | Gen 2      | Allocated |
|-----------------|-------|------------|----------|----------|-------------|-------------|------------|-----------|
| ProcessMessages | 10    | 183.9 ms   | 3.53 ms  | 3.93 ms  | 68333.3333  | 16000.0000  | 3333.3333  | 222 MB    |
| ProcessMessages | 15    | 3,332.6 ms | 50.03 ms | 44.35 ms | 759000.0000 | 178000.0000 | 78000.0000 | 2,472 MB  |

After these changes

| Method          | depth | Mean       | Error    | StdDev   | Gen 0       | Gen 1       | Gen 2      | Allocated |
|-----------------|-------|------------|----------|----------|-------------|-------------|------------|-----------|
| ProcessMessages | 10    | 146.4 ms   | 2.11 ms  | 1.97 ms  | 43250.0000  | 17000.0000  | 1500.0000  | 127 MB    |
| ProcessMessages | 15    | 2,941.6 ms | 49.67 ms | 46.46 ms | 460000.0000 | 166000.0000 | 67000.0000 | 1,412 MB  |
twop commented 2 years ago

Used ValueOption instead of option, and Span instead of ArraySlice where appropriate.
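Roughly what that swap looks like (a hypothetical helper, not the actual diffing code): `ValueOption` is a struct, so `ValueSome`/`ValueNone` don't allocate the way the reference-type `option` does, and a `ReadOnlySpan` is a view over existing memory, so slicing it doesn't copy the array.

```fsharp
open System

// Hypothetical helper, just to illustrate the swap.
let tryHead (values: ReadOnlySpan<int>) : ValueOption<int> =
    if values.Length > 0 then ValueSome values.[0] else ValueNone

let data = [| 10; 20; 30 |]
// Slicing a span produces another view over the same array, not a copy.
let head = tryHead (ReadOnlySpan<int>(data).Slice(1))   // ValueSome 20
```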

Here are the latest results (note the slight increase in memory consumption, but slightly better perf):

| Method          | depth | Mean       | Error    | StdDev   | Gen 0       | Gen 1       | Gen 2      | Allocated |
|-----------------|-------|------------|----------|----------|-------------|-------------|------------|-----------|
| ProcessMessages | 10    | 147.5 ms   | 1.66 ms  | 1.55 ms  | 42750.0000  | 16500.0000  | 1500.0000  | 138 MB    |
| ProcessMessages | 15    | 2,910.4 ms | 36.78 ms | 34.40 ms | 476000.0000 | 158000.0000 | 61000.0000 | 1,536 MB  |
TimLariviere commented 2 years ago

> Here are the latest results (note the slight increase in memory consumption, but slightly better perf)

Not sure if it's really worth it. We get slightly worse perf and memory at low depth, and a barely noticeable perf win compared to what we will pay in GC pauses at bigger depth.

What do you think?

twop commented 2 years ago

> > Here are the latest results (note the slight increase in memory consumption, but slightly better perf)
>
> Not sure if it's really worth it. We get slightly worse perf and memory at low depth, and a barely noticeable perf win compared to what we will pay in GC pauses at bigger depth.
>
> What do you think?

I think it is. Even though it uses more memory, it is easier on the GC and gives faster runtime perf. Note that on the M1, cache misses are less severe because of the large cache and really wide cache lines.

So I think it is better. I'm curious how it is going to behave on mobile devices.

twop commented 2 years ago

@TimLariviere I believe I resolved/fixed all the comments. Please take a look one more time. Happy to fix any other issues/concerns.