MithrilJS / mithril.js

A JavaScript Framework for Building Brilliant Applications
https://mithril.js.org
MIT License

Optimization ideas #1669

Closed · dead-claudia closed 7 years ago

dead-claudia commented 7 years ago

TL;DR: Let's do some heavy memory layout optimization to make GC more predictable and speed up creation/destruction. Less garbage is always nice, and engines can do some nice things if we let them.

See this gist for more details.

/cc @lhorie @pygy @tivac

dead-claudia commented 7 years ago

Note that this 100% concerns rendering, and shouldn't require many significant changes in other files, other than to update DOM nodes.

pygy commented 7 years ago

Another very low-hanging fruit is changing the IIFE style of the wrapper.

JS engines delay the full parsing of functions until call time. On load, they just run a basic "validator" parser over function bodies that generates no IR or machine code. Since most functions are not needed on page load, this leads to faster load times.

They have exceptions hardcoded for (function(){}())-style functions (and possibly (function(){})() as well). I think they dispatch on expressions that start with (function to skip the validator and compile the function immediately, but AFAIK new function(){} will still be parsed fast then slow, leading to slower load times.
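
For illustration, here's roughly what that means for the wrapper (a sketch based on my understanding of those heuristics; exact behavior varies by engine and version, so treat this as an assumption to verify):

// Recognized by the "(function" heuristic: compiled eagerly, exactly once.
var m1 = (function() { /* module body */ }());
var m2 = (function() { /* module body */ })();

// Not recognized: run through the lazy "validator" on load, then fully
// parsed a second time at first call, so the work is done twice.
var m3 = new function() { /* module body */ };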

I'll have to find the exact incantations (a priori, if we use

pygy commented 7 years ago

I'd appreciate it if you could wait for the (WIP) #1653 fix to be merged before optimizing.

It would also be nice to have tests for the various pool scenarios before we go wild on the engine.

dead-claudia commented 7 years ago

@pygy

I'd appreciate it if you could wait for the (WIP) [...]

The only thing I have to merge is #1672, which is completely independent of #1653. I also have #1671, but that is mostly unrelated to this (it enables strict mode and enforces the linter in more places).

lhorie commented 7 years ago

Copying my comment from the gist:

My two cents:

1 - That "monstrosity" is the search space reduction algorithm for list diffs. Last I checked, none of the other libraries split this down, precisely for perf reasons. Can you provide some more concrete suggestions? I don't think "break into smaller functions so JIT can re-inline them into effectively the same thing" is a task worth pursuing.

2 - I've kept my hands off of that because AFAIK @pygy is working on that

3 - not sure I understand this one. You're saying we should instead copy everything from the new tree back to the old one? That makes no sense. For example, say you have data = [1,2,3] and in the view data.map(n => m("li", n)). Then you do data.shift(). This causes O(n) worth of copying when diffing, which is horrible.

4 - again, I don't understand the proposal. Are you saying we should create twice as many objects to enforce encapsulation of properties like .events?

5 - I don't understand this one either. Why is 4+ vnode types better than one? The vast majority of vnode properties apply to all vnode types, so conceptually dividing them doesn't really make any potential polymorphic types that much smaller, and it just ups the smartness requirement for the JIT engine.

6 - You keep saying vnodes are big, but they really aren't. They have 11 properties. By comparison, Inferno nodes have 8. FWIW, Dominic mentioned a while back that they went back to straight vdom, no vnode pooling. DOM pooling already happens with recycling. You're welcome to try to optimize it more but I already spent a fair amount of time fidgeting w/ that trying to lower vdom bench results (where recycling improvements can be tested most easily) and the numbers are very close to inferno already. At this point, I'm more interested in making sure it's correct than optimizing it

Anyways, I don't mean to just come here and shoot the whole thing down, so I'm going to add my own brain dump of perf-related things:

1 - First and foremost, I have done ZERO IRHydra work on Mithril. That's the lowest-hanging fruit performance-wise.

2 - Mithril doesn't implement the kivi/ivi algorithm for reducing the number of DOM shuffles in large mostly non-sorted sorts. IMHO, this affects very few real use cases, but it is nonetheless a known optimization that has not been implemented.

3 - Mithril currently completely chokes on the asm.js validator. Making it not choke would likely uncover some addressable issues.

dead-claudia commented 7 years ago

I'll comment with a more detailed response on the rest later (I'm doing this from mobile), but I'd be shocked if you could even get a useful vdom library that also validates as asm.js without resorting to something like Emscripten. And even then, goodbye to easy JS use, because the Emscripten JS API requires explicit memory management, too.


thysultan commented 7 years ago

@lhorie 3 - You can just mutate it to reflect the current state, where A is a persistent tree representing the vdom state. For example:

  • removing 3: A = [1,2,3], B = [1,2] --> A.pop()
  • adding 4: A = [1,2,3], B = [1,2,3,4] --> A.push(4)
  • replacing 2 with 4: A = [1,2,3], B = [1,4,3] --> A[1] = 4

pygy commented 7 years ago

@lhorie, re point 3: currently, render mutates the new vnode based on the old one, then throws away the old one. @isiahmeadows wants to do it the other way around, so that the vnodes being discarded haven't been marked old by the GC. Generational GCs do frequent passes over young objects. Objects that survive n cycles (for a small n in practice) are marked old and are skipped by the collector during these "young passes". Every N collection cycles, the GC does a full pass over all objects. These take more time and can cause perceived dropped frames.

Having younger objects means that they are collected early, which leads to smaller full passes and smoother frame rates.
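
To sketch the difference (hypothetical helper names and a simplified vnode shape, not the actual render code):

// Current direction: the freshly allocated vnode survives and the old vnode
// becomes garbage. Old vnodes have usually been promoted already, so
// reclaiming them waits for a costly full pass.
function updateKeepNew(old, vnode) {
  vnode.state = old.state
  vnode.dom = old.dom
  return vnode // old is now garbage, but it is no longer young
}

// Proposed direction: copy the new data back into the long-lived old vnode,
// so the short-lived new vnode dies young and is collected cheaply.
function updateKeepOld(old, vnode) {
  old.attrs = vnode.attrs
  old.children = vnode.children
  return old // vnode is now garbage, and it is still young
}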

pygy commented 7 years ago

@isiahmeadows re. 5, IIRC @lhorie benchmarked the cost of extra properties and it was deemed minor.

dead-claudia commented 7 years ago

@pygy re. 5 I'm speaking in terms of memory optimization, not speed optimization. Benchmarks aren't going to tell you that much, because they usually only render small trees. If you use the TodoMVC benchmark, you'd have to crank the list up to about 1000 to notice anything, because the DOM structure is so small. But heavier, real-world apps would notice this more.

A good example of this is incremental-dom: it is near the bottom in raw CPU benchmarks, but it still remains responsive in practice thanks to its lower memory footprint, and it especially shines on mobile for that very reason.

@lhorie Now that I'm here:

1 - That "monstrosity" is the search space reduction algorithm for list diffs. Last I checked, none of the other libraries split this down, precisely for perf reasons.

Examples?

Can you provide some more concrete suggestions? I don't think "break into smaller functions so JIT can re-inline them into effectively the same thing" is a task worth pursuing.

That was pretty much my suggestion. For a concrete example, consider this Acorn patch from a few years ago. It split up a large function with complex logic into smaller ones that the engine could handle more easily. I can tell you V8 isn't alone in this, although it's the most affected.

Here are a few more concrete examples of how splitting this up could result in faster and easier-to-maintain code (see the toy sketch after the list):

  1. These two loops:
    • A function + temporary object would move all the complicated logic syntactically out of the loop body, making it easier for VMs to predict branches.
    • Those would become a great place to search in IRHydra/etc. to ensure highly optimized assembly is happening, due to their sheer number of calls (a single average refresh can trigger hundreds of calls, even with matching trees).
    • It also makes a cleaner place to audit individual calls.
  2. Moving this branch out will boost readability, and could yield performance wins, but I'd have to profile first.
  3. Pulling this segment out will boost readability, but I'd have to profile it first to see if it'd get inlined (it might with TurboFan).
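
Here's a toy sketch of the general shape I mean (all names invented for illustration; this is not Mithril's actual diff code):

// Per-iteration logic pulled out of the loop body into a small function the
// JIT can inline, leaving the loop itself short and branch-predictable.
function diffStep(old, vnode) {
  if (old === vnode) return "skip"
  if (old != null && vnode != null && old.tag === vnode.tag) return "update"
  return "replace"
}

function diffLists(oldList, newList) {
  var ops = []
  var len = Math.max(oldList.length, newList.length)
  for (var i = 0; i < len; i++) {
    ops.push(diffStep(oldList[i], newList[i]))
  }
  return ops
}

diffLists([{tag: "li"}], [{tag: "li"}, {tag: "div"}]) // ["update", "replace"]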

2 - I've kept my hands off of that because AFAIK @pygy is working on that

That was my plan, too. I just noted it for relevance, and left implicit the assumption that I wouldn't be alone in tackling this if we were to proceed.

3 - not sure I understand this one. You're saying we should instead copy everything from the new tree back to the old one? That makes no sense. For example, say you have data = [1,2,3] and in the view data.map(n => m("li", n)). Then you do data.shift(). This causes O(n) worth of copying when diffing, which is horrible.

No. I just mean these two assignments should be going the other way, and that from oninit to onremove, hooks should see an identical vnode the same way they see an identical vnode.state, no matter how many times they render.

4 - again, I don't understand the proposal. Are you saying we should create twice as many objects to enforce encapsulation of properties like .events?

It won't double anything unless you're exclusively creating/removing vnodes and never updating them (which never occurs in practice, and is a horribly inefficient way to use any vdom library). And no, it does not enforce encapsulation, because oninit and friends will still see it as the vnode argument. It just means this will never be true:

const C = {
  view(vnode) {
    if (C.tree == null) C.parent = vnode
    return C.tree = m("div")
  }
}

m.render(document.body, m(C))
m.render(document.body, m(C))
assert.equal(C.parent, C.tree)

5 - I don't understand this one either. Why is 4+ vnode types better than one? The vast majority of vnode properties apply to all vnode types, so conceptually dividing them doesn't really make any potential polymorphic types that much smaller, and it just ups the smartness requirement for the JIT engine.

  1. It's just 4, no more and no less (see the sketch after this list).
  2. V8 doesn't move to megamorphic types until about 8 or 9 type mismatches IIRC, and other engines are similar.
  3. When you remove the extra properties, you'll see a significant GC drop due to fewer medium-sized allocations (and mildly reduced jank in stress tests - game devs will appreciate this).
  4. The only persistent allocations are when you first render. Currently, you re-allocate new vnode contexts and copy the old state to the new state, which is rather wasteful.
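
For illustration, here's a rough sketch of what those 4 types could look like as distinct constructors. The field sets are invented for the example and don't match Mithril's actual vnode layout:

function TextVnode(text) { this.tag = "#"; this.text = text; this.dom = null }
function ElementVnode(tag, attrs, children) {
  this.tag = tag; this.attrs = attrs; this.children = children
  this.dom = null; this.events = null
}
function FragmentVnode(children) { this.tag = "["; this.children = children; this.dom = null }
function ComponentVnode(comp, attrs) {
  this.tag = comp; this.attrs = attrs; this.state = null; this.instance = null
}
// Each constructor yields its own smaller hidden class, instead of one big
// shape where several properties are always null for a given node kind.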

6 - You keep saying vnodes are big, but they really aren't. They have 11 properties. By comparison, Inferno nodes have 8. FWIW, Dominic mentioned a while back that they went back to straight vdom, no vnode pooling. DOM pooling already happens with recycling. You're welcome to try to optimize it more but I already spent a fair amount of time fidgeting w/ that trying to lower vdom bench results (where recycling improvements can be tested most easily) and the numbers are very close to inferno already. At this point, I'm more interested in making sure it's correct than optimizing it

And how often do those vnodes need to be retained? If we can reduce memory usage on updates, I suspect we'll close in even more on Inferno (even if by a small amount). Plus, let's consider our mobile users, for whom memory is a much bigger concern.


My big push has been trying to reduce memory where I can without slowing down overall CPU performance. And IMHO perf gains without tradeoffs are always nice. Even if CPU benchmarks are unaffected, reduced memory -> reduced GC -> improved responsiveness.

FibreFoX commented 7 years ago

There is a small benchmark project which includes Mithril (maybe an older version) and contains some memory checks; an example report is here: https://rawgit.com/krausest/js-framework-benchmark/master/webdriver-ts/table.html

Maybe it would be a good idea to specifically create a test or benchmark for memory overhead?

dead-claudia commented 7 years ago

Anyways, I don't mean to just come here and shoot the whole thing down, so I'm going to add my own brain dump of perf-related things:

1 - First and foremost, I have done ZERO IRHydra work on Mithril. That's the lowest-hanging fruit performance-wise.

Obviously. And there are some key things in the renderer that we can reasonably guess should show up somewhere, even in just the renderer tests, like the create* functions. In fact, I would be shocked if insertNode wasn't inlined in most of its call sites, and it wouldn't be surprising if some of the create* functions were themselves inlined in places.

2 - Mithril doesn't implement the kivi/ivi algorithm for reducing the number of DOM shuffles in large mostly non-sorted sorts. IMHO, this affects very few real use cases, but it is nonetheless a known optimization that has not been implemented.

True, but I agree with you that this really isn't worth implementing.

3 - Mithril currently completely chokes on the asm.js validator. Making it not choke would likely uncover some addressable issues.

Already addressed.


And one more point on 6: this could be moved to the vnode.state.view call (which should've been done in the first place IMHO).

dead-claudia commented 7 years ago

@FibreFoX

Maybe it would be a good idea to specifically create a test or benchmark for memory overhead?

Maybe, but I feel we may have to specially construct that kind of test. The closest I've found to a decent usable benchmark is this (a variant is already checked in here, live version here), but I still feel that's somewhat lacking, since it never adds/removes nodes (just updates with random changes).


Here's what I'd like to see in a benchmark:

Of course, this may require significant optimization to avoid burying the frame rate, so I'd probably write my own PRNG using the xorshift32 algorithm and write parts of the benchmark code in asm.js.
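
For reference, a minimal xorshift32 sketch (the standard Marsaglia algorithm; the seeding and float scaling here are my own choices):

function xorshift32(seed) {
  var x = (seed | 0) || 1 // state must be nonzero
  return function() {
    x ^= x << 13
    x ^= x >>> 17
    x ^= x << 5
    return (x >>> 0) / 4294967296 // uniform in [0, 1)
  }
}

var rand = xorshift32(42)
rand() // deterministic and cheap, with no Math.random() overhead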

Problem is, everyone is making benchmarks that are too simple, but real-world uses require better stress testing than just pushing everything through the same path each time. Oh, and I expect the last test (which combines all of them) could bog down several frameworks, and the only way to optimize for it is by making an adaptive diff/patch algorithm.

leeoniya commented 7 years ago

@isiahmeadows

The closest I've found to a decent usable benchmark is this

js-repaint-perfs is a very poor benchmark, for multiple reasons:

  1. it makes the reconciler's job laughably trivial
  2. it only tests className and textContent/nodeValue updates
  3. its measuring strategy is wildly unstable unless you use localvoid's perf-monitor (which the live version does)

js-framework-benchmark is unambiguously better in every sense. localvoid's uibench and vdom-benchmark are also pretty thorough but the feedback loop is harder to grok.

I use dbmonster @ 0% as a quick-and-dirty test of pure framework overhead and the effect of various optimization experiments. It does help to quickly catch obvious regressions.

lhorie commented 7 years ago

@thysultan I know push and pop get O(1) complexity on a copy operation, but sort, reverse, shift, unshift, splice, etc. can easily hit the worst case (O(n)). The claim here is that adding those potential costs in JS land will be offset by gains in GC land. This claim, to me, falls into the realm of extraordinary claims that require extraordinary evidence.
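
To make the worst case concrete:

// Mirroring data.shift() into a persistent tree A is O(n), because every
// surviving element has to be moved down one slot:
var A = [1, 2, 3, 4, 5]
A.shift() // internally copies A[1..4] into A[0..3]: n - 1 moves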

Benchmarks aren't going to tell you that much, because they usually only render small trees

@isiahmeadows I think you got that backwards. Benchmarks usually render large trees as an attempt to make measurable times big enough to drown out confounding factors. I would argue that rendering thousands of DOM elements is something a real app should actively avoid, if at all possible, because DOM costs increase linearly with element count and no amount of optimization can ever change that. People often talk about benchmarks in mobile, but seriously, why would any real life app want to spend battery chewing through thousands of things that can't possibly fit on a phone screen?

Last I checked, none of the other libraries split this down, precisely for perf reasons. Examples?

I'm not going to hunt down all of them, but here's inferno ( https://github.com/infernojs/inferno/blob/master/packages/inferno/src/DOM/patching.ts#L457 ) and kivi ( https://github.com/localvoid/kivi/blob/master/lib/vnode.ts#L1520 ).

Here are a few more concrete examples of how splitting this up could result in faster and easier-to-maintain code

Honestly, branch prediction has never been an optimization target for any perf work I've done, so I have no idea what magnitude of gains we're talking about here. I've only ever seen it come up in super convoluted micro benches in C. If you want to pursue that, it would be ideal to have some way of validating before/after performance before attempting.

I don't agree with breaking down at the level you highlighted (because it would require the creation of extra temp objects, AND it would make mutations non-local), but I'm amenable to breaking out the payloads when a cursor hits a dead end (e.g. https://github.com/lhorie/mithril.js/blob/next/render/render.js#L177-L180 ). With that said, it's worth remembering that not all target engines might support the types of JIT optimizations you're looking to exploit (looking at you, IE 10).

I just mean these two assignments should be going the other way

I still don't understand. Simply flipping lhs and rhs on those would clobber state rather than keep it. I think you're underestimating what it would mean for vnode to remain referentially the same on every render in the way that vnode.state is. For example, what happens with onbeforeupdate(vnode, old)?

It just means this will never be true

const C = {
  view(vnode) {
    if (C.tree == null) C.parent = vnode
    return C.tree = m("div")
  }
}

m.render(document.body, m(C))
m.render(document.body, m(C))
assert.equal(C.parent, C.tree)

C.parent isn't equal to C.tree. Are you saying it should be?

When you remove the extra properties, you'll see a significant GC drop due to fewer medium-sized allocations

In Chrome? What about Firefox or IE? You're claiming that 4 types can still be optimized at least as well as 1, AND that we get GC gains on top of that, in all engines. Again, I think this is an extraordinary claim and it needs some evidence. FWIW, I've experienced significant slowdowns in FF with as few as two identically-structured objects during my tests last year, so I'm particularly skeptical about this claim.

My big push has been trying to reduce memory

Here's the problem I see with memory-related efforts: I can measure the impact of object allocations in hyperscript micro-benchmarks (hence why the hyperscript code has such a complex memoization scheme), but dbmonster (which is supposed to be a very high stress test) on Chrome never spikes over the initial 10 MB that Chrome sets aside as JS heap memory.

If I put on my C hat, the type of optimizations I do in hyperscript basically translates to looking at incurred costs of malloc/free and then doing things to avoid calling malloc/free. Most of the discussions here are about GC heuristics and CPU branch prediction, which might as well be voodoo magic in terms of how transparent/predictable they are with regards to what the machine is actually doing today, and what it will be doing in version X of browser Y tomorrow.

Problem is, everyone is making benchmarks that are too simple, but real-world uses require better stress testing than just pushing everything through the same path each time.

The irony here is that 99% of the time people complain about perf, they say "oh, I have this 5000-element thing"... Let's keep things in perspective: these proposals are all micro-optimizations, of the 99% effort for 1% gain variety. To address real app problems, you need solutions of an algorithmic/design nature; micro-optimizations don't do squat against linear complexity.

pygy commented 7 years ago

@lhorie as a middle ground, you could keep the old vnodes but the new children arrays (and sorry for missing the point of your first remark). You'd probably still shave down the worst GC pauses. There would be a cost for moving objects around, though a priori a minor one, since both arrays would be in cache at the point where the vnodes must be transferred.
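
A rough sketch of that middle ground, with hypothetical names and a simplified vnode shape:

// Keep the long-lived old vnode objects, but adopt the freshly built
// children array from the new render.
function adoptChildren(old, vnode) {
  for (var i = 0; i < vnode.children.length; i++) {
    var o = old.children[i], n = vnode.children[i]
    if (o != null && n != null && o.tag === n.tag) {
      o.attrs = n.attrs     // refresh the surviving old vnode in place...
      vnode.children[i] = o // ...and slot it into the new array
    }
  }
  old.children = vnode.children // the new array replaces the old one
}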

dead-claudia commented 7 years ago

@lhorie Okay. I stand corrected for most of my ideas.

I'll remain firm on my point that we should at least reuse the initial vnode state through the entire component lifetime, if only to reduce the vnode API's memory footprint during updates.

lhorie commented 7 years ago

we should at least reuse the initial vnode state

Doesn't that already happen? Do you mean in a specific case?

dead-claudia commented 7 years ago

@lhorie I mean specifically this, which currently does not happen AFAICT (note the conditional check):

var vnode

var Comp = {
  oninit(v) { vnode = v }, // capture the vnode from the first render
  view(v) {
    // currently throws on the second render: each render gets a fresh vnode
    if (vnode !== v) throw new Error("Wrong vnode!")
    return m("div", [v.attrs.count])
  }
}

for (var i = 0; i < 10; i++) {
  m.render(document.body, m(Comp, {count: i}))
}