facebook / yoga

Yoga is an embeddable layout engine targeting web standards.
https://yogalayout.dev/
MIT License

[JS] Reduce WASM call overhead #1545

Open Omxjep3434 opened 10 months ago

Omxjep3434 commented 10 months ago

The problem

Yoga API calls appear to be relatively expensive because of WASM overhead. I've noticed that reading all computed values of a tree takes almost as long as the initial Yoga layout calculation itself. Correct me if I'm wrong, but I believe this is due to the barrier between JS and WASM. Reading all computed values requires 13 API calls per node: getComputedPadding, getComputedMargin, and getComputedBorder must each be called four times (once per edge), plus a final call to getComputedLayout.

My use case

Every time Yoga executes, I must get all computed values and transfer them to my custom layout system.

Benchmarks

I did a few benchmarks by setting up a tree with N nodes and then reading the computed values of each node. Environment: Windows 10, Chrome 120.0.6099.200, Yoga 2.0.1. Firefox shows lower absolute times, but the ratio between Yoga execution and reading the computed results is the same, if not worse.

100 nodes:

625 nodes:

2,500 nodes:

API proposal

I'm proposing a general getComputedValues() method that returns an object containing all computed values for all edges, so a node's computed values can be retrieved with a single API call. To make the method more versatile, parameters could specify which values and edges to return. Perhaps a 'recursive' option could be added as well, so the computed values of an entire subtree could be returned with one API call.
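To illustrate the shape, here's a userland sketch of such a getComputedValues() (names and options are hypothetical, not part of Yoga's API today). This version still pays the 13 underlying WASM calls per node; a native binding would replace them with a single call returning the same structure:

```javascript
// Hypothetical sketch of the proposed API; this userland version still makes
// the 13 underlying calls per node, but shows the combined return shape.
// Edge indices follow Yoga's Edge enum: Left=0, Top=1, Right=2, Bottom=3.
const EDGES = { left: 0, top: 1, right: 2, bottom: 3 };

function getComputedValues(node, { recursive = false } = {}) {
  // Read one box property (padding/margin/border) for all four edges.
  const box = (getter) => {
    const out = {};
    for (const [name, edge] of Object.entries(EDGES)) out[name] = getter(edge);
    return out;
  };
  const result = {
    padding: box((e) => node.getComputedPadding(e)),
    margin: box((e) => node.getComputedMargin(e)),
    border: box((e) => node.getComputedBorder(e)),
    layout: node.getComputedLayout(),
  };
  if (recursive) {
    result.children = [];
    for (let i = 0; i < node.getChildCount(); i += 1)
      result.children.push(getComputedValues(node.getChild(i), { recursive: true }));
  }
  return result;
}
```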

nicoburns commented 10 months ago

This seems like a sensible proposal to me. Crossing the WASM boundary is indeed expensive. And in my testing, batching calls helps significantly (esp. in Firefox, which was already faster than Chrome). I have some performance experiments with WASM builds of Taffy and Yoga on this branch https://github.com/DioxusLabs/taffy/pull/394

It's also the case that Chrome's WASM implementation is just generally not very optimised compared to the one in Firefox. So if you're testing in Chrome then there is probably room for optimisation in the WASM-implementation itself.

NickGerleman commented 10 months ago

Wow, that’s a lot of overhead.

One more layer we have in here is “embind”. https://emscripten.org/docs/porting/connecting_cpp_and_javascript/embind.html

I’ve been keen on removing it for some unrelated reasons (https://github.com/facebook/yoga/issues/1507), but a solution for both of these might be to move the bindings to Emscripten direct function calls. https://emscripten.org/docs/porting/connecting_cpp_and_javascript/Interacting-with-code.html#interacting-with-code-direct-function-calls

I think this might lower call/bridging overhead, and the Yoga public API being C should have nice mappings for this most of the time. But I haven’t measured if this has a real impact.
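As a rough illustration, direct calls through Emscripten's cwrap might look like the sketch below. This is a hedged sketch, not how Yoga's bindings work today: it assumes a build that exports Yoga's C symbols (e.g. via EXPORTED_FUNCTIONS) and an Emscripten Module object providing cwrap; a YGNodeRef is just a number on the JS side.

```javascript
// Hedged sketch of bypassing embind via Emscripten direct function calls.
// Assumes Module is an Emscripten module with Yoga's C API exported.
function makeLayoutReaders(Module) {
  // cwrap returns a JS wrapper around the exported C function.
  const getLeft = Module.cwrap("YGNodeLayoutGetLeft", "number", ["number"]);
  const getTop = Module.cwrap("YGNodeLayoutGetTop", "number", ["number"]);
  const getWidth = Module.cwrap("YGNodeLayoutGetWidth", "number", ["number"]);
  const getHeight = Module.cwrap("YGNodeLayoutGetHeight", "number", ["number"]);

  // nodePtr is a raw YGNodeRef (a pointer, i.e. a number in JS).
  return (nodePtr) => ({
    left: getLeft(nodePtr),
    top: getTop(nodePtr),
    width: getWidth(nodePtr),
    height: getHeight(nodePtr),
  });
}
```

Whether this actually beats embind per-call would need measuring, per the caveat above.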

NickGerleman commented 10 months ago

Would you be able to share the code used for the example where the chunkier accessor improved performance?

NickGerleman commented 10 months ago

If you are using Yoga in a case where your source may mutate over time, incremental Yoga layout can avoid the majority of the traversal, both when calculating and when later reading layout results.

But we are missing a trivial binding for the API for this: https://github.com/facebook/yoga/issues/681
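If that binding existed, the read-back traversal could skip unchanged subtrees. A hypothetical sketch, assuming the C API's "has new layout" flag were exposed to JS as hasNewLayout()/markLayoutSeen() (it is not bound today) and that the flag is set on any node whose subtree changed:

```javascript
// Hypothetical sketch: prune the read-back traversal using a dirty flag.
// hasNewLayout()/markLayoutSeen() are assumed bindings, not current API.
function readDirty(node, out) {
  if (!node.hasNewLayout()) return; // subtree unchanged: skip it entirely
  out.push(node.getComputedLayout());
  node.markLayoutSeen(); // clear the flag so the next pass skips this node
  for (let i = 0; i < node.getChildCount(); i += 1)
    readDirty(node.getChild(i), out);
}
```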

Omxjep3434 commented 10 months ago

Here's some more info about my benchmark. I updated my original post as well. Environment: Windows 10, Chrome 120.0.6099.200, Yoga 2.0.1. The benchmarks in Firefox have lower times, but the ratio between Yoga execution and getting the computed results is the same, if not worse.

If you could double-check the results, that would be great. Hopefully there isn't something in my environment or configuration skewing them.

Benchmark source code:

// Assumes the yoga-layout npm package's async entry point and Edge enum.
import { loadYoga, Edge } from "yoga-layout/load";

async function main() {
  const yoga = await loadYoga();
  postProcessingBenchmark(yoga, 2500);
}

function postProcessingBenchmark(yoga, numChildren) {
  const root = createRoot(yoga, numChildren);
  let start;

  start = performance.now();
  root.calculateLayout(10, 10);
  console.log("Yoga computation: ", performance.now() - start);
  start = performance.now();
  postProcessingLoop(root);
  console.log("Getting computed values: ", performance.now() - start);
}

function createRoot(yoga, numChildren) {
  const root = yoga.Node.create();
  root.setWidth(10);
  root.setHeight(10);

  for (let i = 0; i < numChildren; i += 1) {
    const node = yoga.Node.create();
    root.insertChild(node, i);
  }

  return root;
}

function postProcessingLoop(node) {
  // Reading every computed value costs 13 WASM calls per node, plus the
  // getChildCount()/getChild() calls made while traversing.
  const result = {
    paddingLeft: node.getComputedPadding(Edge.Left),
    paddingRight: node.getComputedPadding(Edge.Right),
    paddingTop: node.getComputedPadding(Edge.Top),
    paddingBottom: node.getComputedPadding(Edge.Bottom),

    marginLeft: node.getComputedMargin(Edge.Left),
    marginRight: node.getComputedMargin(Edge.Right),
    marginTop: node.getComputedMargin(Edge.Top),
    marginBottom: node.getComputedMargin(Edge.Bottom),

    borderLeft: node.getComputedBorder(Edge.Left),
    borderRight: node.getComputedBorder(Edge.Right),
    borderTop: node.getComputedBorder(Edge.Top),
    borderBottom: node.getComputedBorder(Edge.Bottom),

    computedLayout: node.getComputedLayout(),
  };

  for (let i = 0; i < node.getChildCount(); i += 1)
    postProcessingLoop(node.getChild(i));

  return result;
}

Omxjep3434 commented 9 months ago

I read the 'embind' documentation you posted, Nick.

From the performance section at the bottom of the page:

"The call overhead for simple functions has been measured at about 200 ns. While there is room for further optimisation, so far its performance in real-world applications has proved to be more than acceptable."

This lines up with what we're seeing: 200 ns × 2,500 nodes × 13 API calls = 6.5 ms. That figure doesn't even include the getChildCount() and getChild() calls made while traversing the tree, which add more time on top. It doesn't explain all of the measured time, but it's certainly a ton of overhead.
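A quick sanity check of that arithmetic:

```javascript
// Back-of-envelope check of the embind overhead estimate quoted above.
const perCallNs = 200;   // measured embind call overhead from the docs
const nodes = 2500;      // benchmark tree size
const callsPerNode = 13; // 4 padding + 4 margin + 4 border + 1 layout
const totalMs = (perCallNs * nodes * callsPerNode) / 1e6; // ns -> ms
```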