ViacomInc / data-point

JavaScript Utility for collecting, processing and transforming data
Apache License 2.0

Resolve path reducers synchronously #245

Open raingerber opened 6 years ago

raingerber commented 6 years ago

Problem description:

Every reducer/entity is asynchronous, which makes them simpler to implement and reason about, but this comes with performance drawbacks. Take this for example:

const DataPoint = require('data-point')
const dataPoint = DataPoint.create()

const { map } = DataPoint.helpers

const reducer = map('$a')

const input = [
  { a: 1 },
  { a: 2 }
  // pretend there's, like, 1000 more elements in the array
]

dataPoint.resolve(reducer, input)

This operation could be synchronous, but DataPoint will resolve it using Promise.map, and the result from each $a reducer will be wrapped in a promise (which is much slower than using Array.map without promises).
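To make the difference concrete, here is a minimal sketch that does not use DataPoint at all. `getA`, `mapAsync`, and `mapSync` are hypothetical names, and native `Promise.all` stands in for Bluebird's `Promise.map` so the snippet has no dependencies:

```javascript
// Hypothetical stand-in for the '$a' path reducer: a synchronous property read.
const getA = (item) => item.a

// Promise-based mapping: every element's result is wrapped in a promise.
// (Native Promise.all is used here in place of Bluebird's Promise.map.)
function mapAsync (items, fn) {
  return Promise.all(items.map((item) => Promise.resolve(fn(item))))
}

// Plain synchronous mapping with Array.map, no promises involved.
function mapSync (items, fn) {
  return items.map(fn)
}

const input = Array.from({ length: 1000 }, (_, i) => ({ a: i + 1 }))

// The result is available immediately.
const syncResult = mapSync(input, getA)

// The same values, but only available on a later tick,
// after 1000 intermediate promises have been created and settled.
mapAsync(input, getA).then((asyncResult) => {
  console.log(asyncResult.length === syncResult.length)
})
```

Both versions produce the same array; the async one just pays for promise allocation and scheduling on every element.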

Suggested solution:

Reducers should have an __async__ boolean property to indicate if the resolution should be asynchronous. This would apply mainly to path reducers (where __async__ would always be false), as well as reducers that can use path reducers internally. For example, this pseudo code would apply to reducer lists:

// reducer-list/factory.js

function create (source) {
  const reducer = createListReducer(source)
  reducer.__async__ = !reducer.reducers.every(r => r.__async__ === false)
  return reducer
}

// reducer-list/resolve.js

function resolve (reducer, input) {
  if (reducer.__async__) {
    // uses Promise.map
    return resolveAsync(reducer, input)
  }

  // uses Array.map
  return resolveSync(reducer, input)
}

There would be similar code for reducer objects, map, filter, find, assign, and parallel. This would also require a change to the resolveReducer function in reducer-types/resolve.js, because we can't assume that every reducer returns a promise (and the resolve function for omit, pick, and constant should no longer wrap their return values with Promise.resolve). This would be the most difficult change.
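A self-contained sketch of how the flag could propagate, for illustration only. The reducer shapes and the `createPathReducer`, `createFunctionReducer`, and `resolveList` helpers are hypothetical and are not DataPoint's real internals; the sketch also assumes a reducer list applies its reducers in sequence:

```javascript
// Hypothetical path reducer: always synchronous.
function createPathReducer (key) {
  return { type: 'path', resolve: (input) => input[key], __async__: false }
}

// Hypothetical function reducer: user code may return a promise,
// so it is conservatively treated as asynchronous.
function createFunctionReducer (fn) {
  return { type: 'function', resolve: fn, __async__: true }
}

// A list is synchronous only when every reducer inside it is synchronous.
function createListReducer (reducers) {
  const reducer = { type: 'list', reducers }
  reducer.__async__ = !reducers.every((r) => r.__async__ === false)
  return reducer
}

function resolveList (reducer, input) {
  if (reducer.__async__) {
    // async path: chain every step through a promise
    return reducer.reducers.reduce(
      (acc, r) => acc.then((value) => r.resolve(value)),
      Promise.resolve(input)
    )
  }
  // sync path: plain function calls, returns a plain value
  return reducer.reducers.reduce((value, r) => r.resolve(value), input)
}

const syncList = createListReducer([
  createPathReducer('a'),
  createPathReducer('b')
])
const mixedList = createListReducer([
  createPathReducer('a'),
  createFunctionReducer((value) => Promise.resolve(value.b * 2))
])

const data = { a: { b: 21 } }
const syncValue = resolveList(syncList, data) // a plain value
const asyncValue = resolveList(mixedList, data) // a promise
```

Note that callers of `resolveList` now have to handle both a plain value and a promise, which is the `resolveReducer` problem described above.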

Benchmarks

These are benchmarks from running a sync vs async version of map('$a') on an array with 100 elements. There are... substantial differences:

Benchmarking: async [ASYNC]

 1 async x 1,810 ops/sec ±2.47% (77 runs sampled)
 2 async x 1,854 ops/sec ±2.58% (77 runs sampled)
 3 async x 1,749 ops/sec ±3.71% (73 runs sampled)
 4 async x 1,816 ops/sec ±2.32% (76 runs sampled)
 5 async x 1,746 ops/sec ±2.92% (73 runs sampled)

 Ran async (5 times) with an average of 1,795 ops/sec
  Fastest: 1,854 ops/sec
  Average: 1,795 ops/sec
  Median : 1,779 ops/sec
  Slowest: 1,745 ops/sec

Benchmarking: sync [SYNC]

 1 sync x 151,837 ops/sec ±2.29% (81 runs sampled)
 2 sync x 150,352 ops/sec ±3.58% (79 runs sampled)
 3 sync x 158,823 ops/sec ±1.35% (84 runs sampled)
 4 sync x 150,055 ops/sec ±3.17% (78 runs sampled)
 5 sync x 151,165 ops/sec ±1.80% (81 runs sampled)

 Ran sync (5 times) with an average of 152,446 ops/sec
  Fastest: 158,822 ops/sec
  Average: 152,446 ops/sec
  Median : 150,758 ops/sec
  Slowest: 150,055 ops/sec

Report:
 Speed: sync was faster by 98.82% (150,758Hz vs 1,779Hz)

NOTE: these benchmarks were done with bench-trial, but the math doesn't look correct -- 150,758 is much more than 98.82% faster than 1,779...

paulmolluzzo commented 6 years ago

What a find! Could we implement support for the __async__ property into reducers one by one or would we need to add it to all in a single PR? (Thinking smaller PRs are easier to get done and released.)

Report: Speed: sync was faster by 98.82% (150,758Hz vs 1,779Hz)

😳

150,758 is much more than 98.82% faster than 1,779...

Wouldn’t it be (150,758 - 1,779) / 1,779 = 83.7x faster (8,370%)? 🤔

I’m not sure about this equation: https://github.com/ViacomInc/data-point/blob/master/packages/bench-trial/runner.js#L155

Edit: either fixed my math or broke it or was wrong both times.
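For reference, a quick check of the arithmetic using the two medians from the report. Dividing the difference by the larger value reproduces the 98.82% figure; whether that is what bench-trial's runner actually does is an assumption, not something verified against its source:

```javascript
const syncHz = 150758
const asyncHz = 1779

// Difference relative to the FASTER value: reproduces the reported figure.
const relativeToFaster = ((syncHz - asyncHz) / syncHz) * 100

// Difference relative to the SLOWER value: "how much faster is sync?"
const relativeToSlower = ((syncHz - asyncHz) / asyncHz) * 100

// Plain throughput ratio.
const ratio = syncHz / asyncHz

console.log(relativeToFaster.toFixed(2)) // 98.82
console.log(relativeToSlower.toFixed(0)) // 8374
console.log(ratio.toFixed(1)) // 84.7
```

So sync runs at roughly 84.7 times the throughput of async, which is about 83.7 times (8,374%) faster.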

raingerber commented 6 years ago

Good point - we can modify the reducers one by one, but the first step is adding support for __async__ in the main resolveReducer function:

// simplified version of this function
function resolveReducer (manager, accumulator, reducer) {
  const resolve = getResolveFunction(reducer)
  // the problem is that "resolve" MIGHT return a promise, but might not
  const result = resolve(manager, resolveReducer, accumulator, reducer)
  return result
}

These are possible solutions (I haven't tested how they'll affect performance though):

  1. async / await - a simple, idiomatic solution that we don't officially support 😞

  2. Using callbacks:

// adds an error first "done" callback
function resolveReducer (manager, accumulator, reducer, done) {
  const resolve = getResolveFunction(reducer)
  const result = resolve(manager, resolveReducer, accumulator, reducer)
  // adding the "__async__" property to the resolve
  // function instead of the reducer instance...
  if (resolve.__async__) {
    // in this case, "result" will be a promise
    result.asCallback(done)
  } else {
    done(null, result)
  }
}

Note that the performance boost wouldn't come directly from this change (and I'm not sure how this function's performance would be affected). The benefit would come after changing the resolve functions for reducers like map, filter, etc., so they resolve synchronously when possible.
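For comparison, option 1 (async / await) could look roughly like the sketch below. `getResolveFunction` and the two reducers are hypothetical stand-ins so the snippet runs on its own; only the `resolve(manager, resolveReducer, accumulator, reducer)` call shape is taken from the simplified function above:

```javascript
// Hypothetical stand-in: the real lookup is not shown in this thread.
function getResolveFunction (reducer) {
  return reducer.resolve
}

async function resolveReducer (manager, accumulator, reducer) {
  const resolve = getResolveFunction(reducer)
  // "await" accepts both promises and plain values,
  // so it does not matter which one "resolve" returns
  return await resolve(manager, resolveReducer, accumulator, reducer)
}

// Hypothetical reducers: one returns a plain value, one returns a promise.
const syncReducer = {
  resolve: (manager, resolveReducer, accumulator) => accumulator.value.a
}
const asyncReducer = {
  resolve: (manager, resolveReducer, accumulator) =>
    Promise.resolve(accumulator.value.a + 1)
}

const accumulator = { value: { a: 1 } }

Promise.all([
  resolveReducer(null, accumulator, syncReducer),
  resolveReducer(null, accumulator, asyncReducer)
]).then((results) => console.log(results))
```

An async function always returns a promise, so this only removes the "might be a promise, might not" ambiguity at this level. The top-level call stays asynchronous, and any speedup would still have to come from the individual resolve functions.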

raingerber commented 6 years ago

Also, I think you have the right equation for bench-trial; pre-approving that change 😊