YuriGor / deepdash

eachDeep, filterDeep, findDeep, someDeep, omitDeep, pickDeep, keysDeep etc. — tree traversal library written in Underscore/Lodash fashion
https://deepdash.io/
MIT License
275 stars · 12 forks

Great Refactoring (special Lukas edition) #55

Closed YuriGor closed 4 years ago

YuriGor commented 4 years ago

Ok, it's time to do this:

YuriGor commented 4 years ago

Ok, refactoring done in v5.0.4 :tada:

- ✓ no recursion
- ✓ code split into functions and deduplicated
- ✓ hope it's now a bit less unreadable
- ✗ paths are still strings by default; I decided not to introduce breaking changes this time and will do this in the next major. Instead, I'll mention in the docs that array paths are more performant
- ✗ didn't measure performance before/after - no time for this :( but all the tests seem to pass faster (if it's not just me being slower after not sleeping enough lol)

@simlu enjoy :yum: maybe you will have some time to run a race of our horses again? Last time mine was ~30% slower, what about now?

YuriGor commented 4 years ago

Looks like after refactoring it's a bit slower lol. Will need another performance-focused refactoring session.

simlu commented 4 years ago

@YuriGor Just had a look at your refactoring. That still looks very convoluted :)

A few questions:

Overall I'm still confused as to what exactly the iterate function is supposed to do. It clearly modifies the passed item. Is the modified item then later used somewhere? Or are all the computations on the item just used to call the callback?

simlu commented 4 years ago

I think the biggest performance problem you have (next to having string paths) is your context object. It gets computed and passed all the time, even when it's not needed or accessed by the callback. Take a look at how I've solved this in object-scan here using a JavaScript getter:
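
The getter idea can be sketched roughly like this (hypothetical names, not object-scan's actual code): the expensive field is exposed lazily, so callbacks that never read it pay nothing for it, and the first access caches the result.

```js
// Sketch: build the joined path string only when the callback actually reads it.
function makeContext(segments) {
  return {
    get path() {
      // computed on first access only, then cached as a plain data property
      const joined = segments.join('.');
      Object.defineProperty(this, 'path', { value: joined });
      return joined;
    },
  };
}

const ctx = makeContext(['store', 'book', 0, 'title']);
// no join has happened yet; accessing ctx.path triggers it exactly once
console.log(ctx.path); // 'store.book.0.title'
```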

Ultimately the signature of your callback should be much cleaner. It feels really redundant and convoluted atm.

simlu commented 4 years ago

Fun speed comparison to highlight the performance hit from "joined" strings:

data.json

```json
{
  "store": {
    "book": [
      { "category": "reference", "author": "Nigel Rees", "title": "Sayings of the Century", "price": 8.95 },
      { "category": "fiction", "author": "Evelyn Waugh", "title": "Sword of Honour", "price": 12.99 },
      { "category": "fiction", "author": "Herman Melville", "title": "Moby Dick", "isbn": "0-553-21311-3", "price": 8.99 },
      { "category": "fiction", "author": "J. R. R. Tolkien", "title": "The Lord of the Rings", "isbn": "0-395-19395-8", "price": 22.99 }
    ],
    "bicycle": { "color": "red", "price": 19.95 }
  }
}
```
test.js

```js
const assert = require('assert');
const _ = require('lodash');
const { paths } = require('deepdash')(_);
const objectScan = require('object-scan');

const data = require('./data.json');

const bench = (name, fnRaw, count = 10000) => {
  const fn = fnRaw();
  for (let i = 0; i < count * 10; i += 1) {
    fn();
  }
  console.time(name);
  for (let i = 0; i < count; i += 1) {
    fn();
  }
  console.timeEnd(name);
};

bench('deepdash', () => {
  return () => {
    const r = paths(data);
    assert(r.length === 20);
  };
});

bench('object-scan: joined=true', () => {
  const scanner = objectScan(['**'], { joined: true, filterFn: ({ value }) => !(value instanceof Object) });
  return () => {
    const r = scanner(data);
    assert(r.length === 20);
  };
});

bench('object-scan: joined=false', () => {
  const scanner = objectScan(['**'], { joined: false, filterFn: ({ value }) => !(value instanceof Object) });
  return () => {
    const r = scanner(data);
    assert(r.length === 20);
  };
});
```
simlu commented 4 years ago

Not having the ability to specify paths for the different functions is a huge performance disadvantage for deepdash as can be seen in the next example:

data.json

```json
{
  "title": { "plain": "Send Money" },
  "fieldset": [
    {
      "label": { "plain": "Personal Info Section" },
      "fieldset": [
        {
          "field": [
            { "label": { "plain": "First Name" }, "value": { "plain": "Bob" }, "id": "a_1" },
            { "label": { "plain": "Last Name" }, "value": { "plain": "Hogan" }, "id": "a_2" }
          ],
          "id": "a_8"
        }
      ],
      "id": "a_5"
    },
    {
      "label": { "plain": "Billing Details Section" },
      "fieldset": {
        "field": {
          "choices": { "choice": { "label": { "plain": "Gift" }, "id": "a_17", "switch": "" } },
          "label": { "plain": "Choose a category:" },
          "value": { "plain": "Gift" },
          "id": "a_14"
        },
        "fieldset": {
          "label": { "plain": "" },
          "field": [
            {
              "choices": { "choice": { "label": { "plain": "Other" }, "id": "a_25", "switch": "" } },
              "label": { "plain": "Amount" },
              "value": { "plain": "Other" },
              "id": "a_21"
            },
            { "label": { "plain": "Other Amount" }, "value": { "plain": "200" }, "id": "a_20" }
          ],
          "id": "a_26"
        },
        "id": "a_13"
      },
      "id": "a_12"
    }
  ]
}
```
test.js

```js
const assert = require('assert');
const _ = require('lodash');
const { someDeep } = require('deepdash')(_);
const objectScan = require('object-scan');

const data = require('./data.json');

const bench = (name, fnRaw, count = 10000) => {
  const fn = fnRaw();
  for (let i = 0; i < count * 10; i += 1) {
    fn();
  }
  console.time(name);
  for (let i = 0; i < count; i += 1) {
    fn();
  }
  console.timeEnd(name);
};

bench('deepdash', () => {
  return () => {
    const r = someDeep(data, (value, key, parent, { path }) => {
      return path.endsWith('choices.choice.label.plain') && value === 'Gift';
    });
    assert(r === true);
  };
});

bench('object-scan', () => {
  const scanner = objectScan(['**.choices.choice.label.plain'], {
    filterFn: ({ key, value, context }) => {
      if (value === 'Gift') {
        // eslint-disable-next-line no-param-reassign
        context.r = true;
      }
    },
    breakFn: ({ context }) => {
      return context.r;
    }
  });
  return () => {
    const { r } = scanner(data, { r: false });
    assert(r === true);
  };
});
```
YuriGor commented 4 years ago

My measurements show the slowest part is getScope. Also, having more functions adds some overhead. Will take a closer look at your scripts, thanks.

YuriGor commented 4 years ago

Ok, I merged most of the functions back into the iterate method - it helped save ~5% of my performance measurement set's execution time. I also introduced some extra optimizations, but minor ones too. Now I have time to answer your questions in detail:

- When you invoke iterate you always call it with value=obj and obj=obj. Why do we need the parameter to be passed in twice?

This is left over from the times when it was a recursive function: I needed to always keep a reference to the source obj, while value changed from the source obj to deeper children. I could, of course, default value to obj when depth = 0, but why do this extra check on each iteration when iterate is a private function and I can just pass the argument twice from eachDeep? Now that iterate is not recursive, I can remove this duplication, because I can check these defaults only once - it will cost nothing.
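
The recursion-free shape being described can be illustrated with a minimal sketch (this is not deepdash's actual code, just the general technique): a depth-first walk driven by an explicit stack, collecting array-form paths of leaf values.

```js
// Minimal sketch of stack-based (non-recursive) deep iteration.
function iterate(obj) {
  const leaves = [];
  const stack = [{ value: obj, path: [] }];
  while (stack.length) {
    const { value, path } = stack.pop();
    if (value !== null && typeof value === 'object') {
      // push children instead of recursing into them
      for (const key of Object.keys(value)) {
        stack.push({ value: value[key], path: path.concat(key) });
      }
    } else {
      leaves.push(path);
    }
  }
  return leaves;
}
```

Because everything lives in one loop with one stack, the defaults (like `value = obj` at depth 0) only need to be set up once, before the loop starts.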

- You keep assigning things onto item. Can we somehow remove that? Would be great to make the signature of the invoke very clean: object and options (containing callback) and then iterate over the object. Unless you are not just iterating?

I optionally invoke the callback twice - before and after iterating over the children of each node - so I need to preserve all the calculations and intermediate values and push them onto the stack, so I can pop and reuse them after I'm done with the children. I already tried to trim this down, but maybe I need an additional review of what really needs to be preserved and what can be left to the GC.

- You have so many weird settings on item that are (arguably) not needed: E.g. inited (just keep a stack reference? I'm not sure what the purpose is), scope (why do you need to compute and store this? Shouldn't it just be computed on the fly and passed into the callback)

Partially answered above. inited is just a flag indicating whether I already did all the needed calculations for the current node (i.e. whether it was taken from the stack). If you find some setting that isn't needed - please let me know, but run the tests first :) the specs are pretty complex and all these options came from real use cases.

- I think the biggest performance problem you have (next to having string paths) is your context object. It gets computed and passed all the time.

No, it's not: I create it only if it's needed by the callback - see the needCallback flag calculated beforehand. Also, I tried getters - using them has a price, and in my case this price is higher than just creating the context each time I need it. Most context fields are calculated anyway because they are needed internally; creating a context is just composing an object from already existing values - nothing to save here. And a lot of other methods in Deepdash use this context intensively. I even tried to replace the string-formatted path with a getter, but still, I am considering the performance of the whole library - check the prof folder, this is how I measure it. So the ultimate but costly solution remains: remove built-in support for the string path format and let users convert the path to a string themselves if they need it.

I also had an idea, similar to the getter, to analyze the callback signature and skip creating the context if the callback doesn't expect a context argument. But it's a bit too hacky and also isn't worth it, for the same reasons.
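
For reference, the signature-analysis trick would rely on `Function.prototype.length`, which reports the number of declared parameters. A sketch (hypothetical helper names):

```js
// Sketch: only build the (expensive) context if the callback declares a 4th parameter.
function invokeCallback(cb, value, key, parent, makeContext) {
  if (cb.length >= 4) {
    return cb(value, key, parent, makeContext()); // context actually wanted
  }
  return cb(value, key, parent); // skip the context build entirely
}
```

One reason it is indeed "too hacky": default and rest parameters reduce `fn.length`, so a user writing `(v, k, p, ctx = {}) => …` would silently never receive the context.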

I still need to look at your benches - haven't had a chance yet.

YuriGor commented 4 years ago

Just checked your first bench: in array path format objectScan is lightning fast, but does it mean your toPath is so slow? Compare with this:

```js
bench('deepdash: pathFormat=array', () => {
  return () => {
    const r = paths(data, { pathFormat: 'array' });
    assert(r.length === 20);
  };
});
```

```
deepdash: pathFormat=string: 307.257ms
deepdash: pathFormat=array: 243.978ms
object-scan: joined=true: 342.197ms
object-scan: joined=false: 99.273ms
```

so my pathToString ate ~70ms and yours ~240ms

Is it because escape is called multiple times, once per key? Maybe you need a separate version of toPath for real paths, and use the escaped version for user input only?
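
The suggestion could look something like this sketch (hypothetical names, not object-scan's actual implementation): scan the segments once, and only fall back to per-key escaping when a key actually contains special characters.

```js
// Sketch: cheap join for plain keys, escaped join only when needed.
const NEEDS_ESCAPE = /[.[\]\\]/;

function escapeKey(key) {
  // backslash-escape the characters that are meaningful in a path string
  return String(key).replace(/[.[\]\\]/g, (ch) => '\\' + ch);
}

function joinPath(segments) {
  let plain = true;
  for (const s of segments) {
    if (NEEDS_ESCAPE.test(String(s))) { plain = false; break; }
  }
  if (plain) return segments.join('.'); // fast path: no per-key escaping
  return segments.map(escapeKey).join('.');
}
```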

YuriGor commented 4 years ago

Also a question - I didn't find in your code how you do parallel processing of branches. Do you spawn child processes for this, or use worker threads? How about browser support?

simlu commented 4 years ago

just checked your first bench, in array path format objectScan is lightning fast, but does it mean your toPath is so slow?

I think so; however, we do need escaping. Otherwise we couldn't use the output needles as input for another search. If performance is an issue and no keys that need escaping are present, the user can join paths themselves.

Also a question - I didn't find in your code how you do parallel processing of branches. Do you spawn child processes for this, or use worker threads? How about browser support?

Object-scan does not use multiple processes. What I meant is: there is a pre-processing step that looks at all the passed needles and builds a search tree. We can then use that search tree to match an arbitrary number of needles "in parallel" while traversing the haystack only once at most. This makes the use case of matching multiple needles very fast and guarantees a single traversal (very, very convenient for a lot of use cases).
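
A rough sketch of that idea (greatly simplified: exact-segment needles only, no `**` wildcards, and not object-scan's actual code): compile the needles into a tree once, then walk the haystack and the tree together, so the haystack is visited at most once no matter how many needles there are.

```js
// Sketch: compile needles into a trie, then match in a single traversal.
const END = '**end**'; // sentinel marking "a needle terminates here"

function buildTree(needles) {
  const root = {};
  for (const needle of needles) {
    let node = root;
    for (const seg of needle.split('.')) {
      node = node[seg] = node[seg] || {};
    }
    node[END] = true;
  }
  return root;
}

function matchNode(tree, obj, path, out) {
  for (const key of Object.keys(tree)) {
    if (key === END) { out.push(path.join('.')); continue; }
    if (obj !== null && typeof obj === 'object' && key in obj) {
      matchNode(tree[key], obj[key], path.concat(key), out);
    }
  }
  return out;
}
```

Adding a second needle only widens the tree; it does not cause a second pass over the haystack.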

YuriGor commented 4 years ago

Finally, v5.0.6 is almost 2x faster (in some cases) than it used to be before the refactoring (5.0.3). Only the core iterate method was optimized, while other methods are sometimes also pretty heavy, so there is still room for optimization.

The parents collection in the context callback arg was converted to a getter that builds it on demand only. Reading it now may not be as fast as it used to be, but as far as I've seen, people rarely (= I never saw it) use this property.
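
The shape of that change can be sketched like so (hypothetical, assuming the internal ancestor frames are already kept on the traversal stack): the array is assembled only on first read and then cached.

```js
// Sketch: `parents` assembled lazily from internal frames, memoized after first read.
function makeCtx(frames /* ancestor frames the traversal keeps anyway */) {
  let cached;
  return {
    get parents() {
      if (!cached) cached = frames.map((f) => f.value); // built on demand
      return cached;
    },
  };
}
```

Callbacks that never touch `parents` pay nothing; callbacks that do pay once per node.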

An especially good point: I found a way to build the text version of the path faster by analyzing not only the path itself but also context information available at build time, so now the difference between the array and string path formats is not so significant.
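
One plausible reading of that optimization (an assumption on my part, not a quote of the actual code; escaping of unusual keys is elided): since the parent's string path is already known when a child is visited, appending a single segment is cheaper than re-joining the whole array path for every node.

```js
// Sketch: extend the parent's already-built string path by one segment.
function childPathString(parentPath, key) {
  if (parentPath === '') return String(key);
  return typeof key === 'number'
    ? parentPath + '[' + key + ']' // lodash-style bracket notation for indexes
    : parentPath + '.' + key;
}
```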

That's enough for now; there are more important tasks waiting than hunting for milliseconds, so I need to make myself stop doing this.