Change the implementation of call tree inversion

mstange commented 7 years ago

Inverting this profile is pretty slow: https://share.firefox.dev/3yivFDy

Currently, we go from stacks to inverted stacks to funcStacks.

It might be better to go from stacks to funcStacks, i.e. have non-inverted funcStack info, and then do the inversion when constructing the tree.

┆Issue is synchronized with this Jira Task

mstange commented 7 years ago

To get the roots, find all funcStacks that have at least one sample that points directly to that funcStack. Group those by func. Each group is a root node. To get the children of a node, collect the prefixFuncStacks of all the funcStacks of that node, and then group those by func. Each of those groups is a child node.

This approach requires less work upfront (invertStacks can be quite expensive), but creates more garbage in the process. At least I think it does - I'm not sure if there's a way to do the collection and grouping without having lots of JS objects involved in the process.

gregtatum commented 7 years ago

It would be good to have a canonical FuncStackTable that the indexes would stay the same, that way we can refer to funcStacks across various filtering operations.

gregtatum commented 6 years ago

I think this change would break all of the data transformation work. I'm going to tentatively close it as no longer relevant. Feel free to re-open if you disagree @mstange

mstange commented 6 years ago

I think the inversion still happens at the end of the transform pipeline, so most of what I wrote above should still apply. I still think this is worth doing.

gregtatum commented 6 years ago

Please test any patches off of the example inverted profile from #825:

Viewing https://perf-html.io/public/60310bb4ba148ca93a5f7af34f6bb202441210e2/calltree/?hiddenThreads=&invertCallstack&search=fbcdn.net&thread=3&threadOrder=0-2-3-1&v=2

gives the following profile:

https://perfht.ml/2GZbqxI

mstange commented 6 years ago

Here are some more thoughts on this.

The goal is to use the callNodeTable of the non-inverted profile for the display of the inverted tree.

Example:

Non-inverted callNode tree:

[callNodeIndex] [functionName]         [totalTime]  [selfTime]

0 (root)                               25           0
- 1 main                               25           0
  - 2 runLoop                          25           10
    - 3 appEvent                       8            1
      - 4 appNetworkEventHandler       4            2
        - 5 queueEvent                 2            1
          - 6 vecPush                  1            1
      - 7 appWorkFinishedEventHandler  3            3
    - 8 nativeEvent                    7            1
      - 9 nativeClickEventHandler      4            1
        - 10 queueEvent                3            2
          - 11 vecPush                 1            1
      - 12 nativeKeyEventHandler       2            1
        - 13 vecPush                   1            1

Inverted tree:

[rootCallNodeIndex] [nodeCallNodeIndex]   [time]
2 2 runLoop                               10
- 2 1 main                                10
  - 2 0 (root)                            10
5 5 queueEvent                            3
- 10 9 nativeClickEventHandler            2
  - 10 8 nativeEvent                      2
    - 10 2 runLoop                        2
      - 10 1 main                         2
        - 10 0 (root)                     2
- 5 4 appNetworkEventHandler              1
  - 5 3 appEvent                          1
    - 5 2 runLoop                         1
      - 5 1 main                          1
        - 5 0 (root)                      1
6 6 vecPush                               3
- 6 5 queueEvent                          2
  - 6 4 appNetworkEventHandler            1
    - 6 3 appEvent                        1
      - 6 2 runLoop                       1
        - 6 1 main                        1
          - 6 0 (root)                    1
  - 11 9 nativeClickEventHandler          1
    - 11 8 nativeEvent                    1
      - 11 2 runLoop                      1
        - 11 1 main                       1
          - 11 0 (root)                   1
- 13 12 nativeKeyEventHandler             1
  - 13 8 nativeEvent                      1
    - 13 2 runLoop                        1
      - 13 1 main                         1
        - 13 0 (root)                     1
...

Each tree node in the inverted tree is uniquely identified by the callPath to it, i.e. the list of funcs you need to follow from a function that has self time, to the node. For example, the node 6 5 queueEvent in the inverted tree is identified by the call path vecPush <- queueEvent, and the node 5 2 runLoop is identified by the call path queueEvent <- appNetworkEventHandler <- appEvent <- runLoop.

The roots of the inverted tree are all the funcs for which there exist a callNode with non-zero self time. For example, the vecPush root represents the three callNodes of the func vecPush with non-zero self time: 6, 11, and 13.

Each tree node in the inverted tree can also be expressed using two call nodes from the non-inverted tree:

Start with the call path that's represented by that tree node. Let n be the length of this call path.
Find all callNodes with non-zero self time whose n-long prefix path matches this call path in the non-inverted tree. Of these callNodes, take the one with the lowest index and call it rootCallNodeIndex. (Here, "root" means "root in the inverted tree".)
From rootCallNodeIndex, in the non-inverted tree, go along the call path to the start of the call path. Call the callNode at the start of the path nodeCallNodeIndex.

Now use the pair (rootCallNodeIndex, nodeCallNodeIndex) to represent this node in the inverted tree. Going from this index pair back to the call path is easier: You look up rootCallNodeIndex in the non-inverted tree and follow its prefix path until you hit nodeCallNodeIndex. The list of funcs you encounter on this path forms the call path.

Example 1: For the node vecPush <- queueEvent in the inverted tree, we can find two call nodes in the non-inverted tree that match this path: 6 and 11. (The 2-long prefix paths of both nodes is queueEvent -> vecPush.) 6 is the lower of the two. The call node path queueEvent -> vecPush which ends at callNode 6 starts at callNode 5. So we represent the node vecPush <- queueEvent in the inverted tree as the index pair (6, 5).

Example 2: vecPush <- queueEvent <- nativeClickEventHandler <- nativeEvent. There is only one call node in the non-inverted tree that matches this path, callNode 11. Here, the path starts at node 8. Represent the node vecPush <- queueEvent <- nativeClickEventHandler <- nativeEvent as the index pair (11, 8).

Example 3: Go from the index pair (10, 1) to the callPath. CallNode 10 is 10 queueEvent. Its prefix path is queueEvent <- nativeClickEventHandler <- nativeEvent <- runLoop <- main ... and we stop at main because that corresponds to callNode 1 main, callNode index 1.

[to be continued]

mstange commented 6 years ago

It turns out the canonical index computation is not worth it - computing this index takes a long time and we have to do it for many nodes. Using a stringified call path as the node ID works much better.

I have some of this working in a pretty messy branch in my repo: https://github.com/devtools-html/perf.html/compare/master...mstange:fast-inverted-tree?expand=1#diff-d973fedad79fc02f8035c8c4a63d5b3c

Here's my plan of attack for doing this properly:

[ ] Introduce a new type CallTreeNodeID which is different from IndexIntoCallNodeTable:
- [ ] Use that type in the CallTree API.
  - [ ] Figure out when to use null and -1 in the CallTree API.
  - [ ] Make sure that those edge cases still work as expected, e.g. pressing "left arrow" when a root node is selected should not move the selection.
- [ ] Remove all assumptions about CallTreeNodeID and IndexIntoCallNodeTable being the same:
  - [ ] Provide conversion functions on the CallTree.
  - [ ] Use those conversion functions in the selectors such as getSelectedCallNodeIndex.
  - This requires that selector to depend on the getCallTree selector, which depends on the range selection. So getSelectedCallNodeIndex (and, more importantly, getExpandedCallNodeIndexes) would be recomputed whenever the range selection changes, which is undesirable. So:
  - [ ] Split the call tree into two objects, one which is independent on the samples in the range (but knows how to do CallTreeNodeID <-> CallNodePath conversion) and one that does depend on the samples (and computes costs per node)?
  - [ ] Make the context menu operate on selectedFunc and selectedCallNodePath instead of selectedCallNodeIndex
  - [ ] Make the flame graph and its canvas keep around the tree object so that it can look up the func and the depth for a given call tree node ID.
  - [ ] For the ThreadStackGraph, maybe have a selector that exposes an array of bools that says whether sample i matches the currently selected call path
  - [ ] For keeping the selection in a good place when inverting the call tree, expose a getHeaviestStackUnderCallNodePath method on the call tree, instead of doing manual computations in invertCallNodePath.
- [ ] Define CallTreeNodeID as a string instead of a number, and insert the necessary conversions. This will slightly decrease performance for the non-inverted tree.
- [ ] Create an InvertedCallTree class and use it for inverted call trees:
- [ ] Stop computing an inverted stack table in the selector, create the InvertedCallTree with the non-inverted CallNodeTable.
- [ ] Turn CallTree into a flow interface and rename the existing CallTree class to something else, maybe NonInvertedCallTree.
- [ ] Implement InvertedCallTree in a similar way to how I did it in my branch:
  - CallTreeNodeIDs are string concatenations of the call node path (path.join('-'))
  - For every node, the tree stores a list of callNodes which are the call nodes from the non-inverted tree that contributed time to this subtree of the inverted call tree.

It would be nice to be able to use different node ID types the two tree types: integers for the non-inverted tree and strings for the inverted tree. But I don't think flow would make that easy - there's only one getSelectedCallNodeIndex selector, for example, and it needs to be clear about what it wants to return. And it can't return string | number because then, both trees would need to be able to handle both ID types.

Here's the CallTree type definition that we should have at the end (or something similar):

export type CallTree = Tree<CallTreeNodeID, CallTreeNodeDisplayData> & {
  preloadChildrenCache(): void,
  getNodeData(node: CallTreeNodeID): CallNodeData,
  getTimingDisplayData(CallTreeNodeID: CallTreeNodeID): Object,
  getCallNodePathFromNodeIndex(
    CallTreeNodeID: CallTreeNodeID | null
  ): CallNodePath,
  getNodeIndexFromCallNodePath(
    callNodePath: CallNodePath
  ): CallTreeNodeID | null,
  getNodeIndicesFromCallNodePaths(
    callNodePaths: Array<CallNodePath>
  ): Array<CallTreeNodeID | null>,
  getAllDescendants(node: CallTreeNodeID): Set<CallTreeNodeID>,
  getHeaviestStackUnderCallNodePath(callNodePath: CallNodePath): CallNodePath,
};

julienw commented 6 years ago

I feel like there are 2 things convoluted in one:

change the implementation of computing an inverted tree
make the inverted tree a "view" of the non-inverted tree -- that is, no inverted Thread object, and no inverted CallNodeTable.

Can we discuss about the good and bad consequences of these 2 changes ?

Especially, can't we change the implementation and keep the architecture we have ? My biggest concern is that we have a lot of code depending on this architecture, and that do not depend on a CallTree object (the latest being the sidebar implementation).

mstange commented 6 years ago

I'm pretty sure that the second part is required for the first part to have acceptable performance. The main reason why this implementation improves performance is the fact that we don't compute all nodes in the inverted tree upfront. We only compute the nodes we need to display.

The implementation isn't fundamentally different. It's just more lazy. So you can't just plug it into the existing architecture, which relies on having a complete CallNodeTable and an integer index for each node in the inverted tree.

mstange commented 6 years ago

Oh, there's one other important difference, apart from the laziness: The inversion operates on the CallNodeTable instead of on the StackTable of the non-inverted tree.

It might be worth trying that part out on its own: Instead of going regularStackTable -> invertedStackTable -> invertedCallNodeTable, go regularStackTable -> regularCallNodeTable -> invertedCallNodeTable.

gregtatum commented 4 years ago

See #1078

mstange commented 3 years ago

We should also make it so that the thread graphs are drawn based on the non-inverted thread. At the moment, if you check the "Invert call tree" checkbox, we compute all inverted threads, not just the selected thread.

julienw commented 3 years ago

Thanks, I finally filed #3313 about these 2 last comments. This sounds like a good self-contained idea to try first.

mstange commented 1 year ago

I found a profile which is slow to invert even on my fast machine: https://share.firefox.dev/3yivFDy (and I've edited the first comment of this issue to link to it)

mstange commented 1 year ago

It might be worth trying that part out on its own: Instead of going regularStackTable -> invertedStackTable -> invertedCallNodeTable, go regularStackTable -> regularCallNodeTable -> invertedCallNodeTable.

I gave this a try and it didn't help much. It cut down the inversion time from 40 seconds to 18 seconds on this profile, but we really want to go to 200ms or less. So I think some kind of lazy solution is necessary. I'll think about this some more.

mstange commented 11 months ago

It cut down the inversion time from 40 seconds to 18 seconds on this profile, but we really want to go to 200ms or less.

Haha, I set the goal pretty high here! But I'm happy to report that I achieved it: In the deploy preview in #4806, the profile above now inverts in 198ms. 🎉 I'll work on polishing up the patches and splitting them into reviewable pieces now.

firefox-devtools / profiler

Change the implementation of call tree inversion #337