Performance: early return/abort of traversal algorithm

bpmn-io / bpmnlint

Validate BPMN diagrams based on configurable lint rules.

MIT License

121 stars 36 forks source link

Performance: early return/abort of traversal algorithm #53

Closed adroste closed 3 years ago

adroste commented 3 years ago

Is your feature request related to a problem? Please describe

I want to be able to abort the traversal process early. Proposed solution is a performance optimization if a rule does not require a full traversal.

Describe the solution you'd like

Change line 9 in lib/traversal.js:

module.exports = function traverse(element, fn) {
  if (fn(element))
    return;

  // ...
};

Then you can abort traversal early by returning true, e.g.:

function check(node, reporter) {
  // do stuff ...

  if (something)
    return true; // stops traversal

  // ...
}

Describe alternatives you've considered

Change nothing and let the algorithm do its thing even though it costs performance.

Caveats

Proposed solution would not work with promises. But I think this is negligible.

nikku commented 3 years ago

In most cases users check for multiple rules. In this case I do not see how this is supposed to be a performance improvement.

Are you already familiar with how our traversal algorithm / rule invocation works?

adroste commented 3 years ago

Yes, it's a simple dfs that calls the supplied function for every node. In fact, I read and debugged so much code regarding bpmn-js and their base libs, that I could join your team to write a proper documentation 👍 (no offense)

I have several use cases for this short circuit approach.

One of them: Some elements I use, bpmn as well as extensions, can only occur once in my tree. (E.g. an extension element for bpmn:Definitions). I is not necessary to continue traversal once the element the rule applies to was handled.

Another one: see #52

nikku commented 3 years ago

In fact, I read and debugged so much code regarding bpmn-js and their base libs, that I could join your team to write a proper documentation +1 (no offense)

Always happy to see guys contribute, in whatever capacity possible :wink:.

Never mind my previous comment. I was under the impression that we traverse once, too.

We do not do that, deliberately and I remember now, why: Each rule is run in isolation and a rule screwing things up is not supposed to interfere with other rules doing their things.

Let's put that current way of internal working aside though for a moment. Wouldn't the proper "optimization" a traversal that checks all rules in a single traversal pass?

adroste commented 3 years ago

I think it does not matter If you do the loop:

// Approach 1 - your proposal

for node in nodes:
  for rule in rules:
    rule(node)

// Approach 2 - current implementation

for rule in rules:
  for node in nodes:
    rule(node)

Despite, I like the fact, that the rules run in isolation.

Nevertheless, this does not change anything regarding my proposal.

With Approach 2 you can skip the rest of the iteration/traversal if the rule "finished early". With Approach 1 you would just change the nested loops but still call the rule every time.

nikku commented 3 years ago

Returning false to "not continue traversing" from the rule's perspective is a common pattern, this would be the reason why I'd implement it. Note that I'd not implement that as an abort signal but rather as a stop diving deeper signal.

It makes more sense once a user is more explicit about when a check is supposed to happen (on enter or leave of a given node). As you mention, regardless of which traversal strategy is used (https://github.com/bpmn-io/bpmnlint/issues/53#issuecomment-752228176), this could still be valuable in some use cases.

Checkout https://github.com/bpmn-io/bpmnlint/pull/54, it contains a proof of concept of the feature.

adroste commented 3 years ago

Works for me, see https://github.com/bpmn-io/bpmnlint/pull/54#issuecomment-752307743