Perl-Critic / PPI

53 stars 44 forks source link

keep a position cache to speed up sibling iteration #287

Closed haarg closed 1 month ago

haarg commented 1 year ago

Iterating over siblings using ->next_sibling or similar methods will take quadratic time if it has to search the parent's child list every time.

Add a cache of the position in the parent's child list. When the position of a child is requested, cache the position of every child of the parent. Verifying that the cache is accurate is nearly instant, so we can use that to invalidate the cache rather than invalidating on modifications to the structure.

Update the position method to use this cache, and change all of the child iterating functions to use position.

haarg commented 1 year ago

This is meant to be an optimization, so I'm not sure of a good way to test it. For a demonstration of the impact it has though:

$ cat iterate-siblings.pl
use strict;
use warnings;

use PPI::Document;

my $code = "{\n" . join('', map "foo$_ => $_,\n", 1 .. 3000) . "};\n";

my $doc = PPI::Document->new(\$code);
my $expr = $doc->find_first('PPI::Statement::Expression');

my $element = $expr->first_element;
while ($element = $element->next_sibling) {
  1;
}

$ time perl iterate-siblings.pl # using latest PPI release

real    0m14.398s
user    0m14.150s
sys 0m0.110s
$ time perl -Ilib iterate-siblings.pl # with position cache

real    0m0.249s
user    0m0.167s
sys 0m0.038s
haarg commented 1 year ago

I should also add that we ran into this problem when using the ValuesAndExpressions::ProhibitDuplicateHashKeys critic policy on files with large hashes.

wchristian commented 1 year ago

Ahead: This looks excellent and my direct urge would be to throw it in immediately. That said: I had a heck of a week and will check this tomorrow to see if i can spot any issues. Don't think i will, but due diligence and such.

wchristian commented 1 year ago

Quick question without having looked for it in the code: Does this invalidate the cache if an element's position within its parent is changed by inserting or removing something?

haarg commented 1 year ago

No, it doesn't. The cached position is always checked before being used, so it doesn't need to be invalidated on modifications. Validating the position doesn't slow anything down, so this seemed like a safer invalidation strategy than needing to ensure every write operation cleared the cache.

haarg commented 1 year ago

I guess it's worth pointing out that this could slow down iteration in some specific cases. If you are only iterating through a small number of siblings at the start of a long list, the new code will still end up calculating the positions of every sibling where the old code would not. I don't think this is a large concern though. Iterating through the list of siblings once is not slow, so it seems unlikely it would ever be a meaningful slowdown.

I did think of an alternate strategy that could avoid this. When checking the position of a given element, it could cache (and validate/recache) the positions of the requested element, as well as its previous and next sibling. Compared to what this PR implements, it ends up doing a bit more work for each individual operation, but could be cheaper if not iterating through all children. And iterating through all siblings would still be linear, not quadratic. It also makes the code a bit more complex. I felt that on balance the routine in this PR was a better solution.

wchristian commented 1 year ago

I'm prepping for merging this, but first need to actually get a solid set of dependent tests, which i'm fiddling with over here: https://github.com/Perl-Critic/PPI/actions/workflows/test-dependents.yml

I saw that it seems you adapted some of the code style conventions you saw, like shift shift shift, and forgoing postfix constructs. If you care and feel like, you could abbreviate some of them to not have the code be unnecessarily verbose. Otherwise it looks fine in execution flow.