Number-Phone-3.6001: performance drop since 5.31.6

eserte commented 4 years ago

As a followup to @jkeenan's findings in https://github.com/Perl/perl5/issues/17553, here are the runtimes of Number-Phone-3.6001's test suite on my smoker systems (various freebsd and linux systems) sorted by perl version:

$ perl -Mversion -nle '/X-Test-Reporter-Perl: (.*)/ and $v{$ARGV} = version->new($1); /, (\d+ wallclock secs.*)/ and $t{$ARGV} = $1;  END { for $f (sort { $v{$a} <=> $v{$b} } keys %v) { print "$v{$f}: $t{$f}" } }'  {2019,2020}-*/pass.Number-Phone-3.6001.*
...
v5.28.0: 29 wallclock secs ( 1.45 usr  0.19 sys + 26.47 cusr  2.58 csys = 30.69 CPU)
v5.28.1: 28 wallclock secs ( 1.23 usr  0.41 sys + 25.24 cusr  3.24 csys = 30.12 CPU)
v5.28.1: 26 wallclock secs ( 0.11 usr  0.79 sys + 13.38 cusr  9.88 csys = 24.16 CPU)
v5.28.1: 26 wallclock secs ( 0.75 usr  0.14 sys + 21.31 cusr  2.01 csys = 24.21 CPU)
v5.28.1: 16 wallclock secs ( 0.23 usr  0.36 sys + 12.84 cusr  2.84 csys = 16.27 CPU)
v5.28.1: 31 wallclock secs ( 1.46 usr  0.18 sys + 27.67 cusr  2.52 csys = 31.83 CPU)
v5.28.1: 24 wallclock secs ( 1.34 usr  0.19 sys + 21.73 cusr  1.84 csys = 25.10 CPU)
v5.28.1: 17 wallclock secs ( 0.26 usr  0.32 sys + 13.16 cusr  3.20 csys = 16.94 CPU)
v5.28.2: 28 wallclock secs ( 1.44 usr  0.19 sys + 25.55 cusr  2.43 csys = 29.61 CPU)
v5.28.2: 26 wallclock secs ( 0.71 usr  0.14 sys + 21.26 cusr  2.06 csys = 24.17 CPU)
v5.28.2: 24 wallclock secs ( 0.40 usr  0.11 sys + 20.61 cusr  1.88 csys = 23.00 CPU)
v5.28.2: 21 wallclock secs ( 1.31 usr  0.15 sys + 18.89 cusr  1.81 csys = 22.16 CPU)
v5.28.2: 16 wallclock secs ( 0.23 usr  0.38 sys + 13.05 cusr  2.54 csys = 16.20 CPU)
v5.28.2: 46 wallclock secs ( 0.48 usr  0.09 sys + 21.11 cusr  1.57 csys = 23.25 CPU)
v5.28.2: 22 wallclock secs ( 1.35 usr  0.17 sys + 19.72 cusr  1.94 csys = 23.18 CPU)
v5.28.2: 15 wallclock secs ( 0.29 usr  0.35 sys + 12.82 cusr  2.64 csys = 16.10 CPU)
v5.28.2: 16 wallclock secs ( 0.27 usr  0.36 sys + 13.12 cusr  2.39 csys = 16.14 CPU)
v5.28.2: 16 wallclock secs ( 0.27 usr  0.29 sys + 13.08 cusr  2.81 csys = 16.45 CPU)
v5.28.2: 22 wallclock secs ( 0.37 usr  0.10 sys + 20.04 cusr  1.63 csys = 22.14 CPU)
v5.30.0: 44 wallclock secs ( 1.55 usr  0.41 sys + 38.04 cusr  5.95 csys = 45.96 CPU)
v5.30.0: 17 wallclock secs ( 0.26 usr  0.33 sys + 13.65 cusr  2.84 csys = 17.08 CPU)
v5.30.0: 28 wallclock secs ( 0.73 usr  0.14 sys + 22.94 cusr  2.09 csys = 25.90 CPU)
v5.30.0: 25 wallclock secs ( 1.39 usr  0.19 sys + 22.48 cusr  1.85 csys = 25.91 CPU)
v5.30.0: 17 wallclock secs ( 0.20 usr  0.38 sys + 13.66 cusr  2.74 csys = 16.98 CPU)
v5.30.0: 31 wallclock secs ( 1.38 usr  0.27 sys + 26.78 cusr  3.73 csys = 32.16 CPU)
v5.30.0: 24 wallclock secs ( 0.36 usr  0.13 sys + 21.12 cusr  1.71 csys = 23.32 CPU)
v5.30.0: 31 wallclock secs ( 1.55 usr  0.17 sys + 28.13 cusr  2.66 csys = 32.51 CPU)
v5.30.1: 25 wallclock secs ( 0.35 usr  0.19 sys + 21.96 cusr  1.93 csys = 24.43 CPU)
v5.30.1: 24 wallclock secs ( 0.32 usr  0.20 sys + 21.36 cusr  1.80 csys = 23.68 CPU)
v5.30.1: 16 wallclock secs ( 0.22 usr  0.34 sys + 13.54 cusr  2.96 csys = 17.06 CPU)
v5.30.1: 94 wallclock secs ( 1.46 usr  0.30 sys + 40.80 cusr 44.91 csys = 87.48 CPU)
v5.30.1: 37 wallclock secs ( 0.91 usr  0.16 sys + 29.66 cusr  5.44 csys = 36.16 CPU)
v5.30.1: 88 wallclock secs ( 0.67 usr  0.13 sys + 24.96 cusr  2.35 csys = 28.11 CPU)
v5.30.1: 24 wallclock secs ( 1.41 usr  0.16 sys + 21.74 cusr  1.87 csys = 25.18 CPU)
v5.30.1: 18 wallclock secs ( 0.30 usr  0.38 sys + 14.69 cusr  3.40 csys = 18.77 CPU)
v5.30.1: 25 wallclock secs ( 1.33 usr  0.19 sys + 22.44 cusr  1.87 csys = 25.83 CPU)
v5.30.1: 17 wallclock secs ( 0.20 usr  0.40 sys + 13.54 cusr  3.41 csys = 17.55 CPU)
v5.30.1: 30 wallclock secs ( 0.79 usr  0.19 sys + 24.17 cusr  4.09 csys = 29.23 CPU)
v5.30.1: 32 wallclock secs ( 0.50 usr  0.09 sys + 27.47 cusr  1.99 csys = 30.05 CPU)
v5.31.2: 18 wallclock secs ( 0.34 usr  0.25 sys + 14.36 cusr  2.81 csys = 17.76 CPU)
v5.31.2: 18 wallclock secs ( 0.28 usr  0.37 sys + 14.51 cusr  2.81 csys = 17.97 CPU)
v5.31.4: 17 wallclock secs ( 0.27 usr  0.30 sys + 13.88 cusr  2.82 csys = 17.27 CPU)
v5.31.4: 17 wallclock secs ( 0.16 usr  0.42 sys + 13.70 cusr  2.46 csys = 16.74 CPU)
v5.31.4: 18 wallclock secs ( 0.29 usr  0.35 sys + 15.10 cusr  2.57 csys = 18.31 CPU)
v5.31.5: 16 wallclock secs ( 0.15 usr  0.44 sys + 13.63 cusr  2.66 csys = 16.88 CPU)
v5.31.5: 17 wallclock secs ( 0.29 usr  0.29 sys + 13.97 cusr  2.93 csys = 17.48 CPU)
v5.31.5: 17 wallclock secs ( 0.24 usr  0.38 sys + 14.73 cusr  2.72 csys = 18.07 CPU)
v5.31.5: 30 wallclock secs ( 0.69 usr  0.12 sys + 26.17 cusr  1.96 csys = 28.94 CPU)
v5.31.5: 16 wallclock secs ( 0.52 usr  0.04 sys + 15.64 cusr  0.58 csys = 16.78 CPU)
v5.31.6: 358 wallclock secs ( 1.29 usr  0.31 sys + 353.06 cusr  3.59 csys = 358.25 CPU)
v5.31.6: 426 wallclock secs ( 1.36 usr  0.19 sys + 420.02 cusr  2.29 csys = 423.86 CPU)
v5.31.6: 360 wallclock secs ( 0.71 usr  0.14 sys + 341.76 cusr  5.55 csys = 348.16 CPU)
v5.31.6: 205 wallclock secs ( 0.24 usr  0.35 sys + 202.39 cusr  2.44 csys = 205.42 CPU)
v5.31.6: 747 wallclock secs ( 0.98 usr  0.14 sys + 734.67 cusr  3.72 csys = 739.51 CPU)
v5.31.6: 207 wallclock secs ( 0.19 usr  0.40 sys + 203.86 cusr  2.66 csys = 207.11 CPU)
v5.31.6: 836 wallclock secs ( 0.89 usr  0.17 sys + 451.68 cusr  6.21 csys = 458.95 CPU)
v5.31.6: 210 wallclock secs ( 0.27 usr  0.37 sys + 206.52 cusr  3.01 csys = 210.17 CPU)
v5.31.6: 254 wallclock secs ( 0.50 usr  0.06 sys + 252.54 cusr  0.60 csys = 253.70 CPU)
v5.31.7: 482 wallclock secs ( 0.88 usr  0.17 sys + 424.76 cusr 10.64 csys = 436.45 CPU)
v5.31.7: 897 wallclock secs ( 0.82 usr  0.11 sys + 549.53 cusr  2.72 csys = 553.18 CPU)
v5.31.7: 213 wallclock secs ( 0.26 usr  0.37 sys + 210.11 cusr  2.73 csys = 213.47 CPU)
v5.31.7: 211 wallclock secs ( 0.23 usr  0.39 sys + 207.44 cusr  2.55 csys = 210.61 CPU)
v5.31.7: 264 wallclock secs ( 0.55 usr  0.02 sys + 261.01 cusr  0.95 csys = 262.53 CPU)
v5.31.7: 899 wallclock secs ( 0.92 usr  0.23 sys + 456.90 cusr  5.80 csys = 463.84 CPU)
v5.31.7: 495 wallclock secs ( 0.95 usr  0.19 sys + 466.20 cusr  4.55 csys = 471.89 CPU)
v5.31.7: 209 wallclock secs ( 0.28 usr  0.29 sys + 206.32 cusr  2.57 csys = 209.46 CPU)
v5.31.8: 256 wallclock secs ( 0.50 usr  0.04 sys + 254.44 cusr  0.74 csys = 255.72 CPU)
v5.31.8: 740 wallclock secs ( 0.78 usr  0.11 sys + 521.17 cusr  2.58 csys = 524.64 CPU)
v5.31.8: 213 wallclock secs ( 0.37 usr  0.25 sys + 209.77 cusr  3.00 csys = 213.39 CPU)
v5.31.8: 458 wallclock secs ( 1.20 usr  0.33 sys + 446.68 cusr  4.82 csys = 453.03 CPU)
v5.31.8: 894 wallclock secs ( 0.80 usr  0.31 sys + 460.94 cusr  5.95 csys = 468.00 CPU)
v5.31.8: 342 wallclock secs ( 0.72 usr  0.14 sys + 317.54 cusr  5.12 csys = 323.52 CPU)
v5.31.8: 205 wallclock secs ( 0.17 usr  0.42 sys + 202.13 cusr  2.51 csys = 205.23 CPU)
v5.31.8: 209 wallclock secs ( 0.25 usr  0.40 sys + 205.95 cusr  3.01 csys = 209.61 CPU)

Since v5.31.6 the test suite is much slower.

hvds commented 4 years ago

I don't know if it's the only cause of the slowdown, but I note that the first CN test in t/example-phone-numbers.t requires Number::Phone::StubCountry::CN, which is a 13MB source file that on my machine takes just over 2s to compile under v5.31.5, but over 100s under v5.31.6 (on a perl built with -O0).

Manual bisect points to 9f601cf3bb:

commit 9f601cf3bbfa6be3e2ab3468e77a7b79c80ff5cf (refs/bisect/bad)
Author: Tony Cook <tony@develop-help.com>
Date:   Fri Oct 25 05:26:53 2019 +1100

    Faster feature checks

    Perform only a bit check instead of a much more expensive hash
    lookup to test features.
[...]

I'm testing this with a build and install of perl, followed by build/install of Class-Inspector-1.34, inc-latest-0.500, File-ShareDir-Install-0.13 and File-ShareDir-1.116, then in the directory in which I've built Number-Phone-3.6001:

/usr/bin/time $PERL -Ilib -wle 'use Number::Phone::StubCountry::CN; 1'

I'll try to dig further - @tonycoz can you take a look as well?

hvds commented 4 years ago

For completeness I verified that building with -O6 has equivalent numbers:

# before 9f601cf3
1.23user 0.05system 0:01.29elapsed 100%CPU (0avgtext+0avgdata 253412maxresident)k
0inputs+0outputs (0major+61898minor)pagefaults 0swaps
# after 9f601cf3
119.26user 0.09system 1:59.35elapsed 99%CPU (0avgtext+0avgdata 255280maxresident)k
0inputs+0outputs (0major+62413minor)pagefaults 0swaps

jkeenan commented 4 years ago

I can confirm Hugo's findings for Linux.

$ thisperl -v |head -2 | tail -1
This is perl 5, version 31, subversion 6 (v5.31.6 (v5.31.5-111-g698f6cc7d5)) built for x86_64-linux
$ time thisperl -Ilib -wle 'use Number::Phone::StubCountry::CN; 1'

real    0m1.886s
user    0m1.814s
sys 0m0.072s

#####

$ thisperl -v | head -2 | tail -1
This is perl 5, version 31, subversion 6 (v5.31.6 (v5.31.5-112-g9f601cf3bb)) built for x86_64-linux
$ time thisperl -Ilib -wle 'use Number::Phone::StubCountry::CN; 1'

real    2m13.102s
user    2m13.001s
sys 0m0.100s

khwilliamson commented 4 years ago

On 2/11/20 12:27 PM, James E Keenan wrote:

I can confirm Hugo's findings for Linux.

|$ thisperl -v |head -2 | tail -1 This is perl 5, version 31, subversion 6 (v5.31.6 (v5.31.5-111-g698f6cc7d5)) built for x86_64-linux $ time thisperl -Ilib -wle 'use Number::Phone::StubCountry::CN; 1' real 0m1.886s user 0m1.814s sys 0m0.072s ##### $ thisperl -v | head -2 | tail -1 This is perl 5, version 31, subversion 6 (v5.31.6 (v5.31.5-112-g9f601cf3bb)) built for x86_64-linux $ time thisperl -Ilib -wle 'use Number::Phone::StubCountry::CN; 1' real 2m13.102s user 2m13.001s sys 0m0.100s |

So I think the ANYOFR commit being blamed is incorrect. I have done some profiling and changing things around to not use it, and it doesn't look like it's involved at all.

It was commited a couple weeks after 9f601cf.

khwilliamson commented 4 years ago

Let me try that table again, but quoting this time

-----------------------------------------------
                0.00    0.00      22/3729502     Perl_newPVOP <cycle 1> [181]
                0.00    0.00      77/3729502     Perl_newWHILEOP <cycle 1> [171]
                0.00    0.00     132/3729502     Perl_newPADOP <cycle 1> [168]
                0.01    0.00     182/3729502     Perl_newFOROP <cycle 1> [162]
                0.02    0.00     535/3729502     Perl_newPMOP <cycle 1> [124]
                0.08    0.00    2728/3729502     S_newMETHOP_internal <cycle 1> [49]
                0.08    0.00    2845/3729502     Perl_alloc_LOGOP [47]
                7.45    0.00  261747/3729502     Perl_newUNOP_AUX <cycle 1> [17]
                7.64    0.00  268519/3729502     Perl_newSTATEOP <cycle 1> [16]
                7.80    0.00  274226/3729502     Perl_newUNOP <cycle 1> [15]
               15.20    0.00  534185/3729502     Perl_newLISTOP <cycle 1> [12]
               15.29    0.00  537356/3729502     Perl_newOP <cycle 1> [11]
               22.42    0.00  787948/3729502     Perl_newBINOP <cycle 1> [10]
               30.13    0.00 1059000/3729502     Perl_newSVOP <cycle 1> [9]
[3]     97.2  106.11    0.00 3729502         Perl_Slab_Alloc [3]
                0.00    0.00   13387/13387       S_new_slab [569]
-----------------------------------------------

[Edited by @hvds to replace single backtick with triple backtick]

jkeenan commented 4 years ago

I can confirm Hugo's findings for Linux.

$ thisperl -v |head -2 | tail -1
This is perl 5, version 31, subversion 6 (v5.31.6 (v5.31.5-111-g698f6cc7d5)) built for x86_64-linux
$ time thisperl -Ilib -wle 'use Number::Phone::StubCountry::CN; 1'

real  0m1.886s
user  0m1.814s
sys   0m0.072s

#####

$ thisperl -v | head -2 | tail -1
This is perl 5, version 31, subversion 6 (v5.31.6 (v5.31.5-112-g9f601cf3bb)) built for x86_64-linux
$ time thisperl -Ilib -wle 'use Number::Phone::StubCountry::CN; 1'

real  2m13.102s
user  2m13.001s
sys   0m0.100s

And I was able to record similar differences in execution time on FreeBSD-11:

$ uname -mrs
FreeBSD 11.2-STABLE amd64

[perlmonger: Number-Phone-3.6001] $ $thisperl -v | head -2 | tail -1
This is perl 5, version 31, subversion 6 (v5.31.6 (v5.31.5-111-g698f6cc7d5)) built for amd64-freebsd-thread-multi
[perlmonger: Number-Phone-3.6001] $ /usr/bin/time $thisperl -Ilib -wle 'use Number::Phone::StubCountry::CN; 1'
        2.98 real         2.78 user         0.16 sys

[perlmonger: Number-Phone-3.6001] $ $thisperl -v | head -2 | tail -1
This is perl 5, version 31, subversion 6 (v5.31.6 (v5.31.5-112-g9f601cf3bb)) built for amd64-freebsd-thread-multi
[perlmonger: Number-Phone-3.6001] $ /usr/bin/time $thisperl -Ilib -wle 'use Number::Phone::StubCountry::CN; 1'
      110.18 real       109.80 user         0.23 sys

hvds commented 4 years ago

Oh, and here are some salient results of the profiling, on one entry on one test file in the distro

Can we get more detail, and a corresponding extract against perl before 9f601cf for comparison?

Cheers,

Hugo

khwilliamson commented 4 years ago

gcov.txt

That's the output just before the blamed commit. I don't know what details you want. It's clear to me that 97% of the time is being spent in mallocs; everything else is trivial by comparison.

jkeenan commented 4 years ago

My hunch is that most of the difference in performance between the two commits is associated with Number::Phone::StubCountry::CN.

If I clone t/example-phone-numbers.t to a file named t/nocn-example-phone-numbers.t, the comment out all the rows which appear to provide CN test data ...

$ diff t/example-phone-numbers.t t/nocn-example-phone-numbers.t 
725,739c725,739
< { class => 'Number::Phone::Lib', args => ['CN','+861012345678'], methods => ['is_fixed_line'] },
< { class => 'Number::Phone::Lib', args => ['CN','1012345678'], methods => ['is_fixed_line'] },
< { class => 'Number::Phone::Lib', args => ['+861012345678'], methods => ['is_fixed_line'] },
< { class => 'Number::Phone::Lib', args => ['CN','+8613123456789'], methods => ['is_mobile'] },
< { class => 'Number::Phone::Lib', args => ['CN','13123456789'], methods => ['is_mobile'] },
< { class => 'Number::Phone::Lib', args => ['+8613123456789'], methods => ['is_mobile'] },
< { class => 'Number::Phone::Lib', args => ['CN','+868001234567'], methods => ['is_tollfree'] },
< { class => 'Number::Phone::Lib', args => ['CN','8001234567'], methods => ['is_tollfree'] },
< { class => 'Number::Phone::Lib', args => ['+868001234567'], methods => ['is_tollfree'] },
< { class => 'Number::Phone::Lib', args => ['CN','+8616812345'], methods => ['is_specialrate'] },
< { class => 'Number::Phone::Lib', args => ['CN','16812345'], methods => ['is_specialrate'] },
< { class => 'Number::Phone::Lib', args => ['+8616812345'], methods => ['is_specialrate'] },
< { class => 'Number::Phone::Lib', args => ['CN','+864001234567'], methods => ['is_specialrate'] },
< { class => 'Number::Phone::Lib', args => ['CN','4001234567'], methods => ['is_specialrate'] },
< { class => 'Number::Phone::Lib', args => ['+864001234567'], methods => ['is_specialrate'] },
---
> #{ class => 'Number::Phone::Lib', args => ['CN','+861012345678'], methods => ['is_fixed_line'] },
> #{ class => 'Number::Phone::Lib', args => ['CN','1012345678'], methods => ['is_fixed_line'] },
> #{ class => 'Number::Phone::Lib', args => ['+861012345678'], methods => ['is_fixed_line'] },
> #{ class => 'Number::Phone::Lib', args => ['CN','+8613123456789'], methods => ['is_mobile'] },
> #{ class => 'Number::Phone::Lib', args => ['CN','13123456789'], methods => ['is_mobile'] },
> #{ class => 'Number::Phone::Lib', args => ['+8613123456789'], methods => ['is_mobile'] },
> #{ class => 'Number::Phone::Lib', args => ['CN','+868001234567'], methods => ['is_tollfree'] },
> #{ class => 'Number::Phone::Lib', args => ['CN','8001234567'], methods => ['is_tollfree'] },
> #{ class => 'Number::Phone::Lib', args => ['+868001234567'], methods => ['is_tollfree'] },
> #{ class => 'Number::Phone::Lib', args => ['CN','+8616812345'], methods => ['is_specialrate'] },
> #{ class => 'Number::Phone::Lib', args => ['CN','16812345'], methods => ['is_specialrate'] },
> #{ class => 'Number::Phone::Lib', args => ['+8616812345'], methods => ['is_specialrate'] },
> #{ class => 'Number::Phone::Lib', args => ['CN','+864001234567'], methods => ['is_specialrate'] },
> #{ class => 'Number::Phone::Lib', args => ['CN','4001234567'], methods => ['is_specialrate'] },
> #{ class => 'Number::Phone::Lib', args => ['+864001234567'], methods => ['is_specialrate'] },

... and then run the CN-less file through a prove built from a perl at the BEFORE and AFTER commits, then there is little difference in execution time between the two perls.

before:
[perlmonger: Number-Phone-3.6001] $ echo $COMMIT
9f601cf3bbfa6be3e2ab3468e77a7b79c80ff5cf^
[perlmonger: Number-Phone-3.6001] $ time $THISPROVE -b t/nocn-example-phone-numbers.t
t/nocn-example-phone-numbers.t .. ok      
All tests successful.
Files=1, Tests=3632,  6 wallclock secs ( 0.28 usr  0.02 sys +  5.61 cusr  0.34 csys =  6.25 CPU)
Result: PASS
        6.38 real         5.99 user         0.36 sys

after:
$ echo $COMMIT
9f601cf3bbfa6be3e2ab3468e77a7b79c80ff5cf
[perlmonger: Number-Phone-3.6001] $ time $THISPROVE -b t/nocn-example-phone-numbers.t
t/nocn-example-phone-numbers.t .. ok      
All tests successful.
Files=1, Tests=3632,  6 wallclock secs ( 0.27 usr  0.04 sys +  5.85 cusr  0.34 csys =  6.50 CPU)
Result: PASS
        6.67 real         6.26 user         0.39 sys

But when I restore the CN data rows, the test execution time for the AFTER commit soars into multiple minutes.

Thank you very much. Jim Keenan

hvds commented 4 years ago

gcov.txt

That's the output just before the blamed commit. I don't know what details you want. It's clear to me that 97% of the time is being spent in mallocs; everything else is trivial by comparison.

Well the question is why 97% of the time is being spent in mallocs, which is not immediately obvious from the original table alone.

So it looks like the top line changes before/after as:

index % time    self  children    called     name
                0.00    0.00      59/84866       Perl_newPVOP <cycle 1> [853]
                0.00    0.00      22/3729502     Perl_newPVOP <cycle 1> [181]

The headers don't make it hugely clear, but I assume this is saying that before, it called Perl_Slab_Alloc 84,866 times, of which 59 calls came from Perl_newPVOP. If that's correct, then the problem is not the speed of something at the low level, but the logic of something at the high level that has hugely changed the number of calls ending up in Slab_Alloc. If that's correct, then the calls occurring less than 100k times in your first table have not changed hugely, but the rest have: newUNOP_AUX, newSTATEOP, newUNOP, newLISTOP, newOP, newBINOP, newSVOP.

My first guess from that is that some optree is supposed to be cached but no longer is, so it is getting recreated many times. But the ratios of before and after are not consistent enough for it to be a single optree, so more likely a raft of different optrees each being recreated 10-200 times.

For details, I was hoping to understand how you generated this profile so I could better understand what we were looking at, and maybe reproduce it with variations myself.

Hugo

tonycoz commented 4 years ago

From profiling the code it's now spending a huge amount of time searching the free op list associated with a slab, for the ::CN case, Perl_Slab_Alloc() is being called ~3.6 million times in both f601cf3bb (before the change) and after the change, the problem is the time spent in Perl_Slab_Alloc() itself:

This is before the change:

and this is in blead:

after a lot of (way too much) consideration I can only think of one change - the size of a cop now had an added U32 for cop_features, so rather than finding an existing freed op that can be used, it searches the whole list before finally allocating a new one.

(I used callgrind and kcachegrind.)

I'll look at a fix tomorrow unless someone beats me to it.

hvds commented 4 years ago

Attempting to view those images I get secure connection failed with error code SSL_ERROR_NO_CYPHER_OVERLAP, trying from either firefox or chrome. Unfortunately neither browser shows what cyphers are offered by either side, so I'm not sure what I can do about that; I'll try contacting github support.

Update: I can see them fine if I s/https/http/.

tonycoz commented 4 years ago

They show fine in Firefox on Windows 10 here.

richardleach commented 4 years ago

@hvds - you'll still have to compare the two lists, but these two should be reliable:

Not the whole story if there's a proxy in the way.

demerphq commented 4 years ago

On Wed, 12 Feb 2020 at 05:56, Tony Cook notifications@github.com wrote:

From profiling the code it's now spending a huge amount of time searching the free op list associated with a slab, for the ::CN case, Perl_Slab_Alloc() is being called ~3.6 million times in both f601cf3bb (before the change) and after the change, the problem is the time spent in Perl_Slab_Alloc() itself:

This is before the change:

[image: image] https://user-images.githubusercontent.com/166429/74303934-9608dc00-4daf-11ea-894f-e535fc9ab0a9.png

and this is in blead: [image: image] https://user-images.githubusercontent.com/166429/74303913-86899300-4daf-11ea-9404-bd5ae371a0d7.png

after a lot of consideration I can only think of one change - the size of a cop now had an added U32 for cop_features, so rather than finding an existing freed op that can be used, it searches the whole list before finally allocating a new one.

(I used callgrind and kcachegrind.)

Could it be that you need to change the sizes that we allocate slabs for now that the op-sizes have changed?

I seem to recall that we precreate a set of different slabs based on the expected sizes of the objects we create, if you have changed the size of opcodes then maybe we just need to change that initialization.

Hazy guessing here based on a bell ringing faintly.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

hvds commented 4 years ago

@hvds - you'll still have to compare the two lists, but these two should be reliable:
* https://www.ssllabs.com/ssltest/analyze.html?d=user%2dimages.githubusercontent.com&latest

* https://clienttest.ssllabs.com:8443/ssltest/viewMyClient.html
Not the whole story if there's a proxy in the way.

Thanks @richardleach, as far as I can see we do have compatible TLS protocol versions and cipher suites; and there certainly shouldn't be a proxy (but I wouldn't put it past BT). I've reported it to github, but since the workaround is easy I don't want to waste anyone else's time on it.

tonycoz commented 4 years ago

Could it be that you need to change the sizes that we allocate slabs for now that the op-sizes have changed? I seem to recall that we precreate a set of different slabs based on the expected sizes of the objects we create, if you have changed the size of opcodes then maybe we just need to change that initialization. Hazy guessing here based on a bell ringing faintly. Yves

No, ops are allocated from slabs specific to the CV.

From what I can tell from 238a4c30b3 the old slab allocator didn't try to re-use freed ops at all, and didn't try to separate the slabs by size.

It simply allocated a constant-sized block, allocated from the block until it didn't have enough space, at which point it allocated a new block.

Each op had a reference count on the slab block, which was used to free the block when all the ops it contains were freed.

The newer CV-based allocator works similarly except that it keeps a freed list per slab and searches for that before allocating from free space in the slab or allocating a new slab. It expands the size of the slab allocated (up to a maximum) so that small functions end up with a small slab, but large functions don't end up with the cost of a large number of small allocations.

tonycoz commented 4 years ago

I have a solution I'm working on, but I'm running into a bug I haven't been able to resolve.

My working branch is at https://github.com/tonycoz/perl5/tree/17555-faster-freed-ops-reuse it throws an assertion in some tests, but I don't know what I've changed to cause it.

If anyone else has the time, please take a look and let me know what stupid mistake I've made.

hvds commented 4 years ago

I have a solution I'm working on, but I'm running into a bug I haven't been able to resolve.

My working branch is at https://github.com/tonycoz/perl5/tree/17555-faster-freed-ops-reuse it throws an assertion in some tests, but I don't know what I've changed to cause it.

If anyone else has the time, please take a look and let me know what stupid mistake I've made.

If it helps, the (only cause of) failure in t/re/reg_mesg.t reduces to:

% ./miniperl -e '/(?(??{}))/'
Unknown switch condition (?(...)) in regex; marked by <-- HERE in m/(?(? <-- HERE ?{}))/ at -e line 1.
miniperl: op.c:669: Perl_opslab_force_free: Assertion `savestack_count == slab->opslab_refcnt-1' failed.
Aborted (core dumped)
%

hvds commented 4 years ago

The t/re/overload.t failure reduces to much the same thing: a pattern (generated via overload this time) that includes a deferred eval block but also has a regexp-compile-time failure, in this case "Eval-group not allowed at runtime".

jkeenan commented 4 years ago

The t/re/overload.t failure reduces to much the same thing: a pattern (generated via overload this time) that includes a deferred eval block but also has a regexp-compile-time failure, in this case "Eval-group not allowed at runtime".

@hvds Where are you seeing these test failures? In Tony's branch?

hvds commented 4 years ago

The t/re/overload.t failure reduces to much the same thing: a pattern (generated via overload this time) that includes a deferred eval block but also has a regexp-compile-time failure, in this case "Eval-group not allowed at runtime".

@hvds Where are you seeing these test failures? In Tony's branch?

Yes, I built it and ran make minitest.

hvds commented 4 years ago

Applying only the DEBUG_Sv support, I get:

% ./miniperl -DSv -e '/(?(??{}))/' 2>&1 | perl -ple 's/ at -e .*//; s{([0-9a-f]{12})}{"\$".($a{$1} //= ++$s)}ge'
allocated new op slab sz 0x40, $1, head slab $1
allocating op at $2, slab $1, head slab $1
allocating op at $3, slab $1, head slab $1
allocating op at $4, slab $1, head slab $1
allocating op at $5, slab $1, head slab $1
allocating op at $6, slab $1, head slab $1
allocating op at $7, slab $1, head slab $1
allocating op at $8, slab $1, head slab $1
allocated new op slab sz 0x80, $9, head slab $1
allocating op at $10, slab $9, head slab $1
allocating op at $11, slab $9, head slab $1
Unknown switch condition (?(...)) in regex; marked by <-- HERE in m/(?(? <-- HERE ?{}))/
freeing    op at $8, slab $1, head slab $1
freeing    op at $7, slab $1, head slab $1
freeing    op at $4, slab $1, head slab $1
freeing    op at $5, slab $1, head slab $1
freeing    op at $6, slab $1, head slab $1
freeing    op at $3, slab $1, head slab $1
freeing    op at $2, slab $1, head slab $1
freeing    op at $11, slab $9, head slab $1
freeing    op at $10, slab $9, head slab $1
freeing slab $1

All the frees occur in an opslab_force_free() call. I think the issue is that op_free() relies on following the op_next and op_sibparent pointers even through already freed ops. The result is that after the patch, op_free($10), the last op to be freed (a LISTOP), does not actually end op hitting CALL_OPFREEHOOK() and that's what's causing the extra refcount.

On entry to opfree the structure is like:

o = $10
  next = $8
    next = 0 /* before patch */
    next = $4 /* after patch */
    sibparent = $3
  sibparent = 0

tonycoz commented 4 years ago

All the frees occur in an opslab_force_free() call. I think the issue is that op_free() relies on following the op_next and op_sibparent pointers even through already freed ops. The result is that after the patch, op_free($10), the last op to be freed (a LISTOP), does not actually end op hitting CALL_OPFREEHOOK() and that's what's causing the extra refcount.

On entry to opfree the structure is like:
o = $10
  next = $8
    next = 0 /* before patch */
    next = $4 /* after patch */
    sibparent = $3
  sibparent = 0

That seems kind of broken, since the OP free list is using op_next in blead, even without my change.

My new branch (in #17583 seems to avoid the problem, and avoids most of the cost of adding to the free list that my original attempt had.

jkeenan commented 4 years ago

All the frees occur in an opslab_force_free() call. I think the issue is that op_free() relies on following the op_next and op_sibparent pointers even through already freed ops. The result is that after the patch, op_free($10), the last op to be freed (a LISTOP), does not actually end op hitting CALL_OPFREEHOOK() and that's what's causing the extra refcount. On entry to opfree the structure is like:
o = $10
  next = $8
    next = 0 /* before patch */
    next = $4 /* after patch */
    sibparent = $3
  sibparent = 0
That seems kind of broken, since the OP free list is using op_next in blead, even without my change.

My new branch (in #17583 seems to avoid the problem, and avoids most of the cost of adding to the free list that my original attempt had.

Building a perl from that branch and installing Number::Phone against it ... now we once again have a reasonable time for running make test:

[Number-Phone-3.6001] 570 $ make test
Skip blib/lib/auto/share/dist/Number-Phone/Number-Phone-UK-Data.db (unchanged)
Skip blib/lib/auto/share/dist/Number-Phone/Number-Phone-NANP-Data.db (unchanged)
PERL_DL_NONLAZY=1 "/home/jkeenan/testing/smoke-me/jkeenan/tonyc/ghpr-17583/bin/perl" "-MExtUtils::Command::MM" "-MTest::Harness" "-e" "undef *Test::Harness::Switches; test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/bugfix-rt48581.t .................................. ok   
t/bugfix-rt61177.t .................................. ok    
t/bugfix_N-P-New-CC-number.t ........................ ok   
t/country-for-subclass.t ............................ ok   
t/country-test-ar.t ................................. ok    
t/country-test-at.t ................................. ok    
t/country-test-br.t ................................. ok    
t/country-test-sm.t ................................. ok    
t/dial_to.t ......................................... ok    
t/example-phone-numbers.t ........................... ok      
t/find-right-subclass.t ............................. ok   
t/formatters.t ...................................... ok   
t/libphonenumber.t .................................. ok     
t/nanp.t ............................................ ok     
t/no-stubs.t ........................................ ok   
t/number-phone-country-idd.t ........................ ok   
t/number-phone-country-multiple.t ................... ok   
t/number-phone-country-prefix.t ..................... ok     
t/number-phone-country-uk.t ......................... ok   
t/number-phone-country-warnings.t ................... ok   
t/number-phone-country.t ............................ ok     
t/pod-coverage.t .................................... skipped: Test::Pod::Coverage 1.08 required for testing POD coverage
t/pod.t ............................................. skipped: Test::Pod 1.18 required for testing POD
t/stubs.t ........................................... ok     
t/superclass_and_uk_different_method_invocations.t .. ok    
t/uk_data.t ......................................... ok     
t/uk_slurp.t ........................................ skipped: slurping is too slow so skipping under Devel::Cover and for normal installs, set AUTOMATED_TESTING to run this
t/unicode.t ......................................... ok   
t/weird_little_islands.t ............................ ok    
All tests successful.
Files=29, Tests=6255, 15 wallclock secs ( 0.58 usr  0.07 sys + 13.49 cusr  0.93 csys = 15.07 CPU)
Result: PASS

Perl / perl5

Number-Phone-3.6001: performance drop since 5.31.6 #17555