Open KJ7LNW opened 1 year ago
Notes on profiling C/XS code: https://www.perlmonks.org/?node_id=791611
With the above, with_time { sequence(1e2) for 0..6e4 }
went from (on my machine) 850ms to 790ms.
Ironically, the topdl
change, which is causing a segfault thanks to a fault in the var-args stuff for barf
, doesn't make sequence
go any faster once the inplace
change is made (since the XS inplace
obviously doesn't use topdl
at all), which at this writing is the only one on master
, and is getting tried a third time against downstream. With that, the above snippet takes more like 690ms.
With the changes linked above, on my machine it's now about twice as quick (450ms) as when I started.
With the above-linked commit, this is now taking about 380ms here.
Notes on identifying this and probably similar performance bottlenecks, using cachegrind
:
$ rm cachegrind.out.*
$ make core && valgrind --tool=cachegrind perl -Mblib -MPDL -e 'sequence(1e2) for 1..1e4'
$ cg_annotate cachegrind.out.*
[...]
113,127,912 (14.74%) 463,865 ( 2.48%) 89 ( 0.59%) 40,271,896 (18.65%) 819,306 (13.70%) 698 ( 1.61%) 13,008,932 (11.12%) 47,211 ( 5.03%) 23 ( 0.01%) ???:Perl_hv_common
28,971,054 ( 3.77%) 118,977 ( 0.64%) 140 ( 0.93%) 6,323,461 ( 2.93%) 218,400 ( 3.65%) 314 ( 0.73%) 3,345,234 ( 2.86%) 7,931 ( 0.84%) 511 ( 0.30%) ???:Perl_yyparse
[...]
8,390,000 ( 1.09%) 110,000 ( 0.59%) 11 ( 0.07%) 560,000 ( 0.26%) 20,001 ( 0.33%) 2 ( 0.00%) 1,090,000 ( 0.93%) 0 0 /home/osboxes/pdl-code/Basic/Primitive/pp-axisvalues.c:pdl_axisvalues_readdata
[...]
The first two are the top-two expensive functions, the most so being the unexpected Perl_hv_common
at nearly 15%. The first directly PDL-defined one, the readdata
one, only uses 1.1% of CPU. To find out what was calling that took a bit of trickery, on the suspicion that it was SvPDLV
that was slow, running inplace
twice and inserting a break on Perl_hv_common
after the first one, since Perl_hv_common
gets run during Perl startup:
$ make core && gdb perl -ex 'break XS_PDL_inplace' -ex 'run -Mblib -MPDL -e "pdl(5)->inplace for 1..2"'
[...]
Function "XS_PDL_inplace" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
[...]
(gdb) break Perl_hv_common
Breakpoint 2 at 0x55555564c840
(gdb) c
Continuing.
Breakpoint 2, 0x000055555564c840 in Perl_hv_common ()
(gdb) bt
#0 0x000055555564c840 in Perl_hv_common ()
#1 0x00005555556d9f76 in S_isa_lookup ()
#2 0x00005555556da129 in S_sv_derived_from_svpvn ()
#3 0x00007ffff7333e94 in pdl_SvPDLV (sv=sv@entry=0x5555558c4600) at pdlcore.c:37
#4 0x00007ffff7300c85 in XS_PDL_inplace (cv=<optimised out>) at /home/osboxes/pdl-code/Basic/Core/Core.xs:192
[...]
(gdb) frame 3
#3 0x00007ffff7333e94 in pdl_SvPDLV (sv=sv@entry=0x5555558c4600) at pdlcore.c:37
37 if(sv_derived_from(sv, "PDL") && !SvROK(sv)) {
That confirmed it was indeed coming from SvPDLV
, which calls the function as part of an inheritance lookup. This is how long 10m inplace
calls take before the immediate change:
$ make core && time perl -Mblib -MPDL -e '$p = pdl(5); $p->inplace for 1..1e7'
real 0m3.776s
[...]
real 0m3.663s
[...]
real 0m3.529s
This is how long after, about 20% less time:
real 0m2.871s
[...]
real 0m2.849s
[...]
real 0m2.895s
With that commit, the 10m inplace
now take about 1.2s, about 3 times quicker than before today.
The sequence
stuff now takes about 320ms.
For more performance issues, see the benchmark on https://github.com/nrdvana/perl-Math-3Space/pull/8. The stripped-down timings:
pdl> $N = 1e6
pdl> with_time { $vecs_1pdl = pdl(([1,1,1]) x $N) }
1260.28 ms
pdl> p $basis = identity(3)
[
[1 0 0]
[0 1 0]
[0 0 1]
]
pdl> with_time { $vecs_1pdl x $basis }
40.527 ms
pdl> with_time { $vecs_1pdl x $basis }
46.82 ms
pdl> with_time { $vecs_1pdl x $basis }
21.061 ms
pdl> p $vecs_1pdl->info
PDL: Double D [3,1000000]
pdl> use PDL::LinearAlgebra
pdl> with_time { $vecs_1pdl x $basis }
17.646 ms
pdl> with_time { $vecs_1pdl x $basis }
12.661 ms
pdl> with_time { $vecs_1pdl x $basis }
12.916 ms
pdl> with_time { $vecs_1pdl x $basis }
13.946 ms
Here, PDL::LinearAlgebra does make things quicker.
I was surprised to find that the
sequence()
function was the bottleneck in a Particle Swarm optimization. Ultimately I cached$sequence = sequence(@foo)
instead of generating it every time it was used.Not sure if there is room to optimize
sequence
or not, but here is the nytprof output.Before caching:
After caching:
Caching fixed it for me in my own program, but FYI in case it helps others, or if it can be addressed somehow.