Open p5pRT opened 7 years ago
Back in [perl #73480] (gh #10222) I noticed that my gcc at the time (4.4.2) was failing to achieve a simple optimization, and we estimated at the time that tricking gcc into achieving it via 339aac22c2 made perl programs built with it about 1% faster overall. For some reason at the time there was no discussion of the effect this might have for other compilers.
Since then gcc has moved on, and my current install (4.8.4) does not need the trickery. In the meantime, eb578fdb55 implemented a blanket removal of all register declarations, as a result of which the trickery could potentially result in a similar or greater slowdown with any compiler that failed to see it could be elided.
So I'm not sure what the best state for the runloop would be right now, but I suspect we should just remove the extra variable:
--- a/run.c
+++ b/run.c
@@ -36,10 +36,9 @@
int
Perl_runops_standard(pTHX)
{
- OP *op = PL_op;
- PERL_DTRACE_PROBE_OP(op);
- while ((PL_op = op = op->op_ppaddr(aTHX))) {
- PERL_DTRACE_PROBE_OP(op);
+ PERL_DTRACE_PROBE_OP(PL_op);
+ while ((PL_op = PL_op->op_ppaddr(aTHX))) {
+ PERL_DTRACE_PROBE_OP(PL_op);
}
PERL_ASYNC_CHECK();
Locally, a build with ./Configure -des -Doptimize='-g -O6'
now makes my main loop in Perl_runops_standard simply:
loop:
callq *16(%rax) # op_ppaddr
test %rax, %rax
mov %rax, 0x344913(%rip) # PL_op
jne loop
.. both with and without the change. With threads in the mix I get (again with or without the patch):
loop:
mov %rbx, %rdi
callq *16(%rax)
test %rax, %rax
mov %rax, 8(%rbx)
jne loop
With -O0 there's unsurprisingly a noticeable improvement with the patch, but I guess that's only really of interest as a clue to what older or less developed compilers might be doing.
It'd be useful for people with different compilers or significantly different platforms to do similar tests.
(Meta: should we add a 'severity=optimization'?)
On Wed\, 25 Jan 2017 06:23:26 -0800\, hv wrote:
So I'm not sure what the best state for the runloop would be right now\, but I suspect we should just remove the extra variable:
--- a/run.c +++ b/run.c @@ -36\,10 +36\,9 @@ int Perl_runops_standard(pTHX) { - OP *op = PL_op; - PERL_DTRACE_PROBE_OP(op); - while ((PL_op = op = op->op_ppaddr(aTHX))) { - PERL_DTRACE_PROBE_OP(op); + PERL_DTRACE_PROBE_OP(PL_op); + while ((PL_op = PL_op->op_ppaddr(aTHX))) { + PERL_DTRACE_PROBE_OP(PL_op); } PERL_ASYNC_CHECK();
I've made a branch hv/runloop with this to make it easier for people to experiment with.
Hugo
With perl-5.26.0 having been released\, this ticket is now up for consideration.
Thank you very much. -- James E Keenan (jkeenan@cpan.org)
On Wed\, 25 Jan 2017 14:23:26 GMT\, hv wrote:
This is a bug report for perl from hv@crypt.org\, generated with the help of perlbug 1.40 running under perl 5.25.10.
----------------------------------------------------------------- [Please describe your issue here]
Back in [perl #73480] I noticed that my gcc at the time (4.4.2) was failing to achieve a simple optimization\, and we estimated at the time that tricking gcc into achieving it via 339aac22c2 made perl programs built with it about 1% faster overall. For some reason at the time there was no discussion of the effect this might have for other compilers.
Since then gcc has moved on\, and my current install (4.8.4) does not need the trickery. In the meantime\, eb578fdb55 implemented a blanket removal of all register declarations\, as a result of which the trickery could potentially result in a similar or greater slowdown with any compiler that failed to see it could be elided.
So I'm not sure what the best state for the runloop would be right now\, but I suspect we should just remove the extra variable:
--- a/run.c +++ b/run.c @@ -36\,10 +36\,9 @@ int Perl_runops_standard(pTHX) { - OP *op = PL_op; - PERL_DTRACE_PROBE_OP(op); - while ((PL_op = op = op->op_ppaddr(aTHX))) { - PERL_DTRACE_PROBE_OP(op); + PERL_DTRACE_PROBE_OP(PL_op); + while ((PL_op = PL_op->op_ppaddr(aTHX))) { + PERL_DTRACE_PROBE_OP(PL_op); } PERL_ASYNC_CHECK();
Locally\, a build with `./Configure -des -Doptimize='-g -O6'` now makes my main loop in Perl_runops_standard simply: loop: callq *16(%rax) # op_ppaddr test %rax\, %rax mov %rax\, 0x344913(%rip) # PL_op jne loop .. both with and without the change. With threads in the mix I get (again with or without the patch): loop: mov %rbx\, %rdi callq *16(%rax) test %rax\, %rax mov %rax\, 8(%rbx) jne loop
How do you get this data and what is its significance?
I ask so that we can try this out on various platforms/configurations.
With -O0 there's unsurprisingly a noticeable improvement with the patch\, but I guess that's only really of interest as a clue to what older or less developed compilers might be doing.
It'd be useful for people with different compilers or significantly different platforms to do similar tests.
Thank you very much.
-- James E Keenan (jkeenan@cpan.org)
The RT System itself - Status changed from 'new' to 'open'
On Mon\, 05 Jun 2017 11:22:31 -0700\, jkeenan wrote:
How do you get this data and what is its significance?
I got it from gdb\, then cleaned it up for legibility. You'll need a perl compiled with -g\, then:
% gdb /path/to/perl
(gdb) disass Perl_runops_standard
Dump of assembler code for function Perl_runops_standard:
0x000000000062ee20 \<+0>: sub $0x8\,%rsp
0x000000000062ee24 \<+4>: mov 0x2ef335(%rip)\,%rax # 0x91e160 \<PL_op>
0x000000000062ee2b \<+11>: nopl 0x0(%rax\,%rax\,1)
0x000000000062ee30 \<+16>: callq *0x10(%rax)
0x000000000062ee33 \<+19>: test %rax\,%rax
0x000000000062ee36 \<+22>: mov %rax\,0x2ef323(%rip) # 0x91e160 \<PL_op>
0x000000000062ee3d \<+29>: jne 0x62ee30 \<Perl_runops_standard+16>
0x000000000062ee3f \<+31>: mov 0x2ee523(%rip)\,%eax # 0x91d368 \<PL_sig_pending>
0x000000000062ee45 \<+37>: test %eax\,%eax
0x000000000062ee47 \<+39>: jne 0x62ee57 \<Perl_runops_standard+55>
0x000000000062ee49 \<+41>: movb $0x0\,0x2eece8(%rip) # 0x91db38 \<PL_tainted>
0x000000000062ee50 \<+48>: xor %eax\,%eax
0x000000000062ee52 \<+50>: add $0x8\,%rsp
0x000000000062ee56 \<+54>: retq
0x000000000062ee57 \<+55>: callq *0x2ec93b(%rip) # 0x91b798 \<PL_signalhook>
0x000000000062ee5d \<+61>: jmp 0x62ee49 \<Perl_runops_standard+41>
End of assembler dump.
(gdb) ^D
%
I actually get this from a handy (and hacky) script of mine (attached) that invokes gdb with expect(1) and does much of the cleanup I find useful:
% disfunc /path/to/perl Perl_runops_standard Perl_runops_standard: sub $8\, %rsp mov 0x2ef335(%rip)\, %rax # PL_op nopl 0(%rax\, %rax\, 1) l0: callq *16(%rax) test %rax\, %rax mov %rax\, 0x2ef323(%rip) # PL_op jne l0 mov 0x2ee523(%rip)\, %eax # PL_sig_pending test %eax\, %eax jne l1 l2: movb $0\, 0x2eece8(%rip) # PL_tainted xor %eax\, %eax add $8\, %rsp retq l1: callq *0x2ec93b(%rip) # PL_signalhook jmp l2 %
Other platforms often come with their own debugger\, with their own incantations and their own output formats: you'd need to read the appropriate docs on how to drive them. But this function is simple enough that anyone with assembler experience should be able to figure out pretty much what's going on from whatever output they produce.
As for the significance: this is the central loop of perl's runtime. If a particular compiler is not compiling it to optimal assembler\, users of a perl built with that compiler will be paying the price of that with every op of every program they run. But it was unwise to mess with it back in #73480 for the benefit of a particular version of a particular compiler without considering the effect that might have on others\, and I'd like to avoid making that mistake again. :)
Hugo
I've rebased this on latest blead, I still think we should consider doing this.
@nwc10, you might find this one interesting.
Since many platforms now pass (the first n) function arguments in registers rather than via the stack, and all OPs are going to want the value of op
/PL_op
, I queried on IRC whether it might be beneficial to change the PP calling convention to:
op->op_ppaddr(aTHX_, op)
.
Otherwise - AFAICT - we have (in the runloop) the OP in a register but save it to (effectively) L1 cache, call the function, which then loads the OP back out of L1 into a register.
I'm not sure what the main compilers would produce if that was instead:
op->op_ppaddr(aTHX_, PL_op)
(Might even push it to op->op_ppaddr(aTHX_, PL_op, PL_stack_sp)
, since almost every OP needs the SP right away.)
@richardleach this has gone a bit over 4 years so far without a hint that anyone feels empowered/interested to consider this, so I'd be inclined to wait and see if that changes before considering any extension to it.
Migrated from rt.perl.org#130643 (status was 'open')
Searchable as RT130643$