Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/
Other
1.99k stars 559 forks source link

run loop optimization #15836

Open p5pRT opened 7 years ago

p5pRT commented 7 years ago

Migrated from rt.perl.org#130643 (status was 'open')

Searchable as RT130643$

p5pRT commented 7 years ago

From @hvds

Created by @hvds

Back in [perl #73480] (gh #10222) I noticed that my gcc at the time (4.4.2) was failing to achieve a simple optimization, and we estimated at the time that tricking gcc into achieving it via 339aac22c2 made perl programs built with it about 1% faster overall. For some reason at the time there was no discussion of the effect this might have for other compilers.

Since then gcc has moved on, and my current install (4.8.4) does not need the trickery. In the meantime, eb578fdb55 implemented a blanket removal of all register declarations, as a result of which the trickery could potentially result in a similar or greater slowdown with any compiler that failed to see it could be elided.

So I'm not sure what the best state for the runloop would be right now, but I suspect we should just remove the extra variable:

--- a/run.c
+++ b/run.c
@@ -36,10 +36,9 @@
 int
 Perl_runops_standard(pTHX)
 {
-    OP *op = PL_op;
-    PERL_DTRACE_PROBE_OP(op);
-    while ((PL_op = op = op->op_ppaddr(aTHX))) {
-        PERL_DTRACE_PROBE_OP(op);
+    PERL_DTRACE_PROBE_OP(PL_op);
+    while ((PL_op = PL_op->op_ppaddr(aTHX))) {
+        PERL_DTRACE_PROBE_OP(PL_op);
     }
     PERL_ASYNC_CHECK();

Locally, a build with ./Configure -des -Doptimize='-g -O6' now makes my main loop in Perl_runops_standard simply:

loop:
    callq  *16(%rax)              # op_ppaddr
    test   %rax, %rax
    mov    %rax, 0x344913(%rip)   # PL_op
    jne    loop

.. both with and without the change. With threads in the mix I get (again with or without the patch):

loop:
    mov    %rbx, %rdi
    callq  *16(%rax)
    test   %rax, %rax
    mov    %rax, 8(%rbx)
    jne    loop

With -O0 there's unsurprisingly a noticeable improvement with the patch, but I guess that's only really of interest as a clue to what older or less developed compilers might be doing.

It'd be useful for people with different compilers or significantly different platforms to do similar tests.

(Meta: should we add a 'severity=optimization'?)

Perl Info ``` Flags: category=core severity=Wishlist Site configuration information for perl 5.25.10: Configured by hv at Mon Jan 23 20:29:32 GMT 2017. Summary of my perl5 (revision 5 version 25 subversion 10) configuration: Commit id: e18c4116c82b2027a1e5d4e6b9a7214d60779053 Platform: osname=linux osvers=3.13.0-101-generic archname=x86_64-linux uname='linux shad2 3.13.0-101-generic #148-ubuntu smp thu oct 20 22:08:32 utc 2016 x86_64 x86_64 x86_64 gnulinux ' config_args='-des -Dcc=gcc -Dprefix=/opt/blead-d0 -Doptimize=-g -O0 -DDEBUGGING -Dusedevel -Uversiononly' hint=recommended useposix=true d_sigaction=define useithreads=undef usemultiplicity=undef use64bitint=define use64bitall=define uselongdouble=undef usemymalloc=n bincompat5005=undef Compiler: cc='gcc' ccflags ='-fwrapv -DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2' optimize='-g -O0' cppflags='-fwrapv -DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include' ccversion='' gccversion='4.8.4' gccosandvers='' intsize=4 longsize=8 ptrsize=8 doublesize=8 byteorder=12345678 doublekind=3 d_longlong=define longlongsize=8 d_longdbl=define longdblsize=16 longdblkind=3 ivtype='long' ivsize=8 nvtype='double' nvsize=8 Off_t='off_t' lseeksize=8 alignbytes=8 prototype=define Linker and Libraries: ld='gcc' ldflags =' -fstack-protector -L/usr/local/lib' libpth=/usr/local/lib /usr/lib/gcc/x86_64-linux-gnu/4.8/include-fixed /usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib /lib64 /usr/lib64 libs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc libc=libc-2.19.so so=so useshrplib=false libperl=libperl.a gnulibc_version='2.19' Dynamic Linking: dlsrc=dl_dlopen.xs dlext=so d_dlsymun=undef ccdlflags='-Wl,-E' cccdlflags='-fPIC' lddlflags='-shared -g -O0 -L/usr/local/lib -fstack-protector' @INC for perl 5.25.10: /opt/blead-d0/lib/perl5/site_perl/5.25.10/x86_64-linux /opt/blead-d0/lib/perl5/site_perl/5.25.10 /opt/blead-d0/lib/perl5/5.25.10/x86_64-linux /opt/blead-d0/lib/perl5/5.25.10 /opt/blead-d0/lib/perl5/site_perl/5.25.8 /opt/blead-d0/lib/perl5/site_perl Environment for perl 5.25.10: HOME=/home/hv LANG=C LANGUAGE=en_GB:en LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH= PERL_BADLANG (unset) SHELL=/bin/bash ```
p5pRT commented 7 years ago

From @hvds

On Wed\, 25 Jan 2017 06​:23​:26 -0800\, hv wrote​:

So I'm not sure what the best state for the runloop would be right now\, but I suspect we should just remove the extra variable​:

--- a/run.c +++ b/run.c @​@​ -36\,10 +36\,9 @​@​ int Perl_runops_standard(pTHX) { - OP *op = PL_op; - PERL_DTRACE_PROBE_OP(op); - while ((PL_op = op = op->op_ppaddr(aTHX))) { - PERL_DTRACE_PROBE_OP(op); + PERL_DTRACE_PROBE_OP(PL_op); + while ((PL_op = PL_op->op_ppaddr(aTHX))) { + PERL_DTRACE_PROBE_OP(PL_op); } PERL_ASYNC_CHECK();

I've made a branch hv/runloop with this to make it easier for people to experiment with.

Hugo

p5pRT commented 7 years ago

From @jkeenan

With perl-5.26.0 having been released\, this ticket is now up for consideration.

Thank you very much. -- James E Keenan (jkeenan@​cpan.org)

p5pRT commented 7 years ago

From @jkeenan

On Wed\, 25 Jan 2017 14​:23​:26 GMT\, hv wrote​:

This is a bug report for perl from hv@​crypt.org\, generated with the help of perlbug 1.40 running under perl 5.25.10.

----------------------------------------------------------------- [Please describe your issue here]

Back in [perl #73480] I noticed that my gcc at the time (4.4.2) was failing to achieve a simple optimization\, and we estimated at the time that tricking gcc into achieving it via 339aac22c2 made perl programs built with it about 1% faster overall. For some reason at the time there was no discussion of the effect this might have for other compilers.

Since then gcc has moved on\, and my current install (4.8.4) does not need the trickery. In the meantime\, eb578fdb55 implemented a blanket removal of all register declarations\, as a result of which the trickery could potentially result in a similar or greater slowdown with any compiler that failed to see it could be elided.

So I'm not sure what the best state for the runloop would be right now\, but I suspect we should just remove the extra variable​:

--- a/run.c +++ b/run.c @​@​ -36\,10 +36\,9 @​@​ int Perl_runops_standard(pTHX) { - OP *op = PL_op; - PERL_DTRACE_PROBE_OP(op); - while ((PL_op = op = op->op_ppaddr(aTHX))) { - PERL_DTRACE_PROBE_OP(op); + PERL_DTRACE_PROBE_OP(PL_op); + while ((PL_op = PL_op->op_ppaddr(aTHX))) { + PERL_DTRACE_PROBE_OP(PL_op); } PERL_ASYNC_CHECK();

Locally\, a build with `./Configure -des -Doptimize='-g -O6'` now makes my main loop in Perl_runops_standard simply​: loop​: callq *16(%rax) # op_ppaddr test %rax\, %rax mov %rax\, 0x344913(%rip) # PL_op jne loop .. both with and without the change. With threads in the mix I get (again with or without the patch)​: loop​: mov %rbx\, %rdi callq *16(%rax) test %rax\, %rax mov %rax\, 8(%rbx) jne loop

How do you get this data and what is its significance?

I ask so that we can try this out on various platforms/configurations.

With -O0 there's unsurprisingly a noticeable improvement with the patch\, but I guess that's only really of interest as a clue to what older or less developed compilers might be doing.

It'd be useful for people with different compilers or significantly different platforms to do similar tests.

Thank you very much.

-- James E Keenan (jkeenan@​cpan.org)

p5pRT commented 7 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 7 years ago

From @hvds

On Mon\, 05 Jun 2017 11​:22​:31 -0700\, jkeenan wrote​:

How do you get this data and what is its significance?

I got it from gdb\, then cleaned it up for legibility. You'll need a perl compiled with -g\, then​:

% gdb /path/to/perl (gdb) disass Perl_runops_standard Dump of assembler code for function Perl_runops_standard​:   0x000000000062ee20 \<+0>​: sub $0x8\,%rsp   0x000000000062ee24 \<+4>​: mov 0x2ef335(%rip)\,%rax # 0x91e160 \<PL_op>   0x000000000062ee2b \<+11>​: nopl 0x0(%rax\,%rax\,1)   0x000000000062ee30 \<+16>​: callq *0x10(%rax)   0x000000000062ee33 \<+19>​: test %rax\,%rax   0x000000000062ee36 \<+22>​: mov %rax\,0x2ef323(%rip) # 0x91e160 \<PL_op>   0x000000000062ee3d \<+29>​: jne 0x62ee30 \<Perl_runops_standard+16>   0x000000000062ee3f \<+31>​: mov 0x2ee523(%rip)\,%eax # 0x91d368 \<PL_sig_pending>   0x000000000062ee45 \<+37>​: test %eax\,%eax   0x000000000062ee47 \<+39>​: jne 0x62ee57 \<Perl_runops_standard+55>   0x000000000062ee49 \<+41>​: movb $0x0\,0x2eece8(%rip) # 0x91db38 \<PL_tainted>   0x000000000062ee50 \<+48>​: xor %eax\,%eax   0x000000000062ee52 \<+50>​: add $0x8\,%rsp   0x000000000062ee56 \<+54>​: retq
  0x000000000062ee57 \<+55>​: callq *0x2ec93b(%rip) # 0x91b798 \<PL_signalhook>   0x000000000062ee5d \<+61>​: jmp 0x62ee49 \<Perl_runops_standard+41> End of assembler dump. (gdb) ^D %

I actually get this from a handy (and hacky) script of mine (attached) that invokes gdb with expect(1) and does much of the cleanup I find useful​:

% disfunc /path/to/perl Perl_runops_standard Perl_runops_standard​:   sub $8\, %rsp   mov 0x2ef335(%rip)\, %rax # PL_op   nopl 0(%rax\, %rax\, 1) l0​:   callq *16(%rax)   test %rax\, %rax   mov %rax\, 0x2ef323(%rip) # PL_op   jne l0   mov 0x2ee523(%rip)\, %eax # PL_sig_pending   test %eax\, %eax   jne l1 l2​:   movb $0\, 0x2eece8(%rip) # PL_tainted   xor %eax\, %eax   add $8\, %rsp   retq l1​:   callq *0x2ec93b(%rip) # PL_signalhook   jmp l2 %

Other platforms often come with their own debugger\, with their own incantations and their own output formats​: you'd need to read the appropriate docs on how to drive them. But this function is simple enough that anyone with assembler experience should be able to figure out pretty much what's going on from whatever output they produce.

As for the significance​: this is the central loop of perl's runtime. If a particular compiler is not compiling it to optimal assembler\, users of a perl built with that compiler will be paying the price of that with every op of every program they run. But it was unwise to mess with it back in #73480 for the benefit of a particular version of a particular compiler without considering the effect that might have on others\, and I'd like to avoid making that mistake again. :)

Hugo

p5pRT commented 7 years ago

From @hvds

disfunc

hvds commented 3 years ago

I've rebased this on latest blead, I still think we should consider doing this.

@nwc10, you might find this one interesting.

richardleach commented 3 years ago

Since many platforms now pass (the first n) function arguments in registers rather than via the stack, and all OPs are going to want the value of op/PL_op, I queried on IRC whether it might be beneficial to change the PP calling convention to: op->op_ppaddr(aTHX_, op) .

Otherwise - AFAICT - we have (in the runloop) the OP in a register but save it to (effectively) L1 cache, call the function, which then loads the OP back out of L1 into a register.

I'm not sure what the main compilers would produce if that was instead: op->op_ppaddr(aTHX_, PL_op)

(Might even push it to op->op_ppaddr(aTHX_, PL_op, PL_stack_sp), since almost every OP needs the SP right away.)

hvds commented 3 years ago

@richardleach this has gone a bit over 4 years so far without a hint that anyone feels empowered/interested to consider this, so I'd be inclined to wait and see if that changes before considering any extension to it.