performance bug: perl Thread::Queue is 20x slower than Unix pipe #13196

Open p5pRT opened 11 years ago

p5pRT commented 11 years ago

Migrated from rt.perl.org#119445 (status was 'open')

Searchable as RT119445$

p5pRT commented 11 years ago

From johnh@isi.edu

Created by johnh@isi.edu

This is a bug report for perl from johnh@isi.edu, generated with the help of perlbug 1.39 running under perl 5.16.3.

-----------------------------------------------------------------

Why is Thread::Queue *so* slow?

I understand it has to do locking and be careful about data structures, but it seems like it is about 20x slower than opening up a Unix pipe, printing to that, reading it back and parsing the result.

Thread::Queue is correct, but I suggest that 20x slower is a performance bug.

One would think that IPC through memory would be at least as fast as a pipe through the kernel, and ideally it should be faster.

Here's timing of a test program that sends 500k integers between two threads, using Thread::Queue or pipe(2).

$ ./thread_ipc_perf.pl -m queue
benchmark took 14 wallclock secs (14.71 usr + 2.51 sys = 17.22 CPU) @ 0.06/s (n=1)

$ ./thread_ipc_perf.pl -m pipe
benchmark took 0 wallclock secs ( 0.59 usr + 0.00 sys = 0.59 CPU) @ 1.69/s (n=1)

Here's a larger run (1M integers) with the same kind of results.

$ ./thread_ipc_perf.pl -N 1000000 -m queue
benchmark took 30 wallclock secs (32.69 usr + 6.06 sys = 38.75 CPU) @ 0.03/s (n=1)

$ ./thread_ipc_perf.pl -N 1000000 -m pipe
benchmark took 1 wallclock secs ( 1.23 usr + 0.00 sys = 1.23 CPU) @ 0.81/s (n=1)

Source code for the above simple benchmark is at http://www.isi.edu/~johnh/SOFTWARE/FSDB/thread_ipc_perf.pl.txt
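For readers who cannot fetch the linked script, the comparison can be approximated with a much smaller program. This is a minimal sketch and not the benchmark linked above: the helper names `sum_via_queue` and `sum_via_pipe`, the `undef` and `END` end-of-stream markers, and the default of 20,000 integers are all assumptions made for illustration. It needs a perl built with ithreads.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use IO::Handle;
use Time::HiRes qw(time);

# Hypothetical helper: push 1..$n through a Thread::Queue to a consumer
# thread and return the sum the consumer computed.
sub sum_via_queue {
    my ($n) = @_;
    my $queue = Thread::Queue->new;
    my $consumer = threads->create(sub {
        my $sum = 0;
        # An undef item is used as the end-of-stream marker.
        while (defined(my $item = $queue->dequeue)) {
            $sum += $item;
        }
        return $sum;
    });
    $queue->enqueue($_) for 1 .. $n;
    $queue->enqueue(undef);
    return $consumer->join;
}

# Hypothetical helper: print 1..$n as text lines into a pipe read by a
# consumer thread, which parses each line back into a number.
sub sum_via_pipe {
    my ($n) = @_;
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $consumer = threads->create(sub {
        my $sum = 0;
        while (my $line = <$reader>) {
            chomp $line;
            last if $line eq 'END';    # explicit end-of-stream marker
            $sum += $line;
        }
        return $sum;
    });
    print {$writer} "$_\n" for 1 .. $n;
    print {$writer} "END\n";
    $writer->flush;
    my $sum = $consumer->join;
    close $writer;
    close $reader;
    return $sum;
}

# Time both transports on the same workload.
my $n = shift(@ARGV) // 20_000;
for my $case ([queue => \&sum_via_queue], [pipe => \&sum_via_pipe]) {
    my ($name, $code) = @$case;
    my $start = time;
    my $sum   = $code->($n);
    printf "%-5s sum=%d elapsed=%.3fs\n", $name, $sum, time - $start;
}
```

Passing a larger count as the first argument (for example `500000`) brings the workload closer to the runs reported here; absolute timings will depend on the perl build and machine.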

We can quibble over the exact multiplier (maybe it's only 15x slower), but it's *really* slow.
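As a quick check on the multiplier, the CPU totals from the four runs above can be divided directly. This is only arithmetic on the quoted figures; the hash layout and the `slowdown` helper are invented for illustration. CPU seconds are used because the first pipe run reports 0 wallclock seconds.

```perl
use strict;
use warnings;

# CPU seconds copied from the benchmark output quoted in this report.
my %cpu_seconds = (
    '500k' => { queue => 17.22, pipe => 0.59 },
    '1M'   => { queue => 38.75, pipe => 1.23 },
);

# Ratio of queue CPU time to pipe CPU time for one run.
sub slowdown {
    my ($run) = @_;
    return $cpu_seconds{$run}{queue} / $cpu_seconds{$run}{pipe};
}

for my $run (sort keys %cpu_seconds) {
    printf "%-4s integers: queue uses %.1fx the CPU time of pipe\n",
        $run, slowdown($run);
}
```

On these numbers the CPU-time ratio comes out near 29x for 500k integers and near 31.5x for 1M, so "20x" is, if anything, a conservative summary of this particular run.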

Any suggestions? I get similar results if I simplify Thread::Queue to bare minimum code.

To speculate, I'm thinking the cost is in making all IPC data shared. It would be great if one could have data that is sent over Thread::Queue that is copied, not shared.

Thanks for any suggestions,
-John Heidemann

Perl Info

```
Flags:
    category=library
    severity=medium
    module=Thread::Queue

Site configuration information for perl 5.16.3:

Configured by Red Hat, Inc. at Tue Jun 18 09:17:09 UTC 2013.

Summary of my perl5 (revision 5 version 16 subversion 3) configuration:

  Platform:
    osname=linux, osvers=2.6.32-358.2.1.el6.x86_64, archname=x86_64-linux-thread-multi
    uname='linux buildvm-05.phx2.fedoraproject.org 2.6.32-358.2.1.el6.x86_64 #1 smp wed feb 20 12:17:37 est 2013 x86_64 x86_64 x86_64 gnulinux '
    config_args='-des -Doptimize=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -Dccdlflags=-Wl,--enable-new-dtags -Dlddlflags=-shared -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -Wl,-z,relro -DDEBUGGING=-g -Dversion=5.16.3 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dprefix=/usr -Dvendorprefix=/usr -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl5 -Dsitearch=/usr/local/lib64/perl5 -Dprivlib=/usr/share/perl5 -Dvendorlib=/usr/share/perl5/vendor_perl -Darchlib=/usr/lib64/perl5 -Dvendorarch=/usr/lib64/perl5/vendor_perl -Darchname=x86_64-linux-thread-multi -Dlibpth=/usr/local/lib64 /lib64 /usr/lib64 -Duseshrplib -Dusethreads -Duseithreads -Dusedtrace=/usr/bin/dtrace -Duselargefiles -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3 ext=3pm -Duseperlio -Dinstallusrbinperl=n -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr -Dd_gethostent_r_proto -Ud_endhostent_r_proto -Ud_sethostent_r_proto -Ud_endprotoent_r_proto -Ud_setprotoent_r_proto -Ud_endservent_r_proto -Ud_setservent_r_proto -Dscriptdir=/usr/bin -Dusesitecustomize'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.8.1 20130603 (Red Hat 4.8.1-1)', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -fstack-protector'
    libpth=/usr/local/lib64 /lib64 /usr/lib64
    libs=-lresolv -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc -lgdbvm_compat
    perllibs=-lresolv -lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.17'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,--enable-new-dtags -Wl,-rpath,/usr/lib64/perl5/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -Wl,-z,relro '

Locally applied patches:

@INC for perl 5.16.3:
    /usr/local/lib64/perl5
    /usr/local/share/perl5
    /usr/lib64/perl5/vendor_perl
    /usr/share/perl5/vendor_perl
    /usr/lib64/perl5
    /usr/share/perl5
    .

Environment for perl 5.16.3:
    HOME=/home/johnh
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH=/usr/local/lib
    LOGDIR (unset)
    PATH=/bin:/usr/bin:/usr/local/sbin:/etc:/sbin:/usr/sbin
    PERL_BADLANG (unset)
    SHELL=/bin/bash
```

p5pRT commented 11 years ago

From @jkeenan

That's a lot of configuration options. While I don't doubt that you have a reason for all of them, I also doubt that many people are going to want to build a perl with all those options just for the purpose of testing your claim.

Would it be possible for you to try this again with the absolute minimum number of configuration options required to build a threaded perl which manifests the problem?

Thank you very much.
Jim Keenan

p5pRT commented 11 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 11 years ago

From @iabyn

On Sun, Aug 25, 2013 at 05:37:39PM -0700, James E Keenan via RT wrote:

> On Fri Aug 23 17:28:00 2013, johnh@isi.edu wrote:
>
> > Why is Thread::Queue *so* slow?
> >
> > I understand it has to do locking and be careful about data structures, but it seems like it is about 20x slower than opening up a Unix pipe, printing to that, reading it back and parsing the result.

Because it is nothing like a UNIX pipe.

A UNIX pipe takes a stream of bytes, and reads and writes chunks of them into a shared buffer.

A T::Q buffer takes a stream of perl "things", which might be objects or other such complex structures, and ensures that they are accessible by both the originating thread and any potential consumer thread. Migrating a perl "thing" across a thread boundary is considerably more complex than copying a byte across.

> > To speculate, I'm thinking the cost is in making all IPC data shared. It would be great if one could have data that is sent over Thread::Queue that is copied, not shared.

But T::Q is built upon a shared array, and is designed to handle shared data.
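The effect of the shared array can be observed from a single thread. The following is a small sketch, assuming a threaded perl with a Thread::Queue recent enough to clone enqueued references into shared storage; the `$record` structure is made up for illustration.

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

my $queue  = Thread::Queue->new;
my $record = { id => 1, tags => [ 'a', 'b' ] };    # ordinary, unshared data

$queue->enqueue($record);
my $item = $queue->dequeue;

# is_shared() returns a true value only for shared data.
printf "original is shared: %s\n", is_shared($record) ? 'yes' : 'no';
printf "dequeued is shared: %s\n", is_shared($item)   ? 'yes' : 'no';

# The queue handed back a clone, so a later change to the original
# is not visible through the dequeued item.
$record->{id} = 2;
printf "original id=%d, dequeued id=%d\n", $record->{id}, $item->{id};
```

One would expect the dequeued item to report as shared while the original does not, which is the per-item conversion cost being discussed here.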

I think the performance you are seeing is the performance I would expect, and that this is not a bug.

-- In England there is a special word which means the last sunshine of the summer. That word is "spring".

p5pRT commented 11 years ago

From johnh@isi.edu

On Mon, 26 Aug 2013 08:11:12 -0700, "Dave Mitchell via RT" wrote:

> I think the performance you are seeing is the performance I would expect, and that this is not a bug.

I understand that Thread::Queue and perl threads allow shared data, and that that's much more than a pipe.

My concern is that Thread::Queue also *forces* shared data, even when it's not required. If that sharing comes with a 20x performance hit, that should be clear.

From perlthrtut, the "Pipeline" model:

> The pipeline model divides up a task into a series of steps, and passes the results of one step on to the thread processing the next. Each thread does one thing to each piece of data and passes the results to the next thread in line.

For the pipeline model, one does not need repeated sharing, just a one-time hand-off. Each queue is FIFO with data touched by only one thread at a time. That's exactly what my particular application needs to do.

But one does not *want* sharing (for the pipeline model) if it's a 20x performance hit.

If the statement is that queues should require shared data and the corresponding performance hit, that's a design choice one could make. Then I'd suggest the bug becomes: perlthrtut should say "don't use Thread::Queue for the pipeline model if you expect high performance, roll your own IPC".

Alternatively, I'd love some mechanism to share data between threads that allows a one-time handoff (not repeated sharing) with pipe-like performance. One would *think* that shared memory should be able to be faster than round-tripping through a pipe (with perl parsing and kernel IO). It seems like a shame that perl is forcing full-on sharing since it's slow and not required (in this case).

-John

p5pRT commented 11 years ago

From johnh@isi.edu

On Sun, 25 Aug 2013 17:37:39 -0700, "James E Keenan via RT" wrote:

> That's a lot of configuration options. While I don't doubt that you have a reason for all of them, I also doubt that many people are going to want to build a perl with all those options just for the purpose of testing your claim.
>
> Would it be possible for you to try this again with the absolute minimum number of configuration options required to build a threaded perl which manifests the problem?
>
> Thank you very much.
> Jim Keenan

Thanks for the reply.

I don't build perl myself; those are the default configure options for Fedora Linux. (Presumably RHEL and its derivatives use similar builds.)

I can build perl if you really want, but let me suggest an alternative if you don't mind:

I provided source code to my benchmark program at:

http://www.isi.edu/~johnh/SOFTWARE/FSDB/thread_ipc_perf.pl.txt

and the two invocations that clearly show the difference on my platform:

$ ./thread_ipc_perf.pl -m queue
benchmark took 14 wallclock secs (14.71 usr + 2.51 sys = 17.22 CPU) @ 0.06/s (n=1)

$ ./thread_ipc_perf.pl -m pipe
benchmark took 0 wallclock secs ( 0.59 usr + 0.00 sys = 0.59 CPU) @ 1.69/s (n=1)

The benchmark is 293 lines long, but it's mostly POD documentation and boilerplate. Can I suggest you download the benchmark and try those two invocations ("./thread_ipc_perf.pl -m queue" and "./thread_ipc_perf.pl -m pipe") on whatever perl you prefer?

If some other platform or build has much different performance, I'll take this up with my OS provider.

-John

p5pRT commented 11 years ago

From @ikegami

How does Thread::Queue::Any compare?

p5pRT commented 11 years ago

From @lizmat

On Aug 26, 2013, at 5:58 PM, John Heidemann <johnh@isi.edu> wrote:

> I understand that Thread::Queue and perl threads allow shared data, and that that's much more than a pipe.
>
> My concern is that Thread::Queue also *forces* shared data, even when it's not required. If that sharing comes with a 20x performance hit, that should be clear.

You should realize that the perl ithreads implementation does *not* have any real shared variables at all. Each thread has its own *copy* of the world.

Variables with the :shared trait are simply tied() variables to some internal logic that will STORE values in yet another, hidden thread, and will FETCH them from that hidden thread again when needed. There is some locking involved there, I would assume. But I think the biggest bottleneck is really that the slow tie() interface is used for shared variables.
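The per-access overhead described here can be measured without starting a second thread. This is a rough sketch, assuming a threaded perl; the array size, the `sum_array` helper, and the one-CPU-second run time are arbitrary choices for illustration, and the ratio will vary between perl versions.

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Benchmark qw(cmpthese);

my @plain = (1 .. 1000);
my @shared :shared = (1 .. 1000);

# Sum an array through a reference; every element read on the shared
# array goes through the extra shared-variable machinery.
sub sum_array {
    my ($array) = @_;
    my $sum = 0;
    $sum += $_ for @$array;
    return $sum;
}

# Run each case for about one CPU second and print a comparison table.
cmpthese(-1, {
    plain  => sub { sum_array(\@plain) },
    shared => sub { sum_array(\@shared) },
});
```

Both subs compute the same sum, so any difference in the reported rates is the cost of reading shared elements rather than a difference in work done.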

The forks module does not do this differently. However, instead of making a copy of the world each time a thread is started, the forks module just does a fork() and lets the OS take care of any Copy-On-Write needed. This makes starting a thread *much* faster, especially if you have something like Moose and its dependencies loaded. Reading and writing shared variables are done by using pipes, Unix pipes if possible.

Thread::Queue::Any is simply a wrapper around Thread::Queue, and thus suffers from the same performance issues.

In other words: don't use Perl 5's ithreads for performance, use it for asynchronous jobs only where not having to wait for something slow,

Liz

p5pRT commented 11 years ago

From @Leont

On Mon, Aug 26, 2013 at 5:58 PM, John Heidemann <johnh@isi.edu> wrote:

> If the statement is that queues should require shared data and the corresponding performance hit, that's a design choice one could make. Then I'd suggest the bug becomes: perlthrtut should say "don't use Thread::Queue for the pipeline model if you expect high performance, roll your own IPC".

Actually I did write a queue implementation for threads::lite that should be a lot faster for simple data structures, but I never released it as a separate module that could be used with threads.pm.

> Alternatively, I'd love some mechanism to share data between threads that allows a one-time handoff (not repeated sharing) with pipe-like performance. One would *think* that shared memory should be able to be faster than round-tripping through a pipe (with perl parsing and kernel IO). It seems like a shame that perl is forcing full-on sharing since it's slow and not required (in this case).

I don't think that would be faster than a queue. Given perl's memory model (memory has to be owned by a thread, shared memory has to be handled manually), a copy or two is necessary anyway.

Leon

p5pRT commented 11 years ago

From @nwc10

On Mon, Aug 26, 2013 at 08:58:14AM -0700, John Heidemann wrote:

> My concern is that Thread::Queue also *forces* shared data, even when it's not required. If that sharing comes with a 20x performance hit, that should be clear.

Yes, I agree that that's a valid concern, and we could document that better.

As someone rather too close to the code, it's not easy to pull back far enough to work out where someone reading the documentation for the first time would have expected to have found such a warning.

Do you have a suggestion for where we should document this, such that you would have read it had it been there? (Even better if you can suggest a suitable change)

> Alternatively, I'd love some mechanism to share data between threads that allows a one-time handoff (not repeated sharing) with pipe-like performance. One would *think* that shared memory should be able to be faster than round-tripping through a pipe (with perl parsing and kernel IO). It seems like a shame that perl is forcing full-on sharing since it's slow and not required (in this case).

Agree, I'd love this too. It would permit a lot of effective higher level concurrency designs to work*. But sadly I don't believe that Perl 5 will ever be able to provide a performant hand-off mechanism. The internals assume all over that it's safe for any logical read to actually be a write behind the scenes (making it awkward to provide any sort of read-only view of another thread's data), and all interpreter data structures are implicitly tied to the interpreter that allocated them, which would take a massive amount of refactoring to attempt to untangle.

I don't think that this is particularly a Perl problem. I'm not aware of any comparable C-based dynamic language that has managed to retrofit true concurrency. CPython still has a GIL (and Unladen Swallow failed to deliver on its design to remove that), and my understanding is that Ruby (MRI/YARV) still single-threads its interpreter, and PHP doesn't even offer threading. If we had a design to steal, we'd steal it. :-/

Nicholas Clark

* such as the rather nice constructions that Jonathan Worthington demonstrated for Perl 6: http://jnthn.net/papers/2013-yapceu-conc.pdf (Video not yet online)

p5pRT commented 11 years ago

From johnh@isi.edu

On Tue, 27 Aug 2013 11:18:57 +0100, Nicholas Clark wrote:

> Do you have a suggestion for where we should document this, such that you would have read it had it been there? (Even better if you can suggest a suitable change)

A proposed patch to perlthrtut is attached at the end of this message.

> I don't think that this is particularly a Perl problem. I'm not aware of any comparable C-based dynamic language has managed to retrofit true concurrency. CPython still has a GIL (and Unladen Swallow failed to deliver on its design to remove that), and my understanding is that Ruby (MRI/YARV) still single-threads its interpreter, and PHP doesn't even offer threading. If we had a design to steal, we'd steal it. :-/

I don't know anything about C-level internals of perl.

I agree these are inherent in *shared* variables independent of language.

It's too bad there's no way to move data between two threads without making the data shared (other than the move). A one-time copy from thread A to B. C-only programs have done this for ages (see for example, "The Duality of Memory and Communication in the Implementation of a Multiprocessor Operating System" by Young et al, ACM SOSP 1987).

What I'll do for now is to get this effect by printing it to a pipe and reading it back in through the other end, but boy what a lot of work on the perl-side that could be hidden inside the C, both cleaner and hopefully faster.
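A one-time hand-off of structured records over a pipe might look roughly like the following. This is a sketch of the general workaround, not the code actually used here: the `hand_off` helper, the one-JSON-document-per-line framing, the blank-line terminator, and the `value` field are all assumptions chosen for illustration. It relies only on core modules and a threaded perl.

```perl
use strict;
use warnings;
use threads;
use IO::Handle;
use JSON::PP qw(encode_json decode_json);

# Hypothetical helper: hand each record to a worker thread exactly once by
# writing one JSON document per line; returns what the worker computed.
sub hand_off {
    my (@records) = @_;
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $worker = threads->create(sub {
        my $total = 0;
        while (my $line = <$reader>) {
            chomp $line;
            last if $line eq '';                 # blank line ends the stream
            my $record = decode_json($line);     # private, unshared copy
            $total += $record->{value};
        }
        return $total;
    });
    print {$writer} encode_json($_), "\n" for @records;
    print {$writer} "\n";
    $writer->flush;
    my $total = $worker->join;
    close $writer;
    close $reader;
    return $total;
}

my @records = map { +{ id => $_, value => $_ * 10 } } 1 .. 5;
print "worker total: ", hand_off(@records), "\n";
```

Each record is serialized once by the sender and parsed once by the receiver, so the receiving thread ends up with ordinary unshared data. The serialization and parsing are the "lot of work on the perl-side" mentioned above.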

-John

Inline Patch

```diff
--- perlthrtut.pod-	2013-08-27 08:47:16.347167972 -0700
+++ perlthrtut.pod	2013-08-27 08:53:26.159772710 -0700
@@ -465,6 +465,13 @@
 data inconsistency and race conditions. Note that Perl will protect its
 internals from your race conditions, but it won't protect you from you.
 
+=head2 Thread Pitfalls: Performance
+
+Shared data is and locking expensive, slowing down access.
+As of perl 5.18, one should expect sharing data between threads
+with tools such as L to be about 15-20x slower
+than copying the data through L.
+
 =head1 Synchronization and control
 
 Perl provides a number of mechanisms to coordinate the interactions
```

p5pRT commented 11 years ago

From @tamias

On Tue\, Aug 27\, 2013 at 05:15:09PM -0700\, John Heidemann wrote:

--- perlthrtut.pod- 2013-08-27 08:47:16.347167972 -0700 +++ perlthrtut.pod 2013-08-27 08:53:26.159772710 -0700 @@ -465\,6 +465\,13 @@ data inconsistency and race conditions. Note that Perl will protect its internals from your race conditions\, but it won't protect you from you.

+=head2 Thread Pitfalls: Performance + +Shared data is and locking expensive\, slowing down access.

I think this sentence got a bit mixed up.

Ronald

p5pRT commented 11 years ago

From @nwc10

On Tue\, Aug 27\, 2013 at 05:15:09PM -0700\, John Heidemann wrote:

On Tue\, 27 Aug 2013 11:18:57 +0100\, Nicholas Clark wrote:

Do you have a suggestion for where we should document this\, such that you would have read it had it been there? (Even better if you can suggest a suitable change)

A proposed patch to perlthrtut is attached at the end of this message.

Thanks

It's too bad there's no way to move data between two threads without making the data shared (other than the move). A one-time copy from thread A to B. C-only programs have done this for ages (see for example\, "The Duality of Memory and Communication in the Implementation of a Multiprocessor Operating System" by Young et al\, ACM SOSP 1987).

Agree that it's frustrating.

That paper seems to predate Perl 1 by about 5 weeks\, but I don't think that the complexity trade off to facilitate concurrency became a concern of mainstream development until some point after Perl 5 shipped in 1994. By which time\, of course\, it's too late to add it in from the start. (And the Perl 5 codebase is a rewrite of Perl 4\, which traces history all the way back to Perl 1\, so really it needed to be in by December 1987 to be helpful)

I feel that it's the same fundamental problem as attempting to retrofit Unicode support. Bolting it on later will never work completely - it has to be in the design from the start.

---------------------------------------------------------------------- --- perlthrtut.pod- 2013-08-27 08:47:16.347167972 -0700 +++ perlthrtut.pod 2013-08-27 08:53:26.159772710 -0700 @@ -465\,6 +465\,13 @@ data inconsistency and race conditions. Note that Perl will protect its internals from your race conditions\, but it won't protect you from you.

+=head2 Thread Pitfalls: Performance + +Shared data is and locking expensive\, slowing down access. +As of perl 5.18\, one should expect sharing data between threads +with tools such as L\<Thread::Queue> to be about 15-20x slower +than copying the data through L\<pipe(2)>. + =head1 Synchronization and control

Perl provides a number of mechanisms to coordinate the interactions

On Wed\, Aug 28\, 2013 at 11:30:58PM -0400\, Ronald J Kimball wrote:

I think this sentence got a bit mixed up.

I think also that it should mention your insight about what's not obvious about performance - lack of handoff. I don't think that the performance has changed much historically\, and I don't foresee a way to change it in the future\, so I think that having a version number in there isn't that useful. So this instead?

Shared data and locking are expensive\, slowing down access. Perl 5 has no way of passing ownership of data between threads\, so all thread operations involve data becoming shared. One should expect sharing data between threads with tools such as L\<Thread::Queue> to be about 15-20x slower than copying the data through L\<pipe(2)>.

If in the future someone does radically improve thread performance\, then I'd expect them to revisit the documentation to update it with new figures (and publicise their success).
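For readers who want to see what the "shared" side of that comparison looks like, here is a minimal sketch, assuming a perl built with ithreads. The helper name `sum_via_queue` and the `undef` end-of-data sentinel are choices made for this illustration only.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Hypothetical helper: a producer thread enqueues integers one at a time and
# the calling thread dequeues them. Every item passes through shared storage.
sub sum_via_queue {
    my (@values) = @_;

    my $queue = Thread::Queue->new;

    my $producer = threads->create(sub {
        $queue->enqueue($_) for @values;
        $queue->enqueue(undef);              # sentinel: no more data
        return;
    });

    my $sum = 0;
    # dequeue blocks until an item is available; undef ends the loop.
    while (defined(my $item = $queue->dequeue)) {
        $sum += $item;
    }
    $producer->join;
    return $sum;
}

print sum_via_queue(1 .. 1000), "\n";
```

Each `enqueue` and `dequeue` takes a lock on the queue and moves the value through shared storage, which is where the cost described above comes from. Wrapping the call in `Benchmark::timeit` is one way to measure it on a given machine.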

Nicholas Clark

p5pRT commented 11 years ago

From @Leont

On Tue\, Aug 27\, 2013 at 12:11 PM\, Leon Timmermans <fawaka@gmail.com> wrote:

Actually I did write a queue implementation for threads::lite that should be a lot faster for simple data structures\, but I never released it as a separate module that could be used with threads.pm.

You can find it on GitHub at https://github.com/Leont/thread-channel; it will probably be released to CPAN as soon as I've written tests for it. I've created a benchmark based on your own\, and it's about 30% slower than pipes for simple strings\, but unlike pipes it can also handle complex data structures.

Leon

p5pRT commented 11 years ago

From johnh@isi.edu

On Fri\, 30 Aug 2013 20:27:08 +0200\, Leon Timmermans wrote:

On Tue\, Aug 27\, 2013 at 12:11 PM\, Leon Timmermans <fawaka@gmail.com> wrote:

Actually I did write a queue implementation for threads::lite that should be a lot faster for simple data structures\, but I never released it as a separate module that could be used with threads.pm.

You can find it on GitHub at https://github.com/Leont/thread-channel; it will probably be released to CPAN as soon as I've written tests for it. I've created a benchmark based on your own\, and it's about 30% slower than pipes for simple strings\, but unlike pipes it can also handle complex data structures.

Leon

That sounds great. Should it be Thread::Queue::Fast or Thread::Queue::Nonshared?

-John